argilla-io / argilla-plugins

🔌 Open-source plugins with practical features for Argilla using listeners.
Apache License 2.0

use `classy-classification` for active learning #13

Open davidberenstein1957 opened 1 year ago

davidberenstein1957 commented 1 year ago

Ideally, we would be able to host active learners easily, through a more abstract and intuitive process.

MVP

from argilla_plugins.active_learning import classy_classification_learner

# Proposed interface: the three numeric arguments are ints; their values are left open here.
learner = classy_classification_learner(name="dataset", model="bert", validation_threshold=..., min_n_samples=..., max_n_samples=...)
learner.start()

Stretch

Filtering variables such as a query could be added to limit the sync. Settings such as a threshold could be added to pre-annotate and validate certain data.

davidberenstein1957 commented 1 year ago

Only update predictions that are marked as predicted by classy-classification.

dvsrepo commented 1 year ago

Really excited to see this happening!

Regarding the max-number-examples, I've been thinking about some scenarios related to continuous training and monitoring:

When we reach this limit, I understand we stop training, but we keep updating new records with the predictions of the model, right? This is the scenario where the user can send more data to the dataset and we use the model in the loop to label new data.

In the above scenario, if I have already reached the limit and the users annotate more data, will we retrain the model with the newest annotations? I think you mentioned this acting as a LIFO queue? In my mind it makes total sense: we shift the few-shot training set towards more recent examples.
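To make the capped training set concrete, here is a minimal sketch of one way to read the LIFO idea: once the limit is reached, only the most recent annotations are kept, so the few-shot set drifts towards newer examples. The function name `select_training_set`, the `mode` argument and the `(text, label)` tuple shape are assumptions for illustration, not the plugin's actual API.

```python
from collections import deque
from itertools import islice
from typing import Iterable, List, Tuple

Annotation = Tuple[str, str]  # (text, label); hypothetical shape for this sketch


def select_training_set(
    annotations: Iterable[Annotation],
    max_n_samples: int,
    mode: str = "lifo",
) -> List[Annotation]:
    """Cap the few-shot training set at max_n_samples.

    `annotations` is assumed to be ordered from oldest to newest.
    "lifo" keeps the most recent annotations (assumed reading of the thread),
    "fifo" keeps the earliest ones.
    """
    if max_n_samples <= 0:
        raise ValueError("max_n_samples must be positive")
    if mode == "lifo":
        # A bounded deque drops the oldest items as newer ones arrive.
        return list(deque(annotations, maxlen=max_n_samples))
    if mode == "fifo":
        return list(islice(annotations, max_n_samples))
    raise ValueError(f"unknown mode: {mode!r}")
```

With this reading, annotating more data after the limit is reached simply replaces the oldest training examples on the next retrain.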

dvsrepo commented 1 year ago

Not to overcomplicate things, of course, just some quick thoughts about how powerful this could get!

davidberenstein1957 commented 1 year ago

@dvsrepo The plugin currently works by getting all annotated records, selecting the FIFO/LIFO annotations and creating a training dataset for classy classification. This dataset, with index i, is then applied at every interval t to a batch of x records that have no annotation and are queried where metadata.idx != i. A record is updated if the prediction score is certain enough and if the previous prediction is allowed to be overwritten.

This approach ensures the plugin keeps updating predictions in the background whenever new data is annotated, while each inference pass stays short enough that new knowledge is applied quickly.
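The update pass described above could be sketched roughly as follows. This is a plain-Python illustration, not the plugin's code: the `Record` class, the `update_batch` function, the `predict` callable and the rule that a processed record is stamped with the training-set index even when its prediction is not written are all assumptions made for the example. The overwrite rule also folds in the earlier note about only touching predictions made by classy-classification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

AGENT = "classy-classification"  # assumed agent name for this sketch


@dataclass
class Record:
    """Hypothetical stand-in for a dataset record."""
    text: str
    annotation: Optional[str] = None
    prediction: Optional[Tuple[str, float]] = None  # (label, score)
    prediction_agent: Optional[str] = None
    metadata: Dict[str, int] = field(default_factory=dict)


def update_batch(
    records: List[Record],
    predict: Callable[[str], Tuple[str, float]],
    train_idx: int,
    batch_size: int,
    threshold: float,
    overwrite: bool = False,
) -> int:
    """Run one pass (one interval t) and return the number of updated records."""
    # Select up to batch_size unannotated records not yet seen by training set i.
    pending = [
        r for r in records
        if r.annotation is None and r.metadata.get("idx") != train_idx
    ][:batch_size]

    updated = 0
    for record in pending:
        label, score = predict(record.text)
        # Assumption: stamp the record so it is not queried again for this index.
        record.metadata["idx"] = train_idx
        may_write = (
            record.prediction is None
            or record.prediction_agent == AGENT
            or overwrite
        )
        if score >= threshold and may_write:
            record.prediction = (label, score)
            record.prediction_agent = AGENT
            updated += 1
    return updated
```

Calling `update_batch` on a timer with a small `batch_size` keeps each pass cheap, and bumping `train_idx` after a retrain makes every unannotated record eligible again.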

dvsrepo commented 1 year ago

Looks awesome, looking forward to trying it out

davidberenstein1957 commented 1 year ago

Yes, me too. I need to write tests for edge cases, but I want to do these formal, structural things after reviewing the entire concept based on the PyData Bordeaux input.