argilla-io / argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
https://docs.argilla.io
Apache License 2.0

[BUG-python/deployment] prepare_for_training('setfit') silently removes labels with no records from binarized vectors #4680

Closed ravenscroftj closed 2 months ago

ravenscroftj commented 6 months ago

Describe the bug

After collecting data in a FeedbackDataset using a MultiLabelQuestion, running prepare_for_training with setfit produces binarized label vectors whose positions no longer correspond to each label's index in the question's labels property: labels that never appear in any response are silently dropped from the vector.

If we have 4 labels a, b, c, d and, during multi-label annotation, users apply a, b and d but c never receives any annotations, we end up in a situation where the system generates a binarized_label of [1,1,1] for a record labelled a, b, d, where the final 1 at y[2] presumably refers to d rather than c.

This means that once the model is trained it can be difficult or impossible to map its outputs back to the correct labels.

Stacktrace and Code to create the bug


import argilla as rg
from argilla.feedback import TrainingTask

testds = rg.FeedbackDataset(
    fields=[rg.TextField(name='text')],
    questions=[rg.MultiLabelQuestion(name='label', labels=['a', 'b', 'c', 'd'])],
)

# note: label 'c' is never used in any of the responses below
testds.add_records([
    rg.FeedbackRecord(fields={'text': 'hello world 1!'}, responses=[rg.ResponseSchema(values={'label': rg.ValueSchema(value=['a', 'b'])})]),
    rg.FeedbackRecord(fields={'text': 'hello world 2!'}, responses=[rg.ResponseSchema(values={'label': rg.ValueSchema(value=['b'])})]),
    rg.FeedbackRecord(fields={'text': 'hello world 3!'}, responses=[rg.ResponseSchema(values={'label': rg.ValueSchema(value=['d'])})]),
    rg.FeedbackRecord(fields={'text': 'hello world 4!'}, responses=[rg.ResponseSchema(values={'label': rg.ValueSchema(value=['d', 'b', 'a'])})]),
])

output = testds.prepare_for_training(
    'setfit',
    task=TrainingTask.for_text_classification(
        text=testds.field_by_name('text'),
        label=testds.question_by_name('label'),
    ),
)
output.to_pandas()

Should show something like this:

[screenshot of output.to_pandas(): the last row's label column includes index 3, while its binarized label only has three positions]

Obviously the 4th row here is nonsensical: the label column shows 3 as one of the label indices, but there is no index 3 in the binarized label - presumably the third column of the binarized vector actually refers to label d rather than c?
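A rough way to confirm the implicit mapping is to list which labels actually occur in the responses; the binarized vectors only seem to have that many positions. This is just a sketch against the reproduction dataset above (the binarized_label column name and the sorted ordering are assumptions on my part):

df = output.to_pandas()

# labels that occur in at least one response (sorted; the binarized columns appear to follow this order)
used_labels = sorted({lbl
                      for rec in testds.records
                      for resp in rec.responses
                      for lbl in resp.values['label'].value})
print(used_labels)                     # ['a', 'b', 'd'] - 'c' is gone
print(df['binarized_label'].iloc[-1])  # [1, 1, 1] - only three positions, so index 2 is presumably 'd'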

Expected behavior

I think this is a complicated one that may need a little discussion. At a minimum I would expect a warning along the lines of "Hey, did you know that label c is not used?", and I would expect the matrix to keep the same number of dimensions even if one of them is always 0. I don't see a nicer way to handle this unless the "new" mapping of offsets to labels can somehow be passed back to the caller.
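For what it's worth, scikit-learn's MultiLabelBinarizer behaves the way I would expect when the full class list is passed explicitly: every label keeps its column even if it never occurs. A sketch for illustration only, not the argilla internals:

from sklearn.preprocessing import MultiLabelBinarizer

all_labels = ['a', 'b', 'c', 'd']   # i.e. the question's labels property
mlb = MultiLabelBinarizer(classes=all_labels)
y = mlb.fit_transform([['a', 'b'], ['b'], ['d'], ['d', 'b', 'a']])
print(mlb.classes_)  # ['a' 'b' 'c' 'd'] - 'c' keeps its column even though it is never used
print(y)
# [[1 1 0 0]
#  [0 1 0 0]
#  [0 0 0 1]
#  [1 1 0 1]]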

Environment:

Additional context

I am currently working around this by manipulating the labels property in the MultiLabelQuestion before training like so (it is pretty yucky but it works):

import itertools
from collections import Counter

# rds is the remote FeedbackDataset (connected elsewhere); pull it locally
local_ds = rds.pull()
question = local_ds.question_by_name('label')

# count how often each label actually occurs in the responses
ds = local_ds.format_as('datasets')
ds_df = ds.to_pandas()
label_count = Counter(itertools.chain(*ds_df['label'].apply(lambda x: x[0]['value'])))

# copy the question and keep only the labels that are actually used
new_q = question.copy()
new_q.labels = sorted(label_count.keys())

len(new_q.labels)
# gives 16

len(local_ds.question_by_name('label').labels)
# gives 18

setfit_ds = local_ds.prepare_for_training(
    'setfit',
    TrainingTask.for_text_classification(text=local_ds.field_by_name('text'), label=new_q),
    train_size=0.7,
    seed=42,
)
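Whatever shape the fix takes, it also helps to keep an explicit index-to-label mapping alongside the trained model so predictions can be decoded later; a minimal sketch based on the pruned question above (pred_vector is just a hypothetical SetFit output):

# persist the mapping that was in force at training time
id2label = dict(enumerate(new_q.labels))
label2id = {lbl: i for i, lbl in id2label.items()}
# a multi-label prediction vector can then be decoded with:
# predicted = [id2label[i] for i, flag in enumerate(pred_vector) if flag]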
github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 90 days with no activity.

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 30 days since being marked as stale.