Describe the bug

After collecting data in a FeedbackDataset using a MultiLabelQuestion, running prepare_for_training with setfit generates unreliable binarized label matrices: a column's offset in the matrix no longer corresponds to the label's position in the dataset's labels property.
If we have 4 labels a, b, c, d, and during multi-label annotation users select a, b, and d but c never receives any labels, we end up in a situation where the system generates a binarized_label of [1, 1, 1], where the final 1 at index y[2] presumably refers to d.
This means that once the model is trained it can be difficult or impossible to map its outputs back to the correct labels.
Stacktrace and Code to create the bug
```python
import argilla as rg
from argilla.feedback import TrainingTask

# Four labels are defined, but 'c' never appears in any response below.
testds = rg.FeedbackDataset(
    fields=[rg.TextField(name='text')],
    questions=[rg.MultiLabelQuestion(name='label', labels=['a', 'b', 'c', 'd'])],
)
testds.add_records([
    rg.FeedbackRecord(fields={'text': 'hello world 1!'}, responses=[rg.ResponseSchema(values={'label': rg.ValueSchema(value=['a', 'b'])})]),
    rg.FeedbackRecord(fields={'text': 'hello world 2!'}, responses=[rg.ResponseSchema(values={'label': rg.ValueSchema(value=['b'])})]),
    rg.FeedbackRecord(fields={'text': 'hello world 3!'}, responses=[rg.ResponseSchema(values={'label': rg.ValueSchema(value=['d'])})]),
    rg.FeedbackRecord(fields={'text': 'hello world 4!'}, responses=[rg.ResponseSchema(values={'label': rg.ValueSchema(value=['d', 'b', 'a'])})]),
])
output = testds.prepare_for_training(
    'setfit',
    task=TrainingTask.for_text_classification(
        text=testds.field_by_name('text'),
        label=testds.question_by_name('label'),
    ),
)
output.to_pandas()
```
Should show something like this:
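(Reconstructed from the description below, since the original screenshot is not included; exact column names and formatting may differ. The label column uses the original indices a=0, b=1, c=2, d=3.)

```
             text      label binarized_label
0  hello world 1!     [0, 1]       [1, 1, 0]
1  hello world 2!        [1]       [0, 1, 0]
2  hello world 3!        [3]       [0, 0, 1]
3  hello world 4!  [0, 1, 3]       [1, 1, 1]
```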
Obviously the 4th row here is nonsensical - the label column shows 3 as one of the labels but there is no index 3 in the binarized label - presumably the third column actually refers to label d?
Expected behavior
I think this is a complicated one that may need a little discussion. In my mind I would at least expect a warning to say "Hey, did you know that label c is not used?", and I would keep the same number of dimensions in the matrix even if one of them is always 0 (see the sketch below). I don't think there is a nicer way to do this unless you can somehow pass back the "new" mapping of offsets to labels.
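For comparison, here is a minimal sketch of the behaviour I would expect, using sklearn's MultiLabelBinarizer with the full label set pinned via its classes argument (I have not checked what the trainer actually uses internally, so this is only an illustration of the desired output shape):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Pin the column order to the full label set so the unused label 'c'
# still gets its own, all-zero column.
mlb = MultiLabelBinarizer(classes=['a', 'b', 'c', 'd'])
y = mlb.fit_transform([['a', 'b'], ['b'], ['d'], ['d', 'b', 'a']])
print(y)
# [[1 1 0 0]
#  [0 1 0 0]
#  [0 0 0 1]
#  [1 1 0 1]]
```

This reserves column 2 for c, so index 3 always means d regardless of which labels were actually annotated.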
Environment:
Client version 1.25.0
Additional context
I am currently working around this by manipulating the labels property in the MultiLabelQuestion before training, like so (it is pretty yucky, but it works):
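(The exact snippet is not reproduced here; below is a minimal sketch of that kind of workaround, assuming the client-side labels attribute is assignable.)

```python
# Hypothetical sketch: shrink the question's labels to only those that
# actually occur in the collected responses, so every column in the
# binarized matrix maps to a known label.
question = testds.question_by_name('label')
used = set()
for record in testds.records:
    for response in record.responses:
        used.update(response.values['label'].value)
# Preserve the original label ordering from the question definition.
question.labels = [label for label in question.labels if label in used]
```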