Closed: felixvor closed this issue 3 years ago
Hi @DieseKartoffel The data format looks good to me (it is the same as in our multilabel classification example, https://github.com/deepset-ai/FARM/blob/master/examples/doc_classification_multilabel.py, except for the additional `text_b` input). Are you providing `label_list = ["0", "1", "2"]` to the `TextClassificationProcessor`? I will try to reproduce the error message on my side.
Hey Julian, thank you for looking into this. Yes, I tried to stay close to your examples for debugging :-) I made sure to use the correct labels; if a label from the dataset is not part of `label_list`, FARM already outputs a useful error and does not start the training. I also tried it with different labels and prepared corresponding datasets. For example, with batch size 10 and `label_list=["a","b","c","d","e"]` I got `ValueError: Expected input batch_size (10) to match target batch_size (50)`.
So far, I could not replicate the error. Could you maybe share some code and a small data example? What I did so far is the following: I adjusted the `basic_texts` variable to contain pairs of texts by copying the `text` value to `text_b`, and I added a `text_b` column to the `train.tsv` and `val.tsv` datasets that contains the same text as the `text` column:
```python
basic_texts = [
    {"text": ("You ... ...", "You ... ...")},
    {"text": ("What a lovely world", "What a lovely world")},
]
```
The output that I get is the following:
```
[{'task': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': "('You ... ...', 'You ... ...')", 'label': "['toxic', 'obscene', 'insult']", 'probability': array([0.93692017, 0.19396962, 0.8908834 , 0.10999262, 0.8351795 ,
       0.2840815 ], dtype=float32)}, {'start': None, 'end': None, 'context': "('What a lovely world', 'What a lovely world')", 'label': '[]', 'probability': array([0.371408  , 0.00837683, 0.1528986 , 0.00711144, 0.16077891,
       0.01845325], dtype=float32)}]}]
```
I was able to get it working by reproducing your approach step by step. I then compared the code to my project and found that I was using the wrong prediction head... Very easy solution which I should have spotted from the start... Thank you very much for your help!
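The prediction-head mix-up explains the shape of the output above. A minimal, framework-free sketch of the behavioural difference: a single-label head picks the one argmax class, while a multilabel head applies a per-label threshold and can return several labels (or none). The label names and the 0.5 cut-off are assumptions here (the six labels match the standard Jigsaw toxic-comment label order, which this output appears to use); they are not taken from the thread.

```python
# Probabilities copied from the first prediction in the output above.
probs = [0.93692017, 0.19396962, 0.8908834, 0.10999262, 0.8351795, 0.2840815]

# Assumed label order (standard Jigsaw toxic-comment labels, not stated in the thread).
label_names = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Single-label style: exactly one prediction, the argmax class.
single = label_names[max(range(len(probs)), key=probs.__getitem__)]

# Multilabel style: every label whose probability clears a threshold (0.5 assumed).
multi = [name for name, p in zip(label_names, probs) if p > 0.5]

print(single)  # 'toxic'
print(multi)   # ['toxic', 'obscene', 'insult']
```

With the multilabel head the second example ("What a lovely world") clears the threshold for no label at all, which is why its predicted label list is empty.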
Question: Not sure if this is a bug or if I am doing something wrong here. I am trying to train a model with multilabel classification and two text inputs (i.e. a text pair).
I prepared an example dataset with the following format:
I found that using `TextPairClassificationProcessor` with `multilabel=True` seems to work fine to prepare the data for training, which I checked with the debugger. But on training start I get the following error: In the last line, '16' is my batch size and '48' is exactly `batch_size * num_prediction_head_outputs` (I tested this with different batch sizes and label lists). I reached my limits when trying to debug your training and loss-calculation code and was wondering if you could help me find a solution. Is FARM suitable for doing multilabel with text pairs? I would like to contribute and make this use case more accessible, but currently I do not know where to start looking for a fix. Maybe you have an idea? Any help would be appreciated :)
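The mismatched numbers in the error can be reproduced with a few lines of plain Python. The sketch below rests on an assumption about the internals (a single-label head feeds a loss like `CrossEntropyLoss`, which expects one class index per example, while a multilabel head needs a loss like `BCEWithLogitsLoss`, which expects one 0/1 flag per label per example); it is not taken from FARM's source.

```python
# Sketch of where '16' and '48' come from. A single-label loss wants a target of
# length batch_size (one class index per example). A multilabel target instead
# carries one 0/1 flag per label per example; if it reaches a single-label loss,
# it gets flattened, so the loss sees batch_size * num_labels target values.

batch_size = 16  # the '16' in the error message
num_labels = 3   # e.g. label_list = ["0", "1", "2"]

single_label_target = [1] * batch_size        # shape (16,): what the loss expects
multilabel_target = [[0, 1, 1]] * batch_size  # shape (16, 3): what multilabel provides

flattened = [flag for row in multilabel_target for flag in row]
print(len(single_label_target))  # 16
print(len(flattened))            # 48 == batch_size * num_labels, the '48' in the error
```

This is consistent with the observation above that the second number is always exactly `batch_size * num_prediction_head_outputs`, and with the earlier `(10)` vs `(50)` report for batch size 10 and five labels.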