flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Several issues with TARS multilabel scores #2026

Closed lucaventurini closed 3 years ago

lucaventurini commented 3 years ago

Hi!

I have been trying the tutorial for TARS, and noticed some unexpected behaviours.

Usually, when I want to retrieve multilabel scores for all the classes, I do something like clf.multi_label_threshold = 0.0000001.

If I do it with a pretrained task, sometimes I get the expected behaviour:

# 3. Switch to a particular task that exists in the above list
tars.switch_to_task("GO_EMOTIONS")

# 4. Prepare a test sentence
sentence = Sentence("I absolutely love this!")
tars.multi_label_threshold = 0.0000001 # to get all labels' scores
tars.predict(sentence)
print(sentence)
Sentence: "I absolutely love this !"   [− Tokens: 5  − Sentence-Labels: {'label': [NEUTRAL (0.0033), ANGER (0.0008), FEAR (0.0001), ANNOYANCE (0.0004), SURPRISE (0.0006), GRATITUDE (0.0022), DESIRE (0.0015), OPTIMISM (0.0008), ADMIRATION (0.186), CONFUSION (0.0001), AMUSEMENT (0.0026), APPROVAL (0.0314), CARING (0.0006), EMBARRASSMENT (0.0001), REALIZATION (0.0015), DISAPPOINTMENT (0.0004), GRIEF (0.0001), SADNESS (0.0002), CURIOSITY (0.0001), JOY (0.139), LOVE (0.9833), EXCITEMENT (0.2807), DISAPPROVAL (0.0003), REMORSE (0.0), DISGUST (0.0002), RELIEF (0.0002), PRIDE (0.0022), NERVOUSNESS (0.0002)]}]

sometimes I get only one label:

# 3. Switch to a particular task that exists in the above list
tars.switch_to_task("IMDB")

# 4. Prepare a test sentence
sentence = Sentence("I absolutely love this!")
tars.multi_label_threshold = 0.0000000001 # to get all labels' scores
tars.predict(sentence)
print(sentence)
Sentence: "I absolutely love this !"   [− Tokens: 5  − Sentence-Labels: {'label': [positive movie review (0.1077)]}]

(Btw, this score seems a bit low in this case, but that's not the topic of this issue.)

This also happens in the zero-shot case:

# 1. Load our pre-trained TARS model for English
tars = TARSClassifier.load('tars-base')

# 2. Prepare a test sentence
sentence = Sentence("I absolutely love this!")

# 3. Define some classes that you want to predict using descriptive names
classes = ["positive review", "positive sentiment", "negative review", "negative sentiment"]
tars.multi_label_threshold = 0.0000001 # to get all labels' scores

# 4. Predict for these classes
tars.predict_zero_shot(sentence, classes)

# Print sentence with predicted labels
print(sentence)
Sentence: "I absolutely love this !"   [− Tokens: 5]

In the case above I didn't get even a single score.

Also, a few times when setting multi_label=True I got repeated predictions for the majority label (i.e. the class/score tuple appears twice in the array of sentence labels), but this happened somewhat randomly, so I cannot show you how to reproduce it. In those cases, too, I didn't get the scores for the other labels as I would have liked.

kishaloyhalder commented 3 years ago

Hi @lucaventurini

Thanks for trying out TARS and reaching out about this issue. Let me try to explain how to go about both the issues:

  1. Regarding not being able to see all the classes after switching to the "IMDB" task: TARSClassifier internally memorizes the structure of the individual tasks it was trained on, e.g. "GO_EMOTIONS" is a multi-label task, "IMDB" is a multi-class task, and so on. In your example you set multi_label_threshold to a small value, but the multi_label flag is still False, so TARS gives you only the class with the highest score. You can use the following code to get the desired result:
from flair.models.text_classification_model import TARSClassifier 
from flair.data import Sentence

tars = TARSClassifier.load('tars-base')
tars.switch_to_task("IMDB") 
sentence = Sentence("I absolutely love this!")   
tars.multi_label_threshold = 0.0 #or any other small value
tars.multi_label = True 
tars.predict(sentence) 
print(sentence)

It would output scores for all the classes:

Sentence: "I absolutely love this !"   [− Tokens: 5  − Sentence-Labels: {'label': [positive movie review (0.1077), negative movie review (0.0002)]}]
  2. Regarding the zero-shot prediction: we deliberately kept the predict_zero_shot interface simple so that you can quickly try out an ad-hoc set of labels. It always predicts with multi_label_threshold=0.5, regardless of the tars.multi_label_threshold attribute. I would suggest adding the labels as a task, as shown in the following; then you can use any threshold you like.
    
from flair.models.text_classification_model import TARSClassifier
from flair.data import Sentence

tars = TARSClassifier.load('tars-base')
classes = ["positive review", "positive sentiment", "negative review", "negative sentiment"]

tars.add_and_switch_to_new_task("my_task", label_dictionary=classes, multi_label=True, multi_label_threshold=0.0)

sentence = Sentence("I absolutely love this!")
tars.predict(sentence)
print(sentence)

It should output the following:
Sentence: "I absolutely love this !"   [− Tokens: 5  − Sentence-Labels: {'label': [negative sentiment (0.0003), negative review (0.0003), positive review (0.1675), positive sentiment (0.0854)]}]

About the last issue of the same label repeating multiple times: please make sure you use a different Sentence object for each call to predict. If you call predict on the same sentence object repeatedly, Flair just keeps appending labels to that object. If that was not the cause, please reach out.
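To illustrate the append behaviour, here is a toy stand-in (the FakeSentence class and fake_predict function are hypothetical, purely for illustration; this is not Flair code):

```python
class FakeSentence:
    """Toy stand-in for flair.data.Sentence, used only to show
    why labels can appear duplicated."""
    def __init__(self, text):
        self.text = text
        self.labels = []

def fake_predict(sentence):
    # Mimics predict(): it appends to the sentence's label list
    # rather than replacing it.
    sentence.labels.append(("LOVE", 0.9833))

reused = FakeSentence("I absolutely love this!")
fake_predict(reused)
fake_predict(reused)            # same object predicted twice
print(len(reused.labels))       # -> 2: the label is duplicated

fresh = FakeSentence("I absolutely love this!")
fake_predict(fresh)             # a new object per call
print(len(fresh.labels))        # -> 1
```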

Hope this helps. Let me know if you need anything else.

with regards, Kishaloy

lucaventurini commented 3 years ago

Hi @kishaloyhalder ,

thanks a lot for your detailed answer!

First of all, I confirm that I was able to retrieve the scores in all cases with your code, so thank you.

I am only confused by the choice of name for this parameter, multi_label. From your explanation above, it seems it's not related to the actual classifier being multi-label or multi-class; it's just an option to retrieve either all the scores or only the ones above the set threshold, true by default for multi-label tasks. Is that so? If yes, I would rather call it get_all_scores or something similar.

I say this because transformers has a similar multi_class parameter in its zero-shot pipeline (maybe they should more properly have called it multi_label; this is also a bit confusing imho): https://discuss.huggingface.co/t/new-pipeline-for-zero-shot-text-classification/681 . Setting it to true actually changes the way the probabilities are computed, as you can see in the same discussion a few posts after the announcement.
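For reference, the difference discussed in that thread can be sketched like this: with multi_class=False the per-label entailment scores compete through a softmax, while with multi_class=True each label is scored independently with a sigmoid. The logits below are made up purely for illustration:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up entailment logits for three candidate labels.
logits = [2.0, 0.5, -1.0]

# multi_class=False style: label scores compete and sum to 1.
competing = softmax(logits)

# multi_class=True style: each label is scored independently,
# so the scores do not sum to 1.
independent = [sigmoid(x) for x in logits]
```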

So, seeing a similar parameter in Flair, I was expecting similar behaviour. Actually, could you please elaborate a bit on the differences in the way you handle multi-label classification, compared to the transformers pipeline? This would help me understand if it fits my tasks better.

kishaloyhalder commented 3 years ago

Hi @lucaventurini ,

Happy to be able to help!

I understand your confusion. Usually multi-class refers to a classification problem where one out of multiple classes can be true for a data point, and multi-label refers to the tasks where more than one label can be true for a data point (reference). In general, Flair's TextClassifier class follows this notion, which we respect in TARSClassifier as well.

About your question regarding how TARS differentiates between these two kinds of task: your understanding is correct. As mentioned in our paper, internally TARS treats every <label, text> combination as a binary text classification problem, and returns the labels that score above a certain multi_label_threshold (when multi_label=True) or only the one with the highest score (when multi_label=False).
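That selection step can be sketched as follows (an illustrative reimplementation of the rule just described, not Flair's actual code; the scores are taken from the IMDB example above):

```python
def select_labels(scores, multi_label, threshold):
    """Pick labels from per-<label, text> binary scores.

    multi_label=True:  every label whose score exceeds the threshold.
    multi_label=False: only the single highest-scoring label.
    """
    if multi_label:
        return {label: s for label, s in scores.items() if s > threshold}
    best = max(scores, key=scores.get)
    return {best: scores[best]}

scores = {"positive movie review": 0.1077, "negative movie review": 0.0002}

print(select_labels(scores, multi_label=False, threshold=0.5))
# -> {'positive movie review': 0.1077}

print(select_labels(scores, multi_label=True, threshold=0.0))
# -> both labels with their scores
```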

Hope this helps! Feel free to shoot any further questions.

with regards, Kishaloy

lucaventurini commented 3 years ago

Thank you @kishaloyhalder ,

I think we agree on the definitions of multiclass and multilabel. That's why I was surprised by the name of the parameter in transformers.

So, to recap: internally, TARS is always a multi-label classifier. The multi_label parameter is just a convenience option: when it's False, you get only the single label with the highest score above the threshold. There are no other changes to the model or to the way the scores are computed.

I think that this, in theory, should have two corollaries, i.e. that for a given tars model and a given text input:

  1. When adding a class to ZSL, the scores we get for all the other classes will always be the same;
  2. If two tasks share a label, i.e. the class label is written identically, its score will be the same regardless of which task is active, whether it is multi-class or multi-label, and whether it is pre-trained or zero-shot.

Correct?

kishaloyhalder commented 3 years ago

Yes, the two corollaries are correct. In the second case, my suggestion would be to add something contextual to the labels so that they become slightly different labels in natural language. We did exactly this when training the model on Amazon reviews and Yelp reviews: both have labels 1, 2, ..., 5. We converted them to "Positive Product Review", "Very Positive Product Review", etc. for Amazon and "Positive Restaurant Review", "Very Positive Restaurant Review", etc. for Yelp (tables 5 and 6 in reference).
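That relabelling can be expressed as a simple mapping before adding each task (the label strings below are illustrative lower-case variants, not the exact ones from the paper's tables):

```python
# Shared star ratings 1..5 are mapped to domain-specific
# natural-language labels so the two tasks no longer collide
# on identical label strings.
sentiment = ["very negative", "negative", "neutral", "positive", "very positive"]

amazon_labels = {star: f"{name} product review"
                 for star, name in enumerate(sentiment, start=1)}
yelp_labels = {star: f"{name} restaurant review"
               for star, name in enumerate(sentiment, start=1)}

print(amazon_labels[5])   # -> "very positive product review"
print(yelp_labels[5])     # -> "very positive restaurant review"
```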

Hope this helps @lucaventurini !

with regards, Kishaloy

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.