facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

[Zero-shot] Classification performance #115

Closed naibaf1991 closed 5 years ago

naibaf1991 commented 5 years ago

Hi guys,

first of all, congrats on this great model! I do multilingual tweet classification and it performs stunningly in monolingual cases. I freeze the model and use a linear layer on top of the first output token.

However, I want to do zero-shot classification on Spanish and English tweets and get strange results. When I train the model on Spanish tweets and evaluate on English ones, it performs pretty well. But the other way around, training on English tweets and evaluating on Spanish ones, the performance drops heavily (-20% F1).

Do you have an idea what the reason could be, or do you have general hints regarding zero-shot learning with XLM? I would appreciate any help. Thank you very much in advance!

aconneau commented 5 years ago

Hi, thanks for the kind words! Let me try to understand better: in both cases, you start with an XLM model trained on both English and Spanish tweets, and then you either (1) fine-tune this XLM on an English tweet classification task and look at the results on Spanish, or (2) fine-tune it on a Spanish tweet classification task and look at the results on English?

aconneau commented 5 years ago

*Or do you use one of our pretrained models?

naibaf1991 commented 5 years ago

Hi Alex, thanks for your quick reply! I use the XNLI-15 pretrained model you provide as a feature extractor, so I do not perform any fine-tuning myself. For now, I put a simple Dense layer on top of the first sequence token of the XLM output.

Then, I (1) train my model (the Dense layer) on English tweets and evaluate on Spanish, with very poor results. And (2) train my model on Spanish tweets and evaluate on English, with good results.
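A minimal sketch of the setup described in this thread: a frozen encoder producing hidden states of shape (seq_len, batch, dim) as in this repo, with a single linear layer trained on top of the first token. The `ToyEncoder` is a placeholder standing in for the pretrained XLM model, not the actual API:

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for the pretrained XLM transformer (not the real API)."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, x):
        # returns hidden states shaped (seq_len, batch, dim), as in the repo
        return self.emb(x)

class FirstTokenClassifier(nn.Module):
    def __init__(self, encoder, dim, n_classes):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False             # freeze the encoder weights
        self.head = nn.Linear(dim, n_classes)   # the only trained layer

    def forward(self, x):
        with torch.no_grad():
            h = self.encoder(x)                 # (seq_len, batch, dim)
        return self.head(h[0])                  # logits from the first token

model = FirstTokenClassifier(ToyEncoder(100, 16), 16, 2)
logits = model(torch.randint(0, 100, (7, 3)))   # 7 tokens, batch of 3
```

Only `model.head` receives gradients during training; the encoder stays fixed, which is exactly the situation the discussion below turns on.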

naibaf1991 commented 5 years ago

@aconneau Do you have an idea what could be the reason for this performance or do you have general hints regarding zero-shot learning with XLM? Thank you very much in advance!

aconneau commented 5 years ago

@naibaf1991 : I believe you're using the XNLI-15 model in the wrong way, in the sense that you're using it as a fixed-size sentence encoder while it hasn't been trained to be a fixed-size cross-lingual sentence embedding method. If you really want a fixed-size sentence representation, I would try to use the average of the hidden states instead of the first one, as the first hidden state (if not fine-tuned) is not meant a priori to incorporate all the information about the input sentence. If your implementation allows it, you should fine-tune the XLM model.
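If the encoder stays frozen, the averaging suggested here could look like the mask-aware pooling below. Names and shapes are assumptions following the repo's (seq_len, batch, dim) convention, not actual XLM functions:

```python
import torch

def mean_pool(h, lengths):
    """Average hidden states over real (non-padded) positions.

    h: (seq_len, batch, dim) hidden states; lengths: (batch,) token counts.
    Both are assumed inputs following the repo's shape conventions.
    """
    seq_len = h.size(0)
    # mask[t, b] is 1.0 while position t is a real token in sentence b
    mask = (torch.arange(seq_len).unsqueeze(1) < lengths.unsqueeze(0)).float()
    summed = (h * mask.unsqueeze(-1)).sum(dim=0)    # (batch, dim)
    return summed / lengths.unsqueeze(1).float()    # fixed-size sentence embedding
```

Masking matters here: naively calling `h.mean(0)` would average padding positions into the shorter sentences' embeddings.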

naibaf1991 commented 5 years ago

@aconneau : Thanks for your reply! From your paper, you state for the XNLI task:

We use the first hidden state of the last layer of the transformer as input to the randomly initialized final linear classifier, and fine-tune all parameters. In our experiments, using either max-pooling or mean-pooling over the last layer did not work better than using the first hidden state.

Thus, I assumed that this would also be applicable to a binary classification task, especially as we found that using the final layer output worked better for us than the first hidden state.

We also tried averaging, as you proposed, but we still have the problem of weak performance with English training data + Spanish test data, which we do not have vice versa. Could you think of an explanation for this? Could it be that XLM detects "deeper" latent structures in English data and thus generalizes worse when trained on English and evaluated on Spanish?

aconneau commented 5 years ago

This is applicable to a binary classification task, but the first hidden state becomes a "summary" of the input sentence only when it is fine-tuned. If you do not fine-tune it, it's just one hidden state among others. That's why, if you don't fine-tune, I was suggesting you use the average. I am not sure I have a clear explanation for your English->Spanish vs Spanish->English results. I would try harder to make it work (for instance by fine-tuning) before drawing conclusions.
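A sketch of what the recommended fine-tuning could look like, with toy modules standing in for XLM and the classifier head. The two-learning-rate split (small for pretrained weights, larger for the new head) is a common practice, not something prescribed by this repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(16, 16)   # placeholder for the pretrained XLM transformer
head = nn.Linear(16, 2)       # randomly initialized binary classifier

# Two parameter groups: gentle updates for pretrained weights,
# a larger learning rate for the freshly initialized head.
optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

x = torch.randn(4, 16)             # toy batch of 4 "sentence" vectors
y = torch.tensor([0, 1, 0, 1])     # binary labels
loss = F.cross_entropy(head(encoder(x)), y)
loss.backward()                    # gradients flow into BOTH modules
optimizer.step()
```

The key difference from the frozen setup is that `encoder.parameters()` is in the optimizer and keeps `requires_grad=True`, so the first hidden state can actually learn to summarize the sentence.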

naibaf1991 commented 5 years ago

We found the solution to the zero-shot / few-shot problem: we had given each language its respective language tag, which caused the poor performance. If we instead give the same language tag to both languages, the performance is great again. Do you have an explanation for this?
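For context, a hedged sketch of what "giving the same language tag" could look like: XLM's forward pass takes a `langs` tensor of the same shape as the token ids, and the workaround reported here amounts to filling it with a single language id for both English and Spanish inputs. The id value below is illustrative, not the actual id from the pretrained vocabulary:

```python
import torch

EN_LANG_ID = 0   # hypothetical language id, for illustration only

def make_langs(token_ids, lang_id=EN_LANG_ID):
    # one language id per token position, same shape as the token ids
    return torch.full_like(token_ids, lang_id)

x = torch.randint(0, 100, (7, 2))   # (seq_len, batch) token ids
langs = make_langs(x)               # same shape, all EN_LANG_ID
```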

Furthermore, thanks for your hint regarding the first state; we changed our top layer and improved our results. Of course, I'm sure that fine-tuning would help even more.

Ayush-iitkgp commented 4 years ago

I fine-tuned the XLM model on English annotated data (around 8 million sentences and 201 classes) and used the model to predict on Spanish; the accuracy is around 17%. Could @naibaf1991 or @aconneau help me understand why the performance is so bad? Also, the performance on Spanish decreased as I trained the model for more epochs. Any help would be appreciated. TIA!

AiSingularity commented 4 years ago

@naibaf1991 Hello, I have a similar problem to yours. What did you mean by giving "language tags" in your comment? Thanks