facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

Using TLM for monolingual text classification #139

Open foolooo opened 5 years ago

foolooo commented 5 years ago

Thank you for the excellent work.

So I have been trying to leverage my labelled English data to do short text (sentence) classification for Spanish. First I'm comparing results for monolingual (English) classification using the TLM model and BERT. I got 79% accuracy with BERT and 73% with TLM. I tried fine-tuning a linear classifier on a) the 1st hidden layer, b) all hidden layers + max pooling, and c) all hidden layers + mean pooling. The best result I got was with c), but it is still not close to BERT. I haven't tried the MLM model though, since I want to use TLM for other languages as well.

The configuration for my training is as below:

- learning rate = 1e-5 (tried 5e-6, 5e-5, 1e-4)
- batch size = 8 (tried 16, 32)
- optimizer = Adam

There are 200 classes with about 20,000 samples.

Do you have any suggestions on how I could improve this? Thanks a lot!
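For concreteness, here is a minimal sketch of the fine-tuning head described above, assuming pooled XLM features have already been extracted (names such as `training_step` and the feature tensor are illustrative, not code from this thread):

```python
import torch
import torch.nn as nn

# Sketch of the reported setup: a linear classifier over pooled XLM features,
# 200 classes, Adam with lr = 1e-5. emb_dim = 1024 matches the
# mlm_tlm_xnli15_1024 checkpoint; all names here are illustrative.
emb_dim, n_classes = 1024, 200
classifier = nn.Linear(emb_dim, n_classes)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()

def training_step(pooled_features, labels):
    """One step; `pooled_features` is (batch, emb_dim) from the XLM encoder,
    obtained via option a), b) or c) above."""
    logits = classifier(pooled_features)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that if only the linear layer is trained while the encoder stays frozen, some gap to a fully fine-tuned BERT is expected; as far as I can tell, XLM's own XNLI/GLUE fine-tuning scripts update the encoder parameters as well.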

airkid commented 5 years ago

Hi, I have been trying something similar for monolingual classification. I'm curious: did you use the first token of the input sentence (</s>) for classification, or another strategy? Thanks!

foolooo commented 5 years ago

> Hi, I have been trying something similar for monolingual classification. I'm curious: did you use the first token of the input sentence (</s>) for classification, or another strategy? Thanks!

Hi airkid,

I tried using the first token and also the mean of all tokens. So far the mean of all tokens works best (a 2% improvement), but it is still not close to BERT. I'm not sure if I missed something, or whether TLM is expected to be lower (6 percentage points in my case) than MLM for monolingual classification.

Thanks
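As an aside on the two strategies discussed above: with padded batches, mean pooling should mask out padding positions, otherwise short sentences get diluted and a few accuracy points can be lost. A minimal sketch, assuming the encoder API shown in this repo's embedding-generation example (`pool_features` is an illustrative name):

```python
import torch

def pool_features(model, word_ids, lengths, mode="mean"):
    # Last-layer hidden states, shaped (seq_len, batch_size, emb_dim) as in
    # the repo's generate-embeddings example (API assumed, not verified here).
    hidden = model('fwd', x=word_ids, lengths=lengths, causal=False)
    if mode == "first":
        # Position 0 holds the leading </s> token in XLM's input format,
        # playing a role similar to BERT's [CLS] vector.
        return hidden[0]
    # Mask padding positions before averaging.
    seq_len = hidden.size(0)
    mask = (torch.arange(seq_len, device=lengths.device)[:, None] < lengths[None, :]).float()
    summed = (hidden * mask.unsqueeze(-1)).sum(0)
    return summed / lengths.unsqueeze(1).float()
```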

airkid commented 5 years ago

> I tried using the first token and also the mean of all tokens. So far the mean of all tokens works best (a 2% improvement), but it is still not close to BERT. I'm not sure if I missed something, or whether TLM is expected to be lower (6 percentage points in my case) than MLM for monolingual classification.

Hi foolooo, thanks for your reply and the experiment results!

aconneau commented 5 years ago

"Firstly I'm comparing the result for monolingual (English) using TLM model and BERT."

Which BERT model are you using, and which TLM model?

foolooo commented 5 years ago

"Firstly I'm comparing the result for monolingual (English) using TLM model and BERT."

Which BERT model are you using, and which TLM model?

Hi aconneau,

Thank you for your reply. I'm using uncased_L-12_H-768_A-12 for BERT and mlm_tlm_xnli15_1024.pth for TLM. Thanks.
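For reference, a sketch of reloading the mlm_tlm_xnli15_1024.pth checkpoint, following the pattern from this repo's generate-embeddings notebook (the path and the exact checkpoint keys are assumed from that example):

```python
import torch
from src.utils import AttrDict
from src.data.dictionary import Dictionary
from src.model.transformer import TransformerModel

# Reload the pretrained checkpoint (path is illustrative).
reloaded = torch.load('mlm_tlm_xnli15_1024.pth')
params = AttrDict(reloaded['params'])
dico = Dictionary(reloaded['dico_id2word'], reloaded['dico_word2id'], reloaded['dico_counts'])

# Build the encoder and load the pretrained weights
# (positional args: is_encoder=True, with_output=True).
model = TransformerModel(params, dico, True, True)
model.load_state_dict(reloaded['model'])
model.eval()
```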

aconneau commented 5 years ago

If you have a Spanish task, your Spanish MLM (or MLM+TLM) model will perform better than your multilingual MLM (or MLM+TLM) model. So the fact that you get worse results with a multilingual model is not surprising; however, 6% seems big, and I would have expected around -3% instead (which is on average what I observed on GLUE tasks). Could there be a problem with tokenization (did you apply the right tokenization in both settings)?
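One quick way to check the tokenization point above: after running sentences through the same pipeline used for the pretrained model (lowercasing/accent removal, Moses tokenization, then the XNLI-15 BPE codes), very few tokens should fall outside the model's vocabulary. A hedged sketch, assuming the repo's `Dictionary` helper exposes `index()` and `unk_index` (the function name is illustrative):

```python
def unk_rate(dico, bpe_sentence):
    """Fraction of BPE tokens mapped to <unk>; a high rate usually means the
    BPE codes or tokenizer do not match the pretrained model's preprocessing.
    `bpe_sentence` is a whitespace-joined, already BPE-encoded string."""
    ids = [dico.index(w) for w in bpe_sentence.split()]
    return sum(i == dico.unk_index for i in ids) / max(len(ids), 1)

# Example (string already BPE-encoded with fastBPE's "@@" continuation marker):
# print(unk_rate(dico, "this is an ex@@ ample sentence"))
```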