facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

How to use XLM-R? #337

Closed leo-liuzy closed 3 years ago

leo-liuzy commented 3 years ago

Thanks for the great code base and tutorial for XLM.

From my understanding, this repository only works with XLM and BPE, not with XLM-R and SentencePiece.

I am trying to use SentencePiece with the model, but it seems very hard. I tried loading XLM-R's dict.txt with your Dictionary class, but the tokenization somehow results in lots of UNK tokens. Could you give some insights and guidance?

To get around this, I decided to use the sentencepiece model and Hugging Face's tokenizer in place of Dictionary, but the problem arises at this line. I couldn't find any public resource for the count of each token. Could you release the "token2count" file?

Litsay commented 3 years ago

Hello! I have a similar problem. Have you made any progress?

leo-liuzy commented 3 years ago

Hi, I just figured out today that you can use the multilingual_masked_lm task in fairseq. You first run the sentencepiece tokenizer to tokenize the corpus (this is a MUST!), then run fairseq-preprocess to binarize the corpus, and then you are good to do whatever.
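The "tokenize first, then binarize" step can be sketched as below. The `toy_pieces` tokenizer is a stand-in assumption for illustration; with XLM-R you would load `sentencepiece.bpe.model` into a `sentencepiece.SentencePieceProcessor` and call `encode_as_pieces` instead:

```python
def tokenize_corpus(lines, tokenize):
    """Tokenize each line and space-join the pieces, producing the
    plain-text format that fairseq-preprocess expects as input."""
    return [" ".join(tokenize(line)) for line in lines]

def toy_pieces(line):
    # Stand-in tokenizer (assumption): with XLM-R you would call
    # sp.encode_as_pieces(line) on the loaded sentencepiece model.
    return ["\u2581" + w for w in line.split()]

tokenized = tokenize_corpus(["hello world"], toy_pieces)
# The tokenized text is then written to a file, binarized with
# fairseq-preprocess, and trained with --task multilingual_masked_lm.
```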

The huge number of UNKs was caused by not tokenizing first, because fairseq just does a naive lookup with the Dictionary.
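A toy example of why the naive lookup produces UNKs. The vocabulary here is a made-up stand-in for the entries in dict.txt; raw words are not themselves subword pieces, so every one of them misses the table until the corpus is sentencepiece-tokenized first:

```python
# Hypothetical piece vocabulary (assumption): real entries come
# from XLM-R's dict.txt, which contains sentencepiece pieces.
vocab = {"\u2581he": 0, "llo": 1, "\u2581wor": 2, "ld": 3, "<unk>": 4}

def lookup(tokens, vocab):
    # Mimics a naive Dictionary lookup: exact string match or <unk>.
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

raw = "hello world".split()                      # not tokenized
ids_raw = lookup(raw, vocab)                     # every word -> <unk>

pieces = ["\u2581he", "llo", "\u2581wor", "ld"]  # after sentencepiece
ids_pieces = lookup(pieces, vocab)               # real ids, no <unk>
```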

For the count thing, I haven't figured out whether I need it; but you can certainly get the log prob of each piece from the sentencepiece model released with xlmr.base/large (using get_score(token_id) or something similar), and since counts and log probs have the same order, you can use the log probs for word sampling if you need to.
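A minimal sketch of that substitution, assuming you have already read per-piece log probs out of the sentencepiece model (the values below are made up): exponentiating the log probs gives relative frequencies that preserve the same ranking counts would, which is enough for sampling:

```python
import math

# Toy per-piece log probs (assumption): in practice these would come
# from the released sentencepiece model, e.g. via get_score(token_id).
log_probs = {"\u2581the": -2.0, "\u2581cat": -6.5, "\u2581sat": -7.1}

# Counts and log probs induce the same ordering, so log probs can
# stand in for counts: exponentiate and normalize to get weights.
weights = {p: math.exp(lp) for p, lp in log_probs.items()}
total = sum(weights.values())
probs = {p: w / total for p, w in weights.items()}

# Ranking by log prob matches the (unavailable) ranking by count.
ranked = sorted(log_probs, key=log_probs.get, reverse=True)
```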

Litsay commented 3 years ago

Thank you for your advice! It has helped me a lot:)