
Confused when fine-tuning XLM-R as a language model on a monolingual dataset #1724

Closed Luvata closed 4 years ago

Luvata commented 4 years ago

❓ Questions and Help

I want to fine-tune the XLM-R language model on my additional monolingual dataset. After some research, I think my steps are:

What is your question?

The preprocessing part seems to work correctly, but for training I'm really confused about which task and model architecture to choose.
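For context, this is roughly the kind of preprocessing I mean (a minimal sketch, assuming the sentencepiece.bpe.model and dict.txt that ship with the xlmr.large download; paths are placeholders):

```python
# Sketch only: encode raw text with XLM-R's sentencepiece model,
# then binarize it with fairseq-preprocess using the released dictionary.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("xlmr.large/sentencepiece.bpe.model")

with open("corpus.raw.txt") as fin, open("corpus.spm.txt", "w") as fout:
    for line in fin:
        fout.write(" ".join(sp.EncodeAsPieces(line.strip())) + "\n")

# Afterwards, e.g.:
#   fairseq-preprocess --only-source \
#       --srcdict xlmr.large/dict.txt \
#       --trainpref corpus.spm.txt \
#       --destdir data-bin/my_monolingual \
#       --workers 8
```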

I see masked_lm, language_modeling and multilingual_masked_lm. In addition, the XLM-R README mentions that

xlmr.large | XLM-R using the BERT-large architecture

But I also see that XLM-R is a subclass of RoBERTa, so which --arch should I use: roberta_large, bert_large, or xlm?
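For what it's worth, a quick check (assuming a recent fairseq install) does suggest the XLM-R model class is built on top of RobertaModel:

```python
# Confirm that fairseq's XLM-R model class derives from RobertaModel.
from fairseq.models.roberta import RobertaModel, XLMRModel

print(issubclass(XLMRModel, RobertaModel))  # expected: True
```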

Thank you in advance

ngoyal2707 commented 4 years ago

Is your monolingual data in one of XLM-R's 100 languages?

Luvata commented 4 years ago

Yes, it is.

ngoyal2707 commented 4 years ago

Sorry, it's a bit confusing. Use --arch roberta_large and --task multilingual_masked_lm.
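Roughly something like the following should work; treat it as a minimal sketch rather than a tuned recipe (the data path, checkpoint location, batch size, and schedule are placeholders), warm-starting from the released xlmr.large checkpoint:

```python
# Sketch of the fine-tuning invocation (equivalent to calling fairseq-train
# from the command line); all paths and hyperparameters are placeholders.
import sys
from fairseq_cli import train

sys.argv = [
    "fairseq-train", "data-bin/my_monolingual",
    # Note: if I remember correctly, multilingual_masked_lm expects the
    # binarized data under per-language subdirectories of this path.
    "--task", "multilingual_masked_lm",
    "--arch", "roberta_large",
    "--criterion", "masked_lm",
    "--restore-file", "xlmr.large/model.pt",
    "--reset-optimizer", "--reset-dataloader", "--reset-meters",
    "--tokens-per-sample", "512",
    "--optimizer", "adam",
    "--lr", "0.0001",
    "--lr-scheduler", "polynomial_decay",
    "--warmup-updates", "1000",
    "--total-num-update", "100000",
    "--max-update", "100000",
    "--max-sentences", "8",
    "--update-freq", "4",
]
train.cli_main()
```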