facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Multilingual RoBERTa? #952

Closed: praateekmahajan closed this issue 4 years ago

praateekmahajan commented 5 years ago

Thanks for the easy-to-understand paper and repo. This issue is more of a question about the paper than the repo. The convenient thing about BERT (from an industry perspective) was the availability of a multilingual model and tokenizer.

I was wondering whether there are plans for a multilingual RoBERTa?

And a few follow-up (less important) questions:

  1. Wikipedia is available in hundreds of languages, but what about CC-News?
  2. Will you replicate the dataset exactly for all languages?
myleott commented 5 years ago

We're considering a multilingual RoBERTa, but do not currently have concrete plans. CC-News does have decent cross-language coverage.

We do obtain very good results on XNLI in the TRANSLATE-TEST setting with the current model, so you may have success translating into English depending on your task. Also the RoBERTa tokenizer (BPE) can encode any unicode string, so there won't be unknown tokens regardless of input language, but I wouldn't necessarily expect good performance out of the box on non-English text.
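
As a rough illustration of the "no unknown tokens" behavior (a minimal sketch using the torch.hub interface from the fairseq RoBERTa README; the non-English sentence is just an arbitrary example):

```python
import torch

# Load the pretrained English RoBERTa model via torch.hub,
# as documented in the fairseq RoBERTa README.
roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')
roberta.eval()

# The byte-level BPE can encode any unicode string, so even non-English
# text produces no <unk> tokens -- though the representations were only
# trained on English data, so quality on such text is not guaranteed.
tokens = roberta.encode('これは日本語の文です。')  # arbitrary Japanese sentence
print(tokens)                  # tensor of BPE token ids (with <s> ... </s>)
print(roberta.decode(tokens))  # round-trips back to the original string
```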

In the meantime you might also try XLM: https://github.com/facebookresearch/XLM/

mortonjt commented 5 years ago

@myleott , is there documentation on the format of these files?

In the pretraining tutorial, I noticed 3 files, namely vocab.bpe, dict.txt and encoder.json, but I haven't seen documentation on those formats yet.

josecannete commented 5 years ago

Any news about a possible multilingual RoBERTa? 😁 @myleott

ngoyal2707 commented 5 years ago

@josecannete It is planned, but there's no fixed release date yet.

mortonjt commented 5 years ago

@ngoyal2707, would a documentation update to the README detailing the contents of vocab.bpe, dict.txt and encoder.json be enough to get started? If I'm not mistaken:

dict.txt is a list of token frequencies (from some training dataset, I presume).

encoder.json is a mapping between ids and their corresponding (sub)words (in BPE format, with the merges specified in vocab.bpe).
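
A quick sketch of how these three files can be inspected in plain Python (the comments reflect my understanding of the GPT-2-style formats used by RoBERTa, not official documentation; the file paths assume local copies):

```python
import json

# encoder.json: JSON object mapping byte-level BPE token strings to integer ids.
with open('encoder.json', 'r', encoding='utf-8') as f:
    encoder = json.load(f)
print(len(encoder), 'BPE tokens')
print(list(encoder.items())[:3])

# vocab.bpe: one BPE merge rule per line ("token_a token_b"), applied in order;
# the first line is a version header (e.g. "#version: 0.2").
with open('vocab.bpe', 'r', encoding='utf-8') as f:
    merges = f.read().split('\n')
print(merges[0])
print(merges[1:4])  # highest-priority merges

# dict.txt: fairseq dictionary, one "<symbol> <count>" pair per line, where the
# symbols are the (stringified) BPE ids and the counts come from the training data.
with open('dict.txt', 'r', encoding='utf-8') as f:
    for line in list(f)[:3]:
        symbol, count = line.split()
        print(symbol, count)
```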

ngoyal2707 commented 4 years ago

Here you go: https://github.com/pytorch/fairseq/tree/master/examples/xlmr
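
A minimal usage sketch, following the torch.hub example in that README (model name and methods as documented there; the sample sentences are arbitrary):

```python
import torch

# Load XLM-R (the multilingual RoBERTa released in examples/xlmr).
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.eval()

# The sentencepiece-based encoder covers 100 languages.
en_tokens = xlmr.encode('Hello world!')
zh_tokens = xlmr.encode('你好，世界！')
print(xlmr.decode(en_tokens))
print(xlmr.decode(zh_tokens))

# Extract contextual features for downstream use.
features = xlmr.extract_features(zh_tokens)
print(features.shape)  # (1, sequence_length, hidden_dim)
```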

Java-2022 commented 4 years ago

@ngoyal2707, please guide me on how to build encoder.json for another language.

XiaoqingNLP commented 4 years ago

Hi @myleott, I can't find any information on how to access the dataset mentioned in XLM-R/mBART. Are there plans to release the dataset?