We're considering a multilingual RoBERTa, but do not currently have concrete plans. CC-News does have decent cross-language coverage.
We do obtain very good results on XNLI in the TRANSLATE-TEST setting with the current model, so you may have success translating into English depending on your task. Also the RoBERTa tokenizer (BPE) can encode any unicode string, so there won't be unknown tokens regardless of input language, but I wouldn't necessarily expect good performance out of the box on non-English text.
In the meantime you might also try XLM: https://github.com/facebookresearch/XLM/
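To illustrate the tokenizer point, here is a minimal sketch using fairseq's torch.hub interface (the sample Japanese string is illustrative; any unicode text works):

```python
import torch

# Load the pretrained English RoBERTa base model via torch.hub
roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')
roberta.eval()

# The byte-level BPE can encode any unicode string without <unk> tokens,
# even though the model itself was pretrained on English text only.
tokens = roberta.encode('こんにちは世界')  # Japanese: "Hello, world"
print(tokens)                              # tensor of subword ids, no unknowns
print(roberta.decode(tokens))              # losslessly round-trips the input
```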
@myleott, is there documentation on the format of these files? In the pretraining tutorial, I notice three files, namely `vocab.bpe`, `dict.txt`, and `encoder.json`, but I haven't seen documentation for those formats yet.
Any news about a possible multilingual RoBERTa? 😁 @myleott
@josecannete We have it planned, but there's no fixed release date yet.
@ngoyal2707, would a documentation update to the README detailing the contents of `vocab.bpe`, `dict.txt`, and `encoder.json` be enough to get started? If I'm not mistaken, `dict.txt` is a count of the word frequencies (in some training dataset, I presume), and `encoder.json` is a mapping between ids and their corresponding words (in BPE format, specified in `vocab.bpe`).
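For reference, here is a minimal sketch of how those three files can be inspected, assuming the standard GPT-2 byte-level BPE layout that fairseq's RoBERTa uses (file paths are illustrative):

```python
import json

# encoder.json: maps each BPE token string to an integer id
with open("encoder.json") as f:
    encoder = json.load(f)
print(len(encoder), "BPE tokens")   # e.g. 50257 for the GPT-2 vocabulary
print(encoder.get("hello"))         # id of the token "hello", if present

# vocab.bpe: ranked BPE merge rules, one pair of symbols per line
# (the first line is a "#version" header, hence the slice below)
with open("vocab.bpe", encoding="utf-8") as f:
    merges = [tuple(line.split()) for line in f.read().split("\n")[1:-1]]
print(merges[0])                    # the highest-priority merge rule

# dict.txt: fairseq dictionary, one "<symbol> <count>" pair per line
with open("dict.txt") as f:
    for line in list(f)[:5]:
        symbol, count = line.rsplit(" ", 1)
        print(symbol, int(count))
```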
Please guide me on how to build `encoder.json` for another language.
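One common approach is to train a byte-level BPE on your own corpus with the Hugging Face `tokenizers` library, which produces files in the same spirit as `encoder.json`/`vocab.bpe`. A hedged sketch follows; the corpus path, vocabulary size, and output directory are illustrative, and the resulting `vocab.json`/`merges.txt` would still need to be renamed or adapted to whatever file names fairseq expects:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE (the same family of tokenizer GPT-2/RoBERTa use)
# on a monolingual corpus in the target language.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["my_language_corpus.txt"],   # illustrative path
    vocab_size=50000,                   # illustrative size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json (token -> id map, analogous to encoder.json)
# and merges.txt (merge rules, analogous to vocab.bpe).
tokenizer.save_model("my_language_bpe")
```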
@myleott Hi, I can't find any information about how to access the dataset mentioned in XLM-R/mBART. Are there plans to release the dataset?
Thanks for the easy-to-understand paper and repo. This issue is more of a question about the paper than the repo. The convenient thing about BERT (from an industry perspective) was the availability of a multilingual model and tokenizer. I was hoping to find out whether there are plans for a multilingual RoBERTa? And a few follow-up (not important) questions: