Hi, first of all, thank you for your great work on multilingual NLP.
I'm trying to replicate XLM-R in my own research, and I found that the corpus from statmt is quite different from what is described in the XLM-R paper.
For example, for Esperanto the paper reports 157M tokens, but the statmt version actually contains about 290M tokens.
I tokenized with both sentencepiece + fairseq-preprocess and the transformers tokenizer (xlm-roberta-base) to double-check.
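For reference, this is roughly how I counted tokens with the transformers tokenizer (a minimal sketch; the corpus file name `eo.txt` is just a placeholder for the Esperanto split):

```python
from transformers import AutoTokenizer

# Load the XLM-R subword tokenizer.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

total_tokens = 0
with open("eo.txt", encoding="utf-8") as f:  # one sentence per line
    for line in f:
        line = line.strip()
        if line:
            # Count subword pieces only, without the <s>/</s> special tokens.
            total_tokens += len(tokenizer.tokenize(line))

print(f"Total tokens: {total_tokens:,}")
```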
I would guess the contents of the two corpora are similar (I know CC-100 is based on web scraping) since the file sizes are about the same (roughly 0.9 GiB), so what makes the token counts so different?