facebookresearch / cc_net

Tools to download and cleanup Common Crawl data
MIT License

CC-100 in statmt version is different from paper #48

Open nbqu opened 1 year ago

nbqu commented 1 year ago

Hi, first of all, thank you for your great work on multilingual NLP. I'm trying to replicate XLM-R in my own research, and I found that the corpus from statmt is quite different from the description in the XLM-R paper. For example, for Esperanto the paper reports 157M tokens, but the statmt version actually contains about 290M tokens. I tokenized with both sentencepiece + fairseq-preprocess and the transformers tokenizer (xlm-roberta-base) to double-check.
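For reference, a count along these lines can be reproduced with a sketch like the one below. The file path is a placeholder for the Esperanto shard, and reading one document per line is an assumption about how the statmt dump is stored:

```python
# Minimal sketch: count subword tokens in a CC-100 shard with the XLM-R tokenizer.
# "eo.txt" is a hypothetical path; adjust to the actual downloaded shard.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

total_tokens = 0
with open("eo.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        # add_special_tokens=False so <s>/</s> are not counted for every line
        total_tokens += len(tokenizer.encode(line, add_special_tokens=False))

print(f"total subword tokens: {total_tokens:,}")
```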

I would guess that the content of the corpora is similar (knowing that CC-100 is built by web scraping), since the files have similar sizes (roughly 0.9 GiB), but what makes the token counts so different?