facebookresearch / cc_net

Tools to download and cleanup Common Crawl data
MIT License

support of Hausa #9

Closed donglixp closed 4 years ago

donglixp commented 4 years ago

Thanks for your contribution to the community. I am wondering whether CCNet covers the Hausa language (ISO codes: ha/hau). Table 6 of the XLM-R paper lists Hausa as included in CCNet, but I couldn't find the Hausa language code in the dumped files or in the fastText LID documentation.

gwenzek commented 4 years ago

Hi, for the XLM-R paper they used a Facebook-internal LID model, which supports Hausa. That LID model isn't open source, and I'm not aware of any plan to open-source it. But I'm currently working on releasing the data used in the XLM-R paper, which will contain the Hausa corpus.

gwenzek commented 3 years ago

Hi Li, I've added dl_cc_100.py, which will include the Hausa data. If you don't want to run this yourself, you can look at http://data.statmt.org/cc-100/, which was released recently, but I'm not 100% sure that you'll get the same results.
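
For readers who only need the released text, a minimal sketch of reading one CC-100 language file might look like the following. It is not taken from cc_net or dl_cc_100.py. It assumes the files follow a `<lang>.txt.xz` naming scheme under the URL above (so Hausa would be `ha.txt.xz`) and that documents are separated by blank lines, with one paragraph per line; `cc100_url`, `iter_documents`, and `read_xz_documents` are hypothetical helper names.

```python
import lzma
from itertools import islice

# Assumption: CC-100 files are named "<lang>.txt.xz" under this base URL.
CC100_BASE_URL = "http://data.statmt.org/cc-100/"


def cc100_url(lang: str) -> str:
    """Build the (assumed) download URL for one language code, e.g. "ha"."""
    return f"{CC100_BASE_URL}{lang}.txt.xz"


def iter_documents(lines):
    """Group lines into documents; a blank line ends the current document."""
    doc = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            doc.append(line)  # one paragraph per non-empty line
        elif doc:
            yield doc
            doc = []
    if doc:  # last document may lack a trailing blank line
        yield doc


def read_xz_documents(source, limit=None):
    """Stream documents from an .xz file (path or binary file object).

    `limit` caps the number of documents returned; None reads everything.
    """
    with lzma.open(source, mode="rt", encoding="utf-8") as text:
        yield from islice(iter_documents(text), limit)
```

After downloading the file (for example with `urllib.request.urlretrieve(cc100_url("ha"), "ha.txt.xz")`), `read_xz_documents("ha.txt.xz", limit=5)` would yield the first five documents as lists of paragraphs without decompressing the whole file into memory.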

donglixp commented 3 years ago

@gwenzek Thanks for the open-source effort! The corpus is very useful to the community.

Hadjerkhd commented 3 years ago

@gwenzek, is the data shared at http://data.statmt.org/cc-100/ the same data that was used to pre-train XLM-R? When I compare the sizes per language on that web page with those in the XLM-R paper, the values are not the same. Thanks.