Closed: donglixp closed this issue 4 years ago
Hi, in the XLM-R paper they used a Facebook-internal LID model, which supports Hausa. That LID model isn't open source, and I'm not aware of any plans to open-source it. But I'm currently working on releasing the data used in the XLM-R paper, which will contain the Hausa corpus.
Hi Li, I've added dl_cc_100.py, which will download the data, including Hausa. If you don't want to run this yourself, you can look at http://data.statmt.org/cc-100/, which was released recently, but I'm not 100% sure that you'll get the same results.
@gwenzek Thanks for the open-source effort! The corpus is very useful to the community.
@gwenzek, is the data shared at http://data.statmt.org/cc-100/ the same data that was used to pre-train XLM-R? I ask because when I compare the document sizes per language on this web page with those in the XLM-R paper, the values are not the same.
Thanks
Thanks for your contribution to the community. I am wondering whether ccnet contains the Hausa language (ISO id: ha/hau)? In the XLM-R paper, Table 6 mentions that Hausa was included in CCNet. However, I didn't find the language code of Hausa in the dumped files or in the fastText LID documentation.
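For anyone wanting to repeat this check locally, here is a minimal sketch. It assumes only that fastText LID models name their labels `__label__<code>`. The helper names and the sample label list are hypothetical, made up for illustration, and are not the real model's output. With the `fasttext` Python package and a downloaded LID model, the list would come from `model.get_labels()` instead.

```python
# Sketch: check whether a language code appears among fastText-style labels.
# The sample label list below is hypothetical, NOT the real lid.176 output.
LABEL_PREFIX = "__label__"


def supported_codes(labels):
    """Strip the fastText label prefix and return the set of language codes."""
    return {
        label[len(LABEL_PREFIX):]
        for label in labels
        if label.startswith(LABEL_PREFIX)
    }


def has_language(labels, codes):
    """Return True if any of the candidate codes is among the labels."""
    return bool(supported_codes(labels) & set(codes))


# Hypothetical sample; replace with model.get_labels() from a real LID model.
sample_labels = ["__label__en", "__label__fr", "__label__sw"]

# Hausa may be listed under either the two-letter or three-letter ISO code.
print(has_language(sample_labels, ["ha", "hau"]))
```

The same helper can be pointed at the language codes extracted from the dumped file names, since both checks reduce to a set lookup.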