dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0
2.55k stars 538 forks

[nlp_data] Add CC-100 #1419

Open sxjscience opened 3 years ago

sxjscience commented 3 years ago

Description

Add the CC-100 corpus, which can be used for pretraining, to nlp_data.

http://data.statmt.org/cc-100/
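
For anyone picking this up, here is a rough sketch of what reading the corpus might look like. It is not the nlp_data implementation. It assumes that each language is published under the URL above as a single `<lang>.txt.xz` file, and that documents inside a file are separated by blank lines with one paragraph per line; both should be checked against the CC-100 page. The names `cc100_url`, `iter_documents`, and `read_cc100` are hypothetical helpers, not existing gluon-nlp functions.

```python
import lzma

# Assumption: per-language files follow the <lang>.txt.xz pattern under this base URL.
CC100_BASE_URL = "http://data.statmt.org/cc-100/"


def cc100_url(lang):
    """Build the (assumed) download URL for one language code, e.g. 'en'."""
    return f"{CC100_BASE_URL}{lang}.txt.xz"


def iter_documents(lines):
    """Group lines into documents, treating blank lines as document separators.

    Each yielded document is a list of paragraph strings.
    """
    paragraphs = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            paragraphs.append(line)
        elif paragraphs:
            # A blank line closes the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # Flush the last document if the file does not end with a blank line.
        yield paragraphs


def read_cc100(path):
    """Stream documents from a local .txt.xz file without unpacking it to disk."""
    with lzma.open(path, mode="rt", encoding="utf-8") as f:
        yield from iter_documents(f)
```

Streaming through `lzma.open` avoids decompressing the whole file first, which should help with the larger languages. The download step itself is left out here on purpose, since it would presumably go through whatever helper the existing dataset scripts already use.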

sxjscience commented 3 years ago

Marked this as a good first issue because we have documentation on how to add a new dataset at https://github.com/dmlc/gluon-nlp/tree/master/scripts/datasets

wlehner commented 2 years ago

I'd like to work on this.