[nlp_data] Add CC-100

dmlc / gluon-nlp

NLP made easy

https://nlp.gluon.ai/

Apache License 2.0

2.55k stars, 538 forks

Open sxjscience opened 3 years ago

sxjscience commented 3 years ago

Add the CC-100 corpus, which can be used for pretraining, to nlp_data.

sxjscience commented 3 years ago

Marked this as a good first issue because the documentation on how to add a new dataset is available at https://github.com/dmlc/gluon-nlp/tree/master/scripts/datasets

wlehner commented 2 years ago

I'd like to work on this.