Open sxjscience opened 3 years ago
Add the CC-100 corpus, which can be used for pretraining, to nlp_data.
http://data.statmt.org/cc-100/
Marked this as a good first issue because the documentation on how to add a new dataset is already available at https://github.com/dmlc/gluon-nlp/tree/master/scripts/datasets
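For anyone picking this up, here is a rough sketch of what reading one CC-100 file could look like. It is not the nlp_data API. The helper names (`cc100_url`, `iter_documents`, `iter_documents_from_xz`) are made up for illustration, and it assumes that each language is published as `<lang>.txt.xz` under the URL above and that documents inside a file are separated by a blank line. Please check both assumptions against the CC-100 page before relying on them.

```python
import io
import lzma
from typing import BinaryIO, Iterator, TextIO, Union

# Base URL taken from the issue description.
CC100_BASE_URL = "http://data.statmt.org/cc-100/"


def cc100_url(lang: str) -> str:
    """Hypothetical helper: assumes files are named '<lang>.txt.xz'."""
    return f"{CC100_BASE_URL}{lang}.txt.xz"


def iter_documents(stream: TextIO) -> Iterator[str]:
    """Yield one document at a time from a text stream.

    Assumption: a blank line separates documents, and each non-blank
    line is one paragraph of the current document.
    """
    paragraphs = []
    for line in stream:
        line = line.rstrip("\n")
        if line:
            paragraphs.append(line)
        elif paragraphs:
            # Blank line closes the current document.
            yield "\n".join(paragraphs)
            paragraphs = []
    if paragraphs:
        # Last document may not be followed by a blank line.
        yield "\n".join(paragraphs)


def iter_documents_from_xz(source: Union[str, BinaryIO]) -> Iterator[str]:
    """Stream documents from an .xz file (path or binary file object)
    without decompressing the whole file into memory."""
    with lzma.open(source, "rt", encoding="utf-8") as f:
        yield from iter_documents(f)
```

Streaming matters here because the per-language files can be large. After downloading a file from `cc100_url("en")`, something like `for doc in iter_documents_from_xz("en.txt.xz"): ...` would process it one document at a time.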
I'd like to work on this.