dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0
2.56k stars, 538 forks

[Numpy Refactor] Dataset Enhancement TODO #1240

Open sxjscience opened 4 years ago

sxjscience commented 4 years ago

The PR for the new version of GluonNLP is now up: https://github.com/dmlc/gluon-nlp/pull/1225. It refactors the major APIs and relies on the DeepNumpy interface in MXNet.

As part of this, we refactored how users download and prepare the common NLP datasets. Previously, users relied on Python: they created an XXXDataset object and accessed the data through it.

Now we have switched to the new nlp_data and nlp_preprocess CLI commands for downloading and preparing datasets. For example:

# Prepare SQuAD
nlp_data prepare_squad --version 2.0
# Prepare WMT
nlp_data prepare_wmt --dataset wmt2014 --lang-pair en-de --save-path wmt2014_en_de
# Download Wikipedia
nlp_data prepare_wikipedia --mode download --lang en --date latest -o ./
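
If you want to drive these steps from a Python script instead of typing them by hand, something like the following might work. This is only a rough sketch: it assumes the nlp_data executable ends up on your PATH after installing the new GluonNLP, and the names COMMANDS, to_argv and run_all are made up for this example (they are not part of GluonNLP). It simply shells out to the same three commands listed above.

```python
import shlex
import shutil
import subprocess

# The exact command lines from this issue. Nothing here is a GluonNLP Python API;
# COMMANDS, to_argv and run_all are hypothetical helper names for illustration.
COMMANDS = [
    "nlp_data prepare_squad --version 2.0",
    "nlp_data prepare_wmt --dataset wmt2014 --lang-pair en-de --save-path wmt2014_en_de",
    "nlp_data prepare_wikipedia --mode download --lang en --date latest -o ./",
]


def to_argv(command):
    """Split a command line into an argv list suitable for subprocess.run."""
    return shlex.split(command)


def run_all(commands=COMMANDS, dry_run=True):
    """Return the argv lists; when dry_run is False, also run them in order."""
    argvs = [to_argv(c) for c in commands]
    if dry_run:
        return argvs
    # Assumption: the CLI is installed as an executable named nlp_data.
    if shutil.which("nlp_data") is None:
        raise RuntimeError("nlp_data not found on PATH")
    for argv in argvs:
        # check=True stops at the first step that exits with a non-zero status.
        subprocess.run(argv, check=True)
    return argvs
```

Calling run_all() only returns the parsed commands, which is handy for checking what would be executed; run_all(dry_run=False) actually runs them and stops at the first failure.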

We can enhance the dataset support by adding:

In addition, we will consider moving some of the datasets to our internal S3 bucket, which would offer faster downloads (if the licenses allow).