[nlp_data] Add BookCorpus

dmlc / gluon-nlp

NLP made easy

https://nlp.gluon.ai/

Apache License 2.0


[nlp_data] Add BookCorpus #1406

Open sxjscience opened 3 years ago

sxjscience commented 3 years ago

Description

BookCorpus now has a reliable, stable download link: https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz. There are also more links in https://the-eye.eu/public/AI/pile_preliminary_components/ that are worth including in nlp_data. We could download from those links and provide the corresponding licenses.

szha commented 3 years ago

The data source, Smashwords, has terms of service that prohibit redistribution. Neither the links above nor https://github.com/soskek/bookcorpus/issues/27 mentions getting approval from Smashwords or from the authors. We should clarify the legal risks before proceeding.

shawwn commented 3 years ago

There is no legal risk linking to the dataset. All risk is being taken on by The Eye.

The sole reason not to merge it is that someone doesn't like the idea of using the dataset. Which is fine. But anyone who says there is risk is mistaken.

shawwn commented 3 years ago

(In other words, don't host the data yourself. Rely on the URL from The Eye. So, for example, all dataset preparation scripts should download from https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz, and books1.tar.gz itself should not be hosted anywhere else. By following this pattern, all risk is transferred to The Eye.)
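A minimal sketch of that pattern in Python, for illustration only. The helper name `fetch_corpus` and the cache directory are hypothetical and not part of nlp_data; the point is that the preparation script only streams the archive from the upstream URL and never re-hosts it.

```python
import shutil
import urllib.request
from pathlib import Path

# Upstream URL quoted in this thread; the archive is not mirrored anywhere else.
BOOKS1_URL = "https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz"


def fetch_corpus(url: str, cache_dir: str) -> Path:
    """Stream `url` into `cache_dir` and return the local path.

    Hypothetical helper: the file name is taken from the last URL segment,
    and the download is skipped if that file is already cached.
    """
    target = Path(cache_dir) / url.rsplit("/", 1)[-1]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        # Copy the response body to disk in chunks instead of loading it in memory.
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target
```

A preparation script would then call something like `fetch_corpus(BOOKS1_URL, "data")` and work on the returned path.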

szha commented 3 years ago

> There is no legal risk linking to the dataset

US law recognizes secondary infringement liability. One can be found liable for affirmatively encouraging or inducing known copyright violations.

shawwn commented 3 years ago

The datasets are hosted by The Eye, which fully respects DMCA: http://the-eye.eu/dmca

If anyone were to file a DMCA notice against books1 or books3, The Eye would extract the tarball, remove the infringing content, then re-upload the modified tarball.

There is no risk linking to The Eye.

leezu commented 3 years ago

> re-upload the modified tarball

In GluonNLP we store a hash of the tarball in source to ensure reproducibility. Linking to a source that will periodically change the contents of the file may not be optimal.
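A rough sketch of such a check, to show why a re-uploaded tarball is a problem. The SHA-256 choice and the helper names `sha256_of_file` and `verify_download` are assumptions for illustration, not GluonNLP's actual code. Any change to the archive, however small, changes the digest and makes the pinned value in source fail.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_hex: str) -> bool:
    """Return True if the file at `path` matches the digest pinned in source."""
    return sha256_of_file(path) == expected_hex
```

If the upstream file is modified after the digest is recorded, `verify_download` returns False for every fresh download until the pinned value is updated.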

sxjscience commented 3 years ago

We could add it first and figure out later whether we can host a snapshot of BookCorpus ourselves. What do you think?

shawwn commented 3 years ago

Happy to announce that bookcorpus was just merged into huggingface's Datasets library as bookcorpusnew, thanks to @vblagoje: https://github.com/huggingface/datasets/pull/856

So, huggingface is officially supporting this dataset now. The Eye also seems to be a trustworthy steward; I mentioned that "the tarball might change due to DMCA" as more of a theoretical concern than a practical reality. I doubt this tarball is going to change.

sxjscience commented 3 years ago

@shawwn Really appreciate the information! I've tried out huggingface/datasets and found it quite good. In fact, we can add the dataset even if the tarball changes. That's the same strategy we used for the Wikipedia corpus: https://github.com/dmlc/gluon-nlp/blob/master/scripts/datasets/pretrain_corpus/prepare_wikipedia.py. Part of the purpose of nlp_data is to help users download and prepare large pretraining corpora for trying out NLP pretraining.