bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create license-compliant version of the Pile #65

Open albertvillanova opened 2 years ago

albertvillanova commented 2 years ago

As discussed with @StellaAthena, this would be useful to constitute the English-language component of the dataset, possibly augmented by the Spotify transcripts dataset.

The creation of this dataset might be decomposed into smaller subsets, as reported by @StellaAthena (see https://github.com/bigscience-workshop/data_tooling/issues/65#issuecomment-971138275):

StellaAthena commented 2 years ago

Under the assumption that we are targeting CC-BY-SA-licensed content (which can be disputed… is there a plan yet for what the licensing of the whole dataset should be?), here's how the current Pile breaks down:

Components of the Pile and their licensing

  1. Pile CC: Unclear, see below
  2. PubMed Central: this was downloaded in a license-compliant fashion
  3. Books3: Excluded
  4. OpenWebText2: Unclear, see below
  5. arXiv: needs to be redownloaded and filtered by license (see the filtering sketch after this list)
  6. GitHub: to be replaced by a license-compliant code dataset compiled by Google
  7. FreeLaw: Good as-is, I have acquired permission to use this from the org that owns the data
  8. StackExchange: Good as-is
  9. US PTO: Good as-is
  10. PubMed: Good as-is
  11. Project Gutenberg: Good as-is
  12. OpenSubtitles: Excluded. Although their website claims to be license-compliant, this is an obvious lie: they even posted the script of Wonder Woman before the movie debuted. There's no way in hell they had Disney's permission to do that.
  13. Wikipedia (en): Good as-is
  14. DM Mathematics: Good as-is
  15. Ubuntu IRC: Good as-is
  16. BookCorpus2: Excluded
  17. EuroParl: Good as-is
  18. HackerNews: Good as-is
  19. YouTube Subtitles: Excluded
  20. PhilPapers: I need to double-check, but this is either good as-is or needs to be redownloaded and filtered by license
  21. NIH ExPorter: Good as-is
  22. Enron Emails: Good as-is
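
As a rough illustration of the per-document license filtering that items 5 and 20 would need, here is a minimal sketch. It assumes each record carries a `license` metadata field; the field names and the allowlist below are hypothetical, not the project's actual policy.

```python
# Minimal sketch: keep only documents whose declared license is in an allowlist.
# The "text"/"license" field names and the allowlist are assumptions for illustration.
from typing import Dict, Iterable, Iterator

ALLOWED_LICENSES = {
    "CC-BY-4.0",
    "CC-BY-SA-4.0",
    "CC0-1.0",
    "Public Domain",
}

def filter_by_license(records: Iterable[Dict]) -> Iterator[Dict]:
    """Yield only records whose license is explicitly in the allowlist.

    Records with missing or unknown licenses are dropped rather than kept,
    which is the conservative choice for a license-compliant corpus.
    """
    for record in records:
        if record.get("license") in ALLOWED_LICENSES:
            yield record

# Example usage with an in-memory list of records:
sample = [
    {"text": "An arXiv abstract...", "license": "CC-BY-4.0"},
    {"text": "A paper with no stated license...", "license": None},
]
print(list(filter_by_license(sample)))  # keeps only the CC-BY-4.0 record
```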

A note on licensing and scraping: Pile-CC and OpenWebText2 pose challenges for legal and ethical compliance. The widespread attitude among organizations seems to be that Common Crawl is "its own thing" as a dataset and that ToS compliance only requires compliance with Common Crawl's ToS. I think this is highly dubious ethically, but the same policy would reasonably extend to OWT2. In reality, I strongly suspect that the real reason for this attitude is that it is convenient rather than sensible.

Updating the Pile

Excluding these data sources removes approximately one quarter of the current text of the Pile and massively decreases the proportion of books and subtitle-like text found in it. Consequently, I believe it would be a good idea to identify and add more data to compensate, preferably a lot of it. I need to do some math to figure out how many tokens the deduplicated and cut-down Pile contains, but I would like at least 300 billion tokens according to the GPT-2 tokenizer, and preferably more like 400B, so that we can be reasonably confident future tokenizers won't make the data fall under 300B tokens.
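
For reference, a rough way to get such token counts is sketched below, assuming the corpus is stored as JSONL shards with a `text` field (the directory layout and field name are hypothetical) and using the `transformers` GPT-2 tokenizer.

```python
# Minimal sketch: estimate the GPT-2 token count of a corpus of JSONL shards.
# The directory layout and "text" field name are assumptions for illustration.
import json
from pathlib import Path

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(jsonl_dir: str) -> int:
    """Sum GPT-2 token counts over every document in a directory of JSONL shards."""
    total = 0
    for shard in Path(jsonl_dir).glob("*.jsonl"):
        with shard.open(encoding="utf-8") as f:
            for line in f:
                text = json.loads(line)["text"]
                total += len(tokenizer(text).input_ids)
    return total

if __name__ == "__main__":
    n_tokens = count_tokens("pile_subset/")  # hypothetical path
    print(f"{n_tokens / 1e9:.1f}B GPT-2 tokens")
```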

I think this is also a good opportunity to rectify two issues with the original Pile:

  1. Finding non-western dialects of English
  2. Duplication

The Pile is quite biased towards American and UK English dialects. We sought out sources in Indian English, African American Vernacular English, and several African English dialects but failed to find significant sources of text. It would be excellent if we could identify sources of text in those dialects.

When the Pile came out, the prevailing opinion was that upsampling high-information subsets is a good way to improve LM performance. Subsequent research has shown this to be empirically false, so I highly recommend that we not only avoid upsampling but also apply the 13-gram deduplication technique that has become popular.
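
A minimal sketch of the kind of 13-gram overlap check involved is below; whitespace tokenization, exact matching, and the greedy keep/drop policy are simplifying assumptions, and production pipelines typically hash n-grams and work at much larger scale.

```python
# Minimal sketch: drop a document if it shares any 13-gram with previously kept documents.
# Whitespace tokenization and exact in-memory matching are simplifying assumptions.
from typing import Iterable, Iterator, Set

NGRAM = 13

def ngrams(text: str, n: int = NGRAM) -> Set[str]:
    """Return the set of word n-grams in a document (whitespace-tokenized)."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def dedup(docs: Iterable[str]) -> Iterator[str]:
    """Greedy n-gram deduplication: keep a document only if none of its
    13-grams has appeared in an earlier kept document."""
    seen: Set[str] = set()
    for doc in docs:
        grams = ngrams(doc)
        if grams & seen:
            continue  # overlaps an earlier document; drop it
        seen |= grams
        yield doc
```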

Potential Sources of Additional Data:

38 GB of SEC data here.

Apparently the English-language portion of Project Gutenberg is supposed to be 4-6x the size of what we included in the Pile; see here.

Scraping new content from various websites since the original release.

Many of the Pile components come from governmental sources. Can we find English-language governmental sources in African or Asian dialects of English? Presumably the Kenyan government produces a lot of text, but I do not know where to find it.

albertvillanova commented 2 years ago

I'm working on this: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_the_pile
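
If that dataset is (or becomes) publicly accessible on the Hub, it should be loadable with the `datasets` library along the lines of the sketch below; this is an assumption, since the repository may be gated or require authentication, and the split name may differ.

```python
# Minimal sketch: stream the work-in-progress dataset from the Hugging Face Hub.
# Assumes the repository is publicly accessible and has a "train" split.
from datasets import load_dataset

ds = load_dataset(
    "bigscience-catalogue-lm-data/lm_en_the_pile",
    split="train",
    streaming=True,  # stream to avoid downloading the full corpus
)

for example in ds.take(3):
    print(example)
```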