albertvillanova opened this issue 2 years ago
Under the assumption that we are targeting CC-BY-SA-licensed content (which can be disputed… is there a plan yet for what the licensing for the whole dataset should be?), here's how the current Pile breaks down:
Components of the Pile and their licensing
A note on licensing and scraping: Pile-CC and OpenWebText2 pose challenges for legal and ethical compliance. The widespread attitude among organizations seems to be that Common Crawl is "its own thing" as a dataset and that ToS compliance only requires compliance with Common Crawl's ToS. I think that this is highly dubious ethically, but the same policy would reasonably extend to OWT2. In reality, I strongly suspect that the real reason for this attitude is that it is convenient rather than sensible.
Updating the Pile
Excluding these data sources removes approximately one quarter of the current text of the Pile and massively decreases the proportion of books and subtitle-like text found in it. Consequently, I believe it would be a good idea to identify and add more data to compensate, preferably a lot of it. I need to do some math to figure out how many tokens the deduplicated and cut-down Pile contains, but I would like at least 300 billion tokens according to the GPT-2 tokenizer, and preferably more like 400B, so that we can be reasonably confident future tokenizers won't push the data below 300B tokens.
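For concreteness, here is a minimal sketch of how such a token count could be estimated with the GPT-2 tokenizer from the Hugging Face `transformers` library (the sample documents are placeholders; in practice you would stream the actual corpus and likely extrapolate from a sample):

```python
# Rough GPT-2 token count over an iterable of documents.
# The sample documents below are placeholders, not real Pile data.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(texts):
    """Return the total number of GPT-2 tokens across an iterable of documents."""
    total = 0
    for text in texts:
        total += len(tokenizer(text)["input_ids"])
    return total

sample_docs = ["This is a sample document.", "Another short document."]
print(count_tokens(sample_docs))
```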
I think that this is also a good opportunity to rectify two issues with the original Pile:
The Pile is quite biased towards American and UK English dialects. We sought out sources in Indian English, African American Vernacular English, and several African English dialects but failed to find significant sources of text. It would be excellent if we could identify sources of text in those dialects.
When the Pile came out, the prevailing opinion was that upsampling high-information subsets is a good way to improve LM performance. Subsequent research has shown this to be empirically false, so I highly recommend that we not only avoid upsampling but also apply the 13-gram deduplication technique that has become popular.
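As an illustration of the kind of filtering this implies, here is a simplified, in-memory sketch of 13-gram overlap deduplication. The whitespace tokenization and the overlap threshold are placeholder choices, and the implementations used at scale rely on more efficient structures (e.g. suffix arrays or MinHash) rather than an exact in-memory set:

```python
# Simplified illustration: drop a document if it shares too many 13-grams
# with documents already kept. Threshold and tokenization are placeholders.
from typing import Iterable, List, Set, Tuple

N = 13  # n-gram size

def ngrams(tokens: List[str], n: int = N) -> Set[Tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def deduplicate(docs: Iterable[str], overlap_threshold: float = 0.5) -> List[str]:
    seen: Set[Tuple[str, ...]] = set()
    kept: List[str] = []
    for doc in docs:
        grams = ngrams(doc.split())
        if not grams:  # shorter than 13 tokens: keep as-is
            kept.append(doc)
            continue
        overlap = len(grams & seen) / len(grams)
        if overlap < overlap_threshold:
            kept.append(doc)
            seen.update(grams)
    return kept
```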
Potential Sources of Additional Data:
38 GB of SEC data here.
Apparently the English-language portion of Project Gutenberg is supposed to be 4-6x the size of what we included in the Pile; see here.
Scraping new content from various websites since the original release.
Many of the Pile components come from governmental sources. Can we find governmental sources written in African or Asian dialects of English? Presumably the Kenyan government produces a lot of text, but I do not know where to find it.
I'm working on this: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_the_pile
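For reference, a minimal sketch of inspecting that dataset with the `datasets` library, assuming the repository is public on the Hub and exposes a `train` split:

```python
# Streaming avoids downloading the full corpus up front.
# The split name is an assumption about the repository layout.
from datasets import load_dataset

ds = load_dataset(
    "bigscience-catalogue-lm-data/lm_en_the_pile",
    split="train",
    streaming=True,
)
for example in ds.take(3):
    print(example)
```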
As discussed with @StellaAthena, this would be useful as the English-language component of the dataset, possibly augmented by the Spotify transcripts dataset.
The creation of this dataset might be decomposed into smaller subsets, as reported by @StellaAthena (see https://github.com/bigscience-workshop/data_tooling/issues/65#issuecomment-971138275):