EleutherAI / the-pile

MIT License

Separate functions for downloading pre-processed datasets and downloading & processing #40

Closed sdtblck closed 3 years ago

sdtblck commented 3 years ago

I think it would be a good idea, where possible, to have separate functions in the Pile for downloading a pre-processed version of each dataset from a hosted copy (i.e., a single wget) and for running the entire replication pipeline.

The Stack Exchange dataset, for example, requires a large amount of storage for processing and is pretty slow to build. It's good that people are able to recreate the pipeline where possible (and the Stack Exchange data on archive.org is being updated fairly regularly, so it will grow over time), but in general it would be better to host the processed data somewhere.

There are also some functions that only provide a download, with no replication steps. I think these two things should be separated where possible.

So we should first try a hosted dataset, then fall back to the full download and pre-processing if, say, we can no longer host it.
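A rough sketch of what that split could look like, to make the idea concrete. Everything here is hypothetical: `DatasetSource`, `download_hosted`, `replicate`, and `fetch` are illustrative names, not existing functions in the-pile, and the two callables stand in for whatever a given dataset actually needs (a single wget of a processed archive, or a raw download plus processing).

```python
from typing import Callable, Optional


class DatasetSource:
    """Hypothetical wrapper that keeps the two acquisition paths separate."""

    def __init__(
        self,
        name: str,
        download_hosted: Optional[Callable[[], str]] = None,
        replicate: Optional[Callable[[], str]] = None,
    ):
        self.name = name
        # Fetch an already-processed copy (e.g. one wget); returns a local path.
        self.download_hosted = download_hosted
        # Download the raw data and run the full processing; returns a local path.
        self.replicate = replicate

    def fetch(self) -> str:
        """Try the hosted copy first, then fall back to full replication."""
        errors = []
        steps = (("hosted", self.download_hosted), ("replicate", self.replicate))
        for label, step in steps:
            if step is None:
                continue  # this dataset does not provide that path
            try:
                return step()
            except Exception as exc:  # broad on purpose: this is only a sketch
                errors.append(f"{label}: {exc}")
        detail = "; ".join(errors) if errors else "no source defined"
        raise RuntimeError(f"could not obtain {self.name}: {detail}")
```

With this shape, a dataset that only has a hosted copy simply leaves `replicate` unset, and anyone who wants to rerun the pipeline can call `replicate()` directly instead of going through `fetch()`.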

StellaAthena commented 3 years ago

I think that providing the replication code and a separate way to directly download the processed datasets is a great idea. I would go as far as to say that it's necessary for usability.