Separate functions for downloading pre-processed datasets and downloading & processing

I think it would be a good idea, where possible, to have separate functions in the Pile for (a) downloading a pre-processed version of the dataset from a hosted copy (i.e. a single wget) and (b) running the entire replication pipeline from the raw source.

The Stack Exchange dataset, for example, requires a large amount of storage for processing and is pretty slow to build. It's good that people can recreate the pipeline (and the Stack Exchange data on archive.org is updated fairly regularly, so it will grow over time), but in general it would be better to host the processed data somewhere.

There are also some functions that only provide a download, with no replication steps. I think the two paths (download and replication) should be separated where possible.

So we should first try a hosted dataset, then fall back to the full download and pre-processing if, say, we can no longer host it.
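A rough sketch of what that split could look like, just to make the proposal concrete. None of these names exist in the repo: `download_preprocessed`, `fetch_dataset`, and the `hosted` / `replicate` callables are hypothetical, and the real dataset classes may be structured quite differently.

```python
import urllib.request
from pathlib import Path
from typing import Callable


def download_preprocessed(url: str, dest: Path) -> Path:
    """Hypothetical 'single wget' path: fetch a ready-made file from a hosted copy."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest


def fetch_dataset(
    dest: Path,
    hosted: Callable[[Path], Path],
    replicate: Callable[[Path], Path],
) -> Path:
    """Try the hosted copy first; fall back to full replication if that fails.

    `hosted` and `replicate` each take the destination path and return the
    path of the finished dataset. Both are placeholders for per-dataset code.
    """
    if dest.exists():
        # Already on disk, nothing to do.
        return dest
    try:
        return hosted(dest)
    except OSError as err:
        # urllib.error.URLError (and HTTPError) are subclasses of OSError,
        # so a dead or missing host ends up here.
        print(f"hosted download failed ({err}); running full replication")
        return replicate(dest)
```

Each dataset would then supply its own two functions, e.g. `hosted=lambda p: download_preprocessed(SOME_URL, p)` where `SOME_URL` is wherever the processed file ends up being hosted, and `replicate` wraps the existing download-and-process code. Keeping them as separate callables also means a dataset with no replication steps (or no hosted copy) is easy to spot.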

EleutherAI / the-pile

Separate functions for downloading pre-processed datasets and downloading & processing #40