EleutherAI / the-pile


Reducing download size #106

Open marionbartl opened 1 year ago

marionbartl commented 1 year ago

Hi! I would like to create a subset of the Pile that is ~5 GB in size. The final subset should follow the original distribution of datasets, and the documents included should be randomly sampled from each dataset.
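To make the requirement concrete, here is a minimal sketch of the kind of sampling I have in mind. It is not taken from this repo. It assumes each document is a dict with a "text" field and a "meta" -> "pile_set_name" field, and the names sample_subset and bytes_per_dataset are hypothetical helpers. Keeping every document with the same probability preserves the dataset proportions in expectation, but it still requires reading the full data, so it does not by itself reduce the download size.

```python
import random
from collections import defaultdict


def sample_subset(docs, keep_fraction, seed=0):
    """Keep each document independently with probability keep_fraction.

    Because every document has the same chance of being kept, the share of
    each dataset in the result matches the original in expectation.
    """
    rng = random.Random(seed)
    return [doc for doc in docs if rng.random() < keep_fraction]


def bytes_per_dataset(docs):
    """Sum the UTF-8 size of the text per dataset, to compare distributions."""
    sizes = defaultdict(int)
    for doc in docs:
        # Assumed document layout: {"text": ..., "meta": {"pile_set_name": ...}}
        sizes[doc["meta"]["pile_set_name"]] += len(doc["text"].encode("utf-8"))
    return dict(sizes)


if __name__ == "__main__":
    # Tiny made-up corpus, for illustration only.
    docs = [
        {"text": "a" * 100, "meta": {"pile_set_name": "SetA"}} for _ in range(300)
    ] + [
        {"text": "b" * 100, "meta": {"pile_set_name": "SetB"}} for _ in range(100)
    ]
    subset = sample_subset(docs, keep_fraction=0.1, seed=42)
    print(bytes_per_dataset(docs))
    print(bytes_per_dataset(subset))
```

For a ~5 GB target, keep_fraction would be the target size divided by the total size of the data being sampled from.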

I tried using the --limit, --read_amount, and --make_dataset_samples parameters to reduce the download size, but when I run the script, each dataset is still downloaded at its original size.

I would greatly appreciate it if you could tell me whether what I'm looking for is achievable with this repo and what the command for that would be.

Thanks!