EleutherAI / the-pile

MIT License
1.46k stars 126 forks

Small Flag #20

Closed anishthite closed 3 years ago

anishthite commented 3 years ago

To assist with data exploration and testing (both ours and other people's), we should add a “small” flag that causes it to download only a small amount of data (size TBD... 10M per data source?). --Stella

StellaAthena commented 3 years ago

Thanks for writing down this suggestion! @leogao2, you’ve had the most experience processing these data sets. What do you think is a good size chunk that someone can use to ensure the data is being downloaded and processed correctly?

leogao2 commented 3 years ago

Most of these datasets are provided as big tarballs so we'd have to host the small ones separately which would add a lot of complexity.

StellaAthena commented 3 years ago

Hmmm. I hadn’t thought about that. And unzipping a tarball is an all-or-nothing thing, right?

anishthite commented 3 years ago

You can partially extract a tarball by pulling out only the files you want (https://unix.stackexchange.com/questions/42198/untar-only-a-certain-number-of-files-from-a-large-tarball/42199).
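
For illustration, here is a minimal Python sketch of the same idea using the standard-library `tarfile` module. It is not code from this repository: the function name `extract_first_members` and the `limit` parameter are hypothetical, and it assumes it is acceptable to flatten each member to its base name in the output directory. The archive is read as a stream, so reading stops once enough files have been written.

```python
import os
import shutil
import tarfile


def extract_first_members(tar_path, dest_dir, limit=10):
    """Write at most `limit` regular files from a tarball into dest_dir.

    Hypothetical helper for illustration only. Members are flattened to
    their base names, so files with the same base name overwrite each other.
    """
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    # "r|*" opens the archive as a sequential stream with transparent
    # compression detection; members are visited in archive order.
    with tarfile.open(tar_path, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and special files
            target = os.path.join(dest_dir, os.path.basename(member.name))
            source = archive.extractfile(member)
            with open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            extracted.append(target)
            if len(extracted) >= limit:
                break  # stop reading; the rest of the archive is untouched
    return extracted
```

Something like `extract_first_members("dataset.tar.gz", "sample/", limit=100)` would then give a small sample to test processing on, although the tarball itself still has to be available locally (or streamed) first.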

StellaAthena commented 3 years ago

I think that adding that functionality would be highly useful then.

StellaAthena commented 3 years ago

@leogao2 @researcher2 I believe the branch fix_size is the implementation of this functionality. Is that correct? If so, is it finished? What more needs to be done?

researcher2 commented 3 years ago

That branch was originally for making some changes to the output of size() on some datasets and is unrelated to this. I then used it for testing on Hetzner. I will delete it from the repository to avoid further confusion.

StellaAthena commented 3 years ago

Very well. How about this issue? Has a fix been merged or is it still outstanding?