EleutherAI / the-pile


Issue reproducing the GitHub partition #118

Open · osainz59 opened this issue 11 months ago

osainz59 commented 11 months ago

Hi there,

I followed the instructions in the GitHub downloader repository and ran the download_repo_text.py script.

I obtained a total of 27,819,203 documents, only about half of the count reported here: https://github.com/EleutherAI/the-pile/blob/df97f8651ae3da658b19659b3ceaa6a34b0fc014/the_pile/datasets.py#L704

I fixed up and added some metadata I need for my analyses. In total the file is around 60 GB on disk. I have not run github_reduce.py yet, since the full dataset does not match what the authors reported.

Also, since the links to the GitHub partition are no longer available, I would like to know whether there is any way to obtain the original GitHub data included in The Pile (ideally with correct metadata).

Thank you

Zengyu-98 commented 11 months ago

Hello, does your document count refer to the total number of individual data items (i.e., each JSON line)? I ran into the same problem: our total number of data items is similar to yours.

Is your downloaded data around 800 GB? We downloaded The Pile but only got around ~400 GB. Could that be why it is only half?

osainz59 commented 11 months ago

Yes, by documents I mean each data item. I only have the preprocessed data from downloading the GitHub partition, so I do not know the numbers for the complete dataset.
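For what it's worth, here is a minimal sketch of how I count data items, assuming the shards have been decompressed to plain JSON Lines (the Pile shards are zstd-compressed jsonl, so you would decompress first, e.g. with `zstd -d`). The `count_documents` helper is purely illustrative and not part of the repo's scripts:

```python
import json

def count_documents(path):
    """Count data items (non-empty JSON lines) in a decompressed .jsonl file.

    Each line is parsed with json.loads so malformed records raise an
    error instead of being silently counted.
    """
    n = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # validate that the line is a JSON record
                n += 1
    return n
```

Summing this over all shards should give the per-partition document count to compare against the numbers in the_pile/datasets.py.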

Zengyu-98 commented 11 months ago

> Yes, by documents I refer to each data item. I only have the preprocessed data after downloading the GitHub partition, so I do not know the numbers for the complete dataset.

How large is your downloaded dataset in total? Mine is ~400 GB, which is also half of what the paper reports (800 GB). I guess that might be the reason.