osainz59 opened this issue 11 months ago
Hello, by "documents", do you mean the total number of individual data items (i.e., each JSON line)? I ran into the same problem: our total item count is similar to yours.

Also, is your downloaded data around 800 GB? We downloaded the Pile but only got ~400 GB, which is probably why we ended up with only half of it.
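For comparing numbers, this is a minimal sketch of how one might count "documents" (one JSON object per line) and the total on-disk size across downloaded `.jsonl` shards. The glob pattern and function name are assumptions for illustration, not part of the downloader repository:

```python
import glob
import json
import os

def count_documents(pattern="data/*.jsonl"):
    """Count JSON-lines records and total bytes across files matching
    the given glob pattern (pattern is a hypothetical default)."""
    n_docs = 0
    n_bytes = 0
    for path in glob.glob(pattern):
        n_bytes += os.path.getsize(path)
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():       # skip blank lines
                    json.loads(line)   # validate it is one JSON record
                    n_docs += 1
    return n_docs, n_bytes
```

This counts each JSON line as one document, which should match the definition discussed above if the partition is stored as JSON lines.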
Yes, by "documents" I refer to each data item. I only have the preprocessed data after downloading the GitHub partition, so I do not know the numbers for the complete dataset.
How large is your downloaded dataset in total? Mine is ~400 GB, which is also half of what the paper reports (800 GB). I suspect that might be the reason.
Hi there,
I followed the GitHub downloader repository and executed the download_repo_text.py script.
I obtained a total of 27,819,203 documents, just half of the documents reported here: https://github.com/EleutherAI/the-pile/blob/df97f8651ae3da658b19659b3ceaa6a34b0fc014/the_pile/datasets.py#L704
I fixed and added some metadata I need for my analyses. In total the file is around 60 GB on disk. I have not run github_reduce.py yet, since the full dataset does not match what the authors report.
Also, since the links to the GitHub partition are no longer available, I would like to know whether there is any way to obtain the original GitHub data that is in The Pile (ideally with correct metadata).
Thank you