EleutherAI / the-pile


URL Links #109

Open · akul-goyal opened 1 year ago

akul-goyal commented 1 year ago

Is it possible to get access to the URL links (or any other website information) from which the data was scraped to generate the Pile?

dboggs95 commented 1 year ago

@akul-goyal Read their paper, page 14. https://arxiv.org/abs/2101.00027 https://arxiv.org/pdf/2101.00027.pdf

If I understand correctly, due to the copyrighted nature of some of their datasets, they don't host direct links to all of them.

However, many of the links in the readme point to scripts that will download them. I have only used Project Gutenberg so far, but I assume that if you run pile.py with the --force-download flag it will download all 1.2 TB of data, minus the books3 dataset sourced from Bibliotik, which must be commented out of the code for the script to work.
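
For concreteness, something like this is what I mean (the entry-point path and flag name here are my assumptions based on the readme and the comment above; verify them against the current repo before running):

```python
import subprocess

# Assumed entry point and flag name; check the repo's readme first.
# With the Books3/Bibliotik entry commented out in the dataset list,
# this should fetch everything else (~1.2 TB per the estimate above).
subprocess.run(["python", "the_pile/pile.py", "--force-download"], check=True)
```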

akul-goyal commented 1 year ago

Hi @dboggs95, thanks for the response. I was interested in more fine-grained website information rather than links to the datasets themselves. For example, for the YouTube captions dataset, I am interested in the URL of the YouTube video from which each caption was collected. This GitHub repo currently contains scraping scripts for collecting data, but it does not specify the links that were actually used to create the Pile. Furthermore, even if those URLs do exist somewhere, it is not clear to me that a mapping exists from each URL to the scraped text.
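
For example, a quick way to check what metadata ships with each document (this assumes the usual jsonl layout of the released shards, where each line carries "text" and "meta" fields; the shard filename below is just a placeholder for your local copy):

```python
import json

# Path to a downloaded Pile shard (jsonl); adjust for your setup.
shard_path = "00.jsonl"

# Each line is one JSON document with "text" and "meta" fields. As far as
# I can tell, "meta" only records the subset name, e.g.
# {"pile_set_name": "YoutubeSubtitles"}, with no URL pointing back to the
# original video or page -- which is exactly the mapping I am asking about.
with open(shard_path, encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        print(doc["meta"])
        break
```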