EleutherAI / the-pile

MIT License
1.46k stars, 126 forks

RePEc #47

Closed by cfoster0 3 years ago

cfoster0 commented 3 years ago

Language: predominantly English
Date ranges: 1997-2020
Size: claims 2M downloadable articles, 800K working papers, 26K books, and 59K chapters

Research Papers in Economics (RePEc) is a collaborative effort of hundreds of volunteers in many countries to enhance the dissemination of research in economics. The heart of the project is a decentralized database of working papers, preprints, journal articles, and software components.

We would extract only the text components. From what I've seen, the documents are PDFs.

http://www.repec.org/

cfoster0 commented 3 years ago

For scraping, we can traverse the following open directories that index the various content forms:

The bottom-level links point to pages in the RePEc Ideas database. The download link on each page (if it exists) seems to be the value of an input button tagged as "url".
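A rough sketch of that extraction step, using only the Python standard library. It assumes "tagged as url" means an `<input>` element whose `name` (or `id`) attribute is `url`. That is a guess about the page markup, not something verified against the live site. The names `UrlInputParser`, `find_download_url`, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class UrlInputParser(HTMLParser):
    """Records the value of the first <input> whose name or id is "url"."""

    def __init__(self) -> None:
        super().__init__()
        self.download_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only look at <input> elements, and stop after the first match.
        if tag != "input" or self.download_url is not None:
            return
        attr_map = dict(attrs)
        # Assumption: "tagged as url" means name="url" or id="url".
        if attr_map.get("name") == "url" or attr_map.get("id") == "url":
            self.download_url = attr_map.get("value")


def find_download_url(page_html: str) -> Optional[str]:
    """Return the download link from a page's HTML, or None if absent."""
    parser = UrlInputParser()
    parser.feed(page_html)
    return parser.download_url


# Hypothetical page fragment, not copied from a real Ideas page.
SAMPLE_PAGE = """
<form>
  <input type="radio" name="url" value="https://example.org/papers/wp001.pdf">
  <input type="submit" value="Download">
</form>
"""

if __name__ == "__main__":
    print(find_download_url(SAMPLE_PAGE))
```

Pages without a matching input return `None`, which covers the "if it exists" case, so a scraper can skip those entries instead of failing.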

cfoster0 commented 3 years ago

This should probably be deferred to v2, since (1) it's huge and will take time to download, and (2) I've tried our PDF-to-text conversion on some samples and I think it'll need at least a slight rework.

QazQazaq commented 3 years ago

This would be a good addition.