EleutherAI / the-pile

MIT License

PhilPapers #43

Closed: cfoster0 closed this issue 3 years ago

cfoster0 commented 3 years ago

Language: primarily but not exclusively English
Date ranges: 1600s-2020
Size: 260,000+ works indexed and 52,000+ available to download via PhilArchive

PhilPapers is an international, interactive academic database of journal articles for professionals and students in philosophy. It is maintained by the Centre for Digital Philosophy at the University of Western Ontario.

I think we could at least add the open-access works, which should all be downloadable as PDFs from https://philarchive.org/. There may also be other works available for download on the main site. In addition, we could probably scrape abstracts for the rest.

thoppe commented 3 years ago

On it! Should be done in a few days. A quick sample of about 1K documents from https://philarchive.org/ gives an estimate of about 3GB of total text. This is without language filtering; some documents are in Russian and Italian (at least). It looks like filtering those out may knock the total down by 10-20%.

While the website claims 52K, the API reports around 45K. Many articles are duplicated or have been deleted/retracted.
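The sampling estimate above amounts to multiplying the mean size of the sampled documents by the expected document count. A minimal sketch of that arithmetic follows. The function name `estimate_corpus_bytes` and the sample sizes are hypothetical and chosen only for illustration; they are not the measurements from the actual sample. The 45,000 figure is the approximate API count mentioned above.

```python
def estimate_corpus_bytes(sample_sizes, total_docs):
    """Extrapolate total text size from the byte sizes of a random sample."""
    if not sample_sizes:
        raise ValueError("sample_sizes must not be empty")
    mean_bytes = sum(sample_sizes) / len(sample_sizes)
    return mean_bytes * total_docs


# Made-up per-document text sizes in bytes (illustrative only).
sample_sizes = [40_000, 90_000, 70_000]

# Approximate document count reported by the API.
estimate = estimate_corpus_bytes(sample_sizes, total_docs=45_000)
print(f"Estimated corpus size: {estimate / 1e9:.2f} GB")
```

The estimate is only as good as the sample: it assumes the sampled documents are representative, and duplicates or deleted entries would lower the real total.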

leogao2 commented 3 years ago

@thoppe It would be best if we don't do language filtering.

thoppe commented 3 years ago

This is complete. https://github.com/thoppe/The-Pile-PhilPapers

 ✔ Saved to data/PhilArchive.jsonl
 ℹ Saved 33,990 articles
 ℹ Uncompressed filesize 2,610,566,629
 ℹ Compressed filesize     797,708,027
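For a rough sense of scale, the figures in this log can be turned into an average article size and a compression ratio. This is a sketch under two assumptions the log does not state: that the sizes are in bytes, and that the compressed size reads as 797,708,027.

```python
# Figures taken from the log output.
articles = 33_990
uncompressed_bytes = 2_610_566_629
# Assumption: the compressed size in the log is 797,708,027 bytes.
compressed_bytes = 797_708_027

avg_bytes_per_article = uncompressed_bytes / articles
compression_ratio = compressed_bytes / uncompressed_bytes

print(f"Average article size: {avg_bytes_per_article:,.0f} bytes")
print(f"Compressed size is {compression_ratio:.1%} of the uncompressed size")
```

Under those assumptions this works out to roughly 77 KB of text per article, with the compressed file at about 31% of the uncompressed size.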

The issue can be closed and marked done on the board.