EleutherAI / the-pile

Biodiversity Heritage Library #42

Closed by cfoster0 3 years ago

cfoster0 commented 3 years ago

Language: primarily English, with a few thousand works total in German, French, Spanish, Dutch, Portuguese, and Latin
Date ranges: Primarily pre-1923
Size: Unclear. A large number of full length books, so likely > 1GB.

The Biodiversity Heritage Library has a very large collection (~250,000) of pre-OCR'd historical books and documents on natural history topics. https://about.biodiversitylibrary.org/tools-and-services/developer-and-data-tools/

The individual .txt file links are listed in the ItemTextURL column of this TSV (warning: the link leads to a 40+ MB file): https://www.biodiversitylibrary.org/data/hosted/item.txt
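For anyone who wants to pull the link list out of that file, here is a minimal sketch. It assumes the TSV has a header row containing ItemTextURL; the function name `extract_text_urls`, the `ItemID` column in the sample, and the example.org URLs are made up for illustration and are not taken from the real file.

```python
import csv
import io


def extract_text_urls(tsv_text, column="ItemTextURL"):
    """Return the non-empty values of `column` from tab-separated text.

    Assumes the first row is a header that includes `column`.
    """
    # QUOTE_NONE keeps stray quote characters in titles from confusing the parser.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    urls = []
    for row in reader:
        url = (row.get(column) or "").strip()
        if url:  # skip rows with no text link
            urls.append(url)
    return urls


# Tiny stand-in for item.txt (hypothetical rows, not real BHL data).
SAMPLE = (
    "ItemID\tItemTextURL\n"
    "1\thttps://example.org/a.txt\n"
    "2\t\n"
    "3\thttps://example.org/b.txt\n"
)

urls = extract_text_urls(SAMPLE)
print(urls)
```

In practice the real file would be downloaded once and its contents passed to the same function; given the size, streaming it line by line instead of loading it all into memory may be preferable.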

My primary concern is with the quality of the OCR.

StellaAthena commented 3 years ago

I think that this would be a phenomenal way to augment our knowledge set. The key questions, though, are how low-quality the OCR is and how much work processing it would take.
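One cheap way to get a first read on OCR quality would be to sample some files and score them with a crude heuristic. The sketch below is only an assumption about what such a check might look like, not something the project uses: `ocr_noise_score` is a hypothetical helper that reports the fraction of whitespace-separated tokens that are neither purely alphabetic nor purely numeric after stripping surrounding punctuation. It will penalize legitimate hyphenated words and miss misrecognitions that still look like words, so it is a rough proxy at best.

```python
import string


def ocr_noise_score(text):
    """Fraction of tokens that look garbled (0.0 = none, 1.0 = all).

    A token counts as clean if, after stripping leading and trailing
    punctuation, it is purely alphabetic or purely numeric.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    noisy = 0
    for token in tokens:
        core = token.strip(string.punctuation)
        # str.isalpha() accepts accented letters, so German or French text is fine.
        if not (core.isalpha() or core.isdigit()):
            noisy += 1
    return noisy / len(tokens)


# Hypothetical samples, written by hand to mimic clean and garbled OCR output.
clean = "The beetle was collected near the river in 1887."
garbled = "Tlie bcet1e w@s col|ected ne^r tbe r1ver"

clean_score = ocr_noise_score(clean)
garbled_score = ocr_noise_score(garbled)
print(clean_score, garbled_score)
```

Note that "Tlie" and "tbe" in the garbled sample pass as clean, which shows the limitation: a dictionary-based check would be needed to catch errors of that kind.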