EleutherAI / the-pile

MIT License

Libgen (cleaning the already-extracted text, see #data-sources) #1

Closed by StellaAthena 4 years ago

StellaAthena commented 4 years ago

priority: very high

StellaAthena commented 4 years ago

Breaking this into two issues: Epubs #11 and PDFs #12.

StellaAthena commented 3 years ago

We are using Bibliotik #22 for now, as it's easier to process. LibGen has more variety and may replace Bibliotik in the future.