EleutherAI / pilev2

MIT License
Internet Archive #13

Open upintheairsheep opened 1 year ago

upintheairsheep commented 1 year ago

http://archive.org/ - Contact the Internet Archive and ask for a listing of all the data you want. The Internet Archive is a giant library of documents (books, manuals, and other assorted PDF files) and other interesting files, including JSON files from mirrored online videos, which carry the video metadata and sometimes the comments, as well as assorted important documents.

For PDF files, they provide several derived formats, such as an OCR txt and an OCR xml. See https://archive.org/download/andrus-thesis as an example. For mirrored online videos, see https://archive.org/download/youtube-DPMluEVUqS0 as an example, and https://archive.org/download/instagram-apple for another commonly used format.

The archive also provides directory listings for common compressed file types, so you can scrape those for documents too. See #11 for formats.
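As a rough illustration of how the per-item files could be enumerated without scraping HTML, here is a minimal sketch that assumes the Internet Archive's public metadata endpoint (`https://archive.org/metadata/<identifier>`), which returns JSON with a `files` list. The helper names and the `_djvu.txt` / `_djvu.xml` suffixes used to pick out OCR output are assumptions for illustration, not something specified in this issue; check them against a real item before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Public Internet Archive URL patterns; identifiers come from item URLs
# such as https://archive.org/download/andrus-thesis
METADATA_URL = "https://archive.org/metadata/{identifier}"
DOWNLOAD_URL = "https://archive.org/download/{identifier}/{name}"


def fetch_metadata(identifier):
    """Fetch the JSON metadata record for one item (performs a network call)."""
    url = METADATA_URL.format(identifier=urllib.parse.quote(identifier))
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def select_files(metadata, suffixes):
    """Return file names from a metadata record that end with any given suffix.

    The suffixes are a hypothetical filter; adjust them to the formats you want.
    """
    wanted = tuple(s.lower() for s in suffixes)
    names = []
    for entry in metadata.get("files", []):
        name = entry.get("name", "")
        if name.lower().endswith(wanted):
            names.append(name)
    return names


def download_urls(identifier, names):
    """Build direct download URLs for the selected file names."""
    ident = urllib.parse.quote(identifier)
    return [
        DOWNLOAD_URL.format(identifier=ident, name=urllib.parse.quote(n))
        for n in names
    ]


if __name__ == "__main__":
    # Example item taken from the issue text; suffixes are assumed OCR outputs.
    item = "andrus-thesis"
    record = fetch_metadata(item)
    for url in download_urls(item, select_files(record, ["_djvu.txt", "_djvu.xml"])):
        print(url)
```

Filtering on the metadata record rather than on the directory listing keeps the selection logic independent of the page layout; for the video mirrors, the same functions could be pointed at a `.json` suffix instead.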