Documents scraped from Open Directories

https://odcrawler.xyz/ lets you search by document type. We should collect all the html, htm, pdf, doc, docx, json, xls, xlsx, java, xml, js, css, py, ppt, pptx, txt, csv, md, odf, vtt, srt, and tex files in the collection and scrape every one of them, including the directory listings themselves, and make sure to include their file names in the output. Directories that are themselves datasets should be excluded, since those will be added separately.
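As a rough illustration of the selection rule described above, here is a minimal sketch that decides whether a given URL should be scraped. It assumes the crawl already produces a list of plain URLs; the function names (`should_scrape`, `document_name`) and the `excluded_prefixes` argument are hypothetical and not part of ODCrawler or pilev2. Treating any URL whose path ends in `/` as a directory listing is also an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

# File types listed in this issue (compared case-insensitively).
WANTED_EXTENSIONS = frozenset({
    "html", "htm", "pdf", "doc", "docx", "json", "xls", "xlsx", "java",
    "xml", "js", "css", "py", "ppt", "pptx", "txt", "csv", "md", "odf",
    "vtt", "srt", "tex",
})


def should_scrape(url, excluded_prefixes=()):
    """Return True if the URL is a wanted file or a directory listing.

    excluded_prefixes is a hypothetical list of URL prefixes for
    directories that are datasets themselves and are handled separately.
    """
    if any(url.startswith(prefix) for prefix in excluded_prefixes):
        return False
    path = unquote(urlparse(url).path)
    # Assumption: a path ending in "/" (or an empty path) is a directory listing.
    if path == "" or path.endswith("/"):
        return True
    extension = PurePosixPath(path).suffix.lower().lstrip(".")
    return extension in WANTED_EXTENSIONS


def document_name(url):
    """Return the decoded file (or directory) name to store with the text."""
    return PurePosixPath(unquote(urlparse(url).path)).name
```

A prefix match is the simplest way to express the exclusion; a real pipeline might need host-level or pattern-based exclusions instead.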

EleutherAI / pilev2

Documents scraped from Open Directories #11