EleutherAI / pilev2

MIT License

Documents scraped from Open Directories #11

Open upintheairsheep opened 1 year ago

upintheairsheep commented 1 year ago

https://odcrawler.xyz/ lets you search by document type. You should get all the html, htm, pdf, doc, docx, json, xls, xlsx, java, xml, js, css, py, ppt, pptx, txt, csv, md, odf, vtt, srt, and tex files in the Collection and scrape every one of them, including the directory listings themselves, and make sure to include the file names alongside the scraped text. Exclude directories that are themselves datasets; those will be added separately.
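To make the request concrete, here is a minimal sketch of the selection step only, assuming the crawl already yields a plain list of URLs. It does not call any odcrawler API; the function names, the `excluded_prefixes` parameter, and the rule that a path ending in `/` is a directory listing are all assumptions made for illustration.

```python
from posixpath import splitext
from urllib.parse import unquote, urlsplit

# File types listed in the issue, compared case-insensitively.
WANTED_EXTENSIONS = frozenset(
    "html htm pdf doc docx json xls xlsx java xml js css py ppt pptx "
    "txt csv md odf vtt srt tex".split()
)


def file_extension(url):
    """Return the lowercase extension of the URL path, without the dot."""
    path = unquote(urlsplit(url).path)
    return splitext(path)[1].lstrip(".").lower()


def is_directory_listing(url):
    """Assumption: a path ending in '/' (or an empty path) is a listing."""
    path = urlsplit(url).path
    return path == "" or path.endswith("/")


def is_excluded(url, excluded_prefixes):
    """True if the URL sits under a directory that is a dataset itself."""
    return any(url.startswith(prefix) for prefix in excluded_prefixes)


def select_urls(urls, excluded_prefixes=()):
    """Keep directory listings and wanted file types, skip excluded dirs."""
    selected = []
    for url in urls:
        if is_excluded(url, excluded_prefixes):
            continue
        if is_directory_listing(url) or file_extension(url) in WANTED_EXTENSIONS:
            selected.append(url)
    return selected
```

The extension is read from the URL path only, so query strings such as `?x=1` do not affect matching. The list of excluded prefixes would have to be curated by hand, since nothing in the URL says whether a directory is already a dataset.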