digipres / registries-of-practice-project

The "Registries of Good Practice" Project
MIT License

Collect samples of formats from Common Crawl #20

Open anjackson opened 2 months ago

anjackson commented 2 months ago

As it's always difficult to find shareable files of various formats, one option would be to use the Common Crawl indexes to find relevant items. Common Crawl publish Apache Parquet indexes which can be used for this kind of thing. e.g.
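A rough sketch of what that could look like, not a tested pipeline: the first helper builds a SQL query against the Parquet index (for an engine such as Athena or DuckDB), and the second works out the HTTP range request needed to pull a single record out of a WARC file. The column names (`content_mime_detected`, `warc_filename`, `warc_record_offset`, `warc_record_length`), the `crawl`/`subset` partitions and the `data.commoncrawl.org` base URL are my understanding of the published index schema and should be checked against the Common Crawl documentation. The table name `ccindex`, the function names and the example crawl ID are placeholders.

```python
# Sketch only: pure helpers, no network access. Schema details are assumptions
# to verify against the Common Crawl columnar index documentation.

CC_DATA_BASE = "https://data.commoncrawl.org/"  # assumed public download base


def build_sample_query(mime_type: str, crawl: str, limit: int = 10,
                       table: str = "ccindex") -> str:
    """Build SQL that lists a few records of one detected MIME type.

    `table` is a placeholder for however the Parquet index is registered
    in the query engine being used.
    """
    # Crude guard so the values cannot break out of the SQL string literals.
    if "'" in mime_type or "'" in crawl:
        raise ValueError("quotes are not allowed in mime_type or crawl")
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        f"FROM {table}\n"
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        f"  AND content_mime_detected = '{mime_type}'\n"
        f"LIMIT {int(limit)}"
    )


def warc_record_request(warc_filename: str, offset: int, length: int):
    """Return (url, headers) for fetching one WARC record via an HTTP range."""
    url = CC_DATA_BASE + warc_filename.lstrip("/")
    # HTTP byte ranges are inclusive at both ends, hence the -1.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers


if __name__ == "__main__":
    # Example crawl ID and MIME type; swap in whatever is of interest.
    print(build_sample_query("application/pdf", "CC-MAIN-2024-10", limit=5))
```

Filtering on the `crawl` and `subset` partitions should keep the amount of data scanned down, which matters for cost on a pay-per-scan service. Each row returned would then feed `warc_record_request` to fetch just that record rather than the whole WARC file.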

Needs thinking through, and understanding what the costs and impacts are.

Note some prior work that is related: