commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

Integrate end-of-term archive table conversion tool #34

Open sebastian-nagel opened 1 week ago

sebastian-nagel commented 1 week ago

This PR integrates the end-of-term (EOT) archive table conversion tool into the main branch.

Background: a prototype converter was implemented in 2022, see commoncrawl/cc-index-table@8e0b776.

@vphill, @ibnesayeed - we never discussed this. It's equally OK if you would rather maintain the adapted converter separately.