commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

Allow using a custom table schema #17

Closed by sebastian-nagel 2 years ago

sebastian-nagel commented 2 years ago

The schema of the table created by CCIndex2Table is fixed to the built-in schema used by and for Common Crawl. To support other crawl archives, the table schema should be configurable:

  1. Allow passing a custom table schema (as a JSON file) that defines the output table.

  2. (Eventually) split the class into a generic one (requiring a custom schema) and a CC-specific one. This would also make it easier to adapt the parsing of custom CDX input.
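
To illustrate point 1, here is a minimal Python sketch of schema-driven row conversion. It is not code from CCIndex2Table, which would presumably work against Spark's own schema classes. The JSON layout is assumed to follow the form Spark uses to serialize a `StructType`, and the column names as well as the helpers `parse_schema` and `record_to_row` are hypothetical, chosen only for this example.

```python
import json

# Hypothetical custom schema. The layout is assumed to match the JSON
# form Spark uses when serializing a StructType.
EXAMPLE_SCHEMA_JSON = """
{
  "type": "struct",
  "fields": [
    {"name": "url", "type": "string", "nullable": false, "metadata": {}},
    {"name": "fetch_status", "type": "short", "nullable": false, "metadata": {}},
    {"name": "content_languages", "type": "string", "nullable": true, "metadata": {}}
  ]
}
"""

# Converters for the few primitive types this sketch supports.
CASTS = {"string": str, "short": int, "integer": int, "long": int, "double": float}


def parse_schema(text):
    """Return a list of (name, type, nullable) tuples from a schema JSON string."""
    schema = json.loads(text)
    if schema.get("type") != "struct":
        raise ValueError("top-level schema type must be 'struct'")
    fields = []
    for field in schema["fields"]:
        type_name = field["type"]
        # Nested types (structs, arrays, maps) are out of scope here.
        if not isinstance(type_name, str) or type_name not in CASTS:
            raise ValueError(f"unsupported type for column {field['name']!r}")
        fields.append((field["name"], type_name, field.get("nullable", True)))
    return fields


def record_to_row(record, fields):
    """Project a parsed index record (a dict) onto the columns of the schema."""
    row = []
    for name, type_name, nullable in fields:
        value = record.get(name)
        if value is None:
            if not nullable:
                raise ValueError(f"missing required column: {name}")
            row.append(None)
        else:
            row.append(CASTS[type_name](value))
    return tuple(row)


fields = parse_schema(EXAMPLE_SCHEMA_JSON)
row = record_to_row({"url": "https://example.org/", "fetch_status": "200"}, fields)
print(row)  # ('https://example.org/', 200, None)
```

To read the schema from a file instead, pass the file's contents to `parse_schema`. Because the output columns are driven entirely by the schema, the CC-specific part reduces to producing the `record` dict, which matches the split proposed in point 2.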