commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

Column with host name in reverse domain name notation #14

Closed by sebastian-nagel 2 years ago

sebastian-nagel commented 2 years ago

Add a column url_host_name_reversed containing the host name in reverse domain name notation (e.g. www.example.com becomes com.example.www).
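A minimal sketch of the reversal in Python, for illustration only. The function name reverse_host_name is hypothetical, and the sketch assumes a plain dot-separated host name; it makes no claim about how the indexer treats IP addresses or other special cases.

```python
def reverse_host_name(host: str) -> str:
    """Return the host name with its dot-separated labels in reverse order.

    Assumption: `host` is an ordinary DNS host name. IP addresses and
    other special cases are not treated differently in this sketch.
    """
    return ".".join(reversed(host.split(".")))


if __name__ == "__main__":
    print(reverse_host_name("www.example.com"))
    print(reverse_host_name("commoncrawl.org"))
```

Applying the function twice returns the original host name, so the notation loses no information.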

Objectives:

Thanks to @wumpus and @cldellow for the discussion (a while ago) which led to this improvement.

sebastian-nagel commented 2 years ago

The column url_host_name_reversed is included in the columnar index starting with CC-MAIN-2021-49. It may be added to earlier crawls later, together with further data format improvements.

Comparing the performance of three Athena queries that count captures from the host commoncrawl.org, using url_host_name, url_host_name_reversed, and url_surtkey respectively:

OK, the last query, which uses url_surtkey, is not equivalent because it also includes the host name www.commoncrawl.org. The equivalent query using the reversed host name is less efficient (maybe the min/max statistics are not used here?):
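To make the difference in matching semantics concrete, here is a rough Python sketch. It is not one of the Athena queries from the comparison and says nothing about their performance. The helper names and the sample host names are made up for illustration. It contrasts an exact match on the reversed host name with a match that also accepts subdomains such as www.

```python
def is_exact_host(reversed_host: str, reversed_target: str) -> bool:
    """True only if the reversed host name equals the target exactly."""
    return reversed_host == reversed_target


def is_host_or_subdomain(reversed_host: str, reversed_target: str) -> bool:
    """True if the host is the target itself or one of its subdomains.

    The trailing dot in the prefix keeps "org.commoncrawler" from
    matching the target "org.commoncrawl".
    """
    return (reversed_host == reversed_target
            or reversed_host.startswith(reversed_target + "."))


# Hypothetical sample values of url_host_name_reversed.
sample = [
    "org.commoncrawl",
    "org.commoncrawl.www",
    "org.commoncrawler",
    "org.example.www",
]
target = "org.commoncrawl"

exact = [h for h in sample if is_exact_host(h, target)]
broad = [h for h in sample if is_host_or_subdomain(h, target)]

if __name__ == "__main__":
    print("exact match:", exact)
    print("host or subdomain:", broad)
```

The exact match corresponds to counting captures from commoncrawl.org alone, while the broader match also counts www.commoncrawl.org, which is why the two kinds of query return different counts.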