Closed sebastian-nagel closed 2 years ago
The column url_host_name_reversed
is included in the columnar index starting with CC-MAIN-2021-49. Ev. to be added to earlier crawls later with further data format improvements.
Comparing the performance of 3 queries to Athena using url_host_name
, url_host_name_reversed
resp. url_surtkey
to count captures from the host commoncrawl.org
:
url_host_name
=> Run time: 5.071 sec, Data scanned: 14.47 MB
SELECT COUNT(*) AS n_captures,
url_host_name,
subset
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2021-49'
AND url_host_name = 'commoncrawl.org'
GROUP BY url_host_name, subset
url_host_name_reversed
: Run time: 2.526 sec, Data scanned: 4.93 MB
SELECT COUNT(*) AS n_captures,
url_host_name_reversed,
subset
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2021-49'
AND url_host_name_reversed = 'org.commoncrawl'
GROUP BY url_host_name_reversed, subset
url_surtkey
: Run time: 14.308 sec, Data scanned: 38.95 GB
SELECT COUNT(*) AS n_captures,
split(url_surtkey, ')', 2)[1] AS url_surt_host,
subset
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2021-49'
AND strpos(url_surtkey, 'org,commoncrawl)') = 1
GROUP BY split(url_surtkey, ')', 2)[1], subset
Ok, the last query using url_surtkey
is not equivalent as it also includes the host name www.commoncrawl.org
. The equivalent query using the reversed host name is less efficient (maybe the min/max stats are not used here?):
url_host_name_reversed
: Run time: 4.345 sec, Data scanned: 1.26 GB
SELECT COUNT(*) AS n_captures,
url_host_name_reversed,
subset
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2021-49'
AND contains(ARRAY ['org.commoncrawl', 'org.commoncrawl.www'], url_host_name_reversed)
GROUP BY url_host_name_reversed, subset
Add column
url_host_name_reversed
- host name in reverse domain name notation (com.example.www
).null
Objectives:
url_host_name
the min/max intervals should be smaller / more preciseurl_host_name_reversed
should be significantly smaller in the Parquet file and also more efficientThanks to @wumpus and @cldellow for the discussion (a while ago) which led to this improvement.