commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

Consider normalizing host, domain names and TLDs #25

Closed sebastian-nagel closed 1 year ago

sebastian-nagel commented 1 year ago

While the bulk of URLs in the crawls is normalized, this is not true for URLs stemming from redirects during fetching. As a result, host names of URLs that were not normalized may include:

sebastian-nagel commented 1 year ago

Fixed for CC-MAIN-2023-14. The number of dubious TLDs (digits only or containing a percent sign) has dropped:

count  crawl            subset
  134  CC-MAIN-2023-06  warc
  211  CC-MAIN-2023-06  robotstxt
 1188  CC-MAIN-2023-06  crawldiagnostics
    3  CC-MAIN-2023-14  crawldiagnostics
    3  CC-MAIN-2023-14  robotstxt

Counts obtained using:

select count(*) as count, crawl, subset
from "ccindex"."ccindex"
where (crawl = 'CC-MAIN-2023-06' or crawl = 'CC-MAIN-2023-14')
  and regexp_like(url_host_tld, '^\d+$|%')
group by crawl, subset;
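
To spot-check individual `url_host_tld` values outside the query, the same pattern can be mirrored in a few lines of Python. This is only a sketch: the helper name `is_dubious_tld` and the sample values are made up for illustration, and `re.ASCII` is set on the assumption that `\d` in the query engine matches ASCII digits only.

```python
import re

# Same pattern as in the query above: digits only, or containing a percent sign.
# re.ASCII limits \d to 0-9 (assumption: the SQL engine's \d is ASCII-only).
DUBIOUS_TLD = re.compile(r"^\d+$|%", re.ASCII)


def is_dubious_tld(tld: str) -> bool:
    """Return True if the TLD is digits only or contains a percent sign.

    Hypothetical helper, not part of cc-index-table. Uses search() because
    regexp_like() matches the pattern anywhere in the value.
    """
    return DUBIOUS_TLD.search(tld) is not None


if __name__ == "__main__":
    # Illustrative values only, not taken from the index.
    for tld in ["com", "123", "co%6d", "xn--p1ai", "1a"]:
        print(tld, is_dubious_tld(tld))
```
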

Some more work to do, see #26.