sebastian-nagel closed 1 year ago
Fixed for CC-MAIN-2023-14. The number of dubious TLDs (digits only or containing a percent sign) has dropped:
count | crawl | subset |
---|---|---|
134 | CC-MAIN-2023-06 | warc |
211 | CC-MAIN-2023-06 | robotstxt |
1188 | CC-MAIN-2023-06 | crawldiagnostics |
3 | CC-MAIN-2023-14 | crawldiagnostics |
3 | CC-MAIN-2023-14 | robotstxt |
Counts obtained using:

    select count(*) as count, crawl, subset
    from "ccindex"."ccindex"
    where (crawl = 'CC-MAIN-2023-06' or crawl = 'CC-MAIN-2023-14')
      and regexp_like(url_host_tld, '^\d+$|%')
    group by crawl, subset;
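
To show what the pattern in the query catches, here is a minimal Python sketch that applies the same regular expression to a few TLD strings. The helper name `is_dubious_tld` and the sample values are made up for illustration and do not come from the index. `re.search` is used on the assumption that `regexp_like` matches anywhere in the string, and `re.ASCII` on the assumption that `\d` should cover only `0-9`.

```python
import re

# Same pattern as in the query above; re.ASCII restricts \d to 0-9 (assumption).
DUBIOUS_TLD = re.compile(r"^\d+$|%", re.ASCII)


def is_dubious_tld(tld: str) -> bool:
    """Return True if the TLD is digits only or contains a percent sign."""
    # search() rather than fullmatch(): the '%' alternative may match anywhere.
    return DUBIOUS_TLD.search(tld) is not None


# Hypothetical sample values, not taken from the crawl data.
samples = ["com", "org", "123", "co%2euk", "xn--p1ai", "1a"]
dubious = [tld for tld in samples if is_dubious_tld(tld)]
print(dubious)
```

Only the digits-only value and the one containing a percent sign are flagged. A mixed value such as `1a` passes, because the `^\d+$` alternative requires the whole string to be digits.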
Some more work to do, see #26.
While the bulk of URLs in the crawls is normalized, this is not true for URLs stemming from redirects during fetching. As a result, host names of non-normalized URLs may include: