commoncrawl / nutch

Common Crawl fork of Apache Nutch
Apache License 2.0

Fetcher: filter and verify robots.txt responses before archiving #19

Closed: sebastian-nagel closed this issue 3 years ago

sebastian-nagel commented 3 years ago

So far the Fetcher has archived robots.txt files unconditionally. It does not verify that

  1. the URL filters active in the Fetcher do not exclude the URL of the robots.txt,

  2. the robots.txt file is allowed by the robots.txt of the target host in case the request is redirected across authorities (hosts),

  3. the response is actually a robots.txt file, i.e. the MIME type is text/plain or another MIME type indicating a text file.
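
For illustration, a minimal sketch of these three checks combined into a single decision, written as plain Java; the class and predicate names (RobotsTxtArchivePolicy, urlFilterAccepts, allowedOnTargetHost, looksLikeTextFile) are assumptions made for the sketch, not the actual Fetcher code:

import java.net.URI;
import java.util.function.Predicate;

/**
 * Illustrative sketch (not the actual Nutch Fetcher code): the three checks
 * a robots.txt response should pass before it is archived.
 */
public class RobotsTxtArchivePolicy {

  private final Predicate<String> urlFilterAccepts;    // 1. URL filters active in the Fetcher
  private final Predicate<String> allowedOnTargetHost; // 2. robots.txt rules of the redirect target
  private final Predicate<String> looksLikeTextFile;   // 3. MIME type indicates a text file

  public RobotsTxtArchivePolicy(Predicate<String> urlFilterAccepts,
      Predicate<String> allowedOnTargetHost, Predicate<String> looksLikeTextFile) {
    this.urlFilterAccepts = urlFilterAccepts;
    this.allowedOnTargetHost = allowedOnTargetHost;
    this.looksLikeTextFile = looksLikeTextFile;
  }

  /** Decide whether a fetched robots.txt response should be archived. */
  public boolean shouldArchive(String requestedUrl, String finalUrl, String mimeType) {
    // 1. the URL filters must not exclude the robots.txt URL
    if (!urlFilterAccepts.test(requestedUrl)) {
      return false;
    }
    // 2. if the request was redirected to a different authority (host), the final
    //    URL must be allowed by the robots.txt of the target host
    if (!sameAuthority(requestedUrl, finalUrl) && !allowedOnTargetHost.test(finalUrl)) {
      return false;
    }
    // 3. the response must actually look like a robots.txt file (text MIME type)
    return looksLikeTextFile.test(mimeType);
  }

  /** True if both URLs share scheme and authority (host and port). */
  static boolean sameAuthority(String a, String b) {
    URI ua = URI.create(a);
    URI ub = URI.create(b);
    return String.valueOf(ua.getScheme()).equalsIgnoreCase(String.valueOf(ub.getScheme()))
        && String.valueOf(ua.getAuthority()).equalsIgnoreCase(String.valueOf(ub.getAuthority()));
  }
}

In the real Fetcher the second predicate would presumably consult the robots rules obtained for the redirect target host (Nutch uses crawler-commons for robots.txt parsing).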

These points raise two issues affecting the robots.txt dataset:

A. Because the robots.txt RFC draft requires that redirects are followed even if they lead to a different host/domain, and robots.txt requests cannot be checked in advance against the robots.txt of the target host, the robots.txt WARC files may include content disallowed by the robots.txt of the hosting server. For example, if http://a.example/robots.txt redirects to a path on b.example that is disallowed by b.example's own robots.txt, the fetched content would still be archived.

B. The data set also includes HTML pages and other MIME types not expected as a response for /robots.txt. Below are the counts from the June 2021 crawl (perc_p = percentage of pages, perc_w = percentage of WARC storage):

n_pages perc_p perc_w content_mime_detected
55502794 93.451 63.159 text/plain
3100743 5.221 32.762 text/html
380478 0.641 3.034 application/xhtml+xml
275014 0.463 0.370 message/rfc822
65928 0.111 0.087 text/txt
28735 0.048 0.193 application/javascript
6263 0.011 0.007 text/text
6035 0.010 0.005 text/*
5727 0.010 0.008 application/json
3543 0.006 0.005 application/rtf
2807 0.005 0.006 text/x-php
2789 0.005 0.012 application/xml
1961 0.003 0.002 text/plan
1619 0.003 0.002 text/css
1278 0.002 0.001 application/x-ms-owner

In terms of WARC storage the share is even higher: more than 30% is used by HTML pages.

The counts were obtained by the following query on the columnar index:

SELECT COUNT(*) as n_pages,
       COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as perc_p,
       SUM(warc_record_length) * 100.0 / SUM(SUM(warc_record_length)) OVER() as perc_w,
       content_mime_detected
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2021-25'
  AND subset = 'robotstxt'
  AND fetch_status = 200
GROUP BY content_mime_detected
ORDER BY n_pages DESC;
sebastian-nagel commented 3 years ago
Fixed with the July/August crawl (CC-MAIN-2021-31); the MIME type counts are now mostly as expected:

n_pages perc_p perc_w content_mime_detected
58173609 99.148 99.033 text/plain
412787 0.704 0.813 message/rfc822
64969 0.111 0.121 text/txt
11089 0.019 0.013 text/*
7315 0.012 0.012 text/text
2067 0.004 0.004 text/plan
1064 0.002 0.001 text/plain%0a
439 0.001 0.001 text/plane
240 0.000 0.000 text/pain
50 0.000 0.000 text/ascii
29 0.000 0.000 text/plaintxt
25 0.000 0.000 text/plai
17 0.000 0.000 text/txt-format
7 0.000 0.000 text/ansi
7 0.000 0.000 text/plant
6 0.000 0.000 text/plaintext
1 0.000 0.000 text/plains
1 0.000 0.000 text/robotstxt+raw
1 0.000 0.000 text/textplane
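
Most of the remaining long tail (text/txt, text/plan, text/pain, ...) are misspelled but still text-like MIME types. As a minimal illustration of the lenient acceptance described in item 3 above (the actual check in the Fetcher may differ, and the detected MIME type shown here can differ from the Content-Type sent by the server):

import java.util.Locale;

/** Illustrative only: lenient check whether a MIME type indicates a text file. */
public final class TextMimeTypes {
  public static boolean isTextLike(String mimeType) {
    if (mimeType == null) {
      return false;
    }
    // normalize and accept text/plain along with the long tail of text/* variants
    return mimeType.trim().toLowerCase(Locale.ROOT).startsWith("text/");
  }
}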