commoncrawl / nutch

Common Crawl fork of Apache Nutch
Apache License 2.0

Fetcher: filter and verify robots.txt responses before archiving #19

Closed: sebastian-nagel closed this issue 3 years ago

sebastian-nagel commented 3 years ago

So far the Fetcher has archived robots.txt files unconditionally. It does not verify that

  1. the URL filters active in the Fetcher do not exclude the URL of the robots.txt,

  2. the robots.txt file is allowed by the robots.txt of the target host in case the request is redirected across authorities (hosts),

  3. the response is actually a robots.txt file, i.e. the MIME type is text/plain or another MIME type indicating a text file.
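
For illustration, a minimal sketch of these three checks combined into a single decision, written as plain Java; the class and predicate names (RobotsTxtArchivePolicy, urlFilterAccepts, allowedOnTargetHost, looksLikeTextFile) are assumptions made for the sketch, not the actual Fetcher code:

import java.net.URI;
import java.util.function.Predicate;

/**
 * Illustrative sketch (not the actual Nutch Fetcher code): the three checks
 * a robots.txt response should pass before it is archived.
 */
public class RobotsTxtArchivePolicy {

  private final Predicate<String> urlFilterAccepts;    // 1. URL filters active in the Fetcher
  private final Predicate<String> allowedOnTargetHost; // 2. robots.txt rules of the redirect target
  private final Predicate<String> looksLikeTextFile;   // 3. MIME type indicates a text file

  public RobotsTxtArchivePolicy(Predicate<String> urlFilterAccepts,
      Predicate<String> allowedOnTargetHost, Predicate<String> looksLikeTextFile) {
    this.urlFilterAccepts = urlFilterAccepts;
    this.allowedOnTargetHost = allowedOnTargetHost;
    this.looksLikeTextFile = looksLikeTextFile;
  }

  /** Decide whether a fetched robots.txt response should be archived. */
  public boolean shouldArchive(String requestedUrl, String finalUrl, String mimeType) {
    // 1. the URL filters must not exclude the robots.txt URL
    if (!urlFilterAccepts.test(requestedUrl)) {
      return false;
    }
    // 2. if the request was redirected to a different authority (host), the final
    //    URL must be allowed by the robots.txt of the target host
    if (!sameAuthority(requestedUrl, finalUrl) && !allowedOnTargetHost.test(finalUrl)) {
      return false;
    }
    // 3. the response must actually look like a robots.txt file (text MIME type)
    return looksLikeTextFile.test(mimeType);
  }

  /** True if both URLs share scheme and authority (host and port). */
  static boolean sameAuthority(String a, String b) {
    URI ua = URI.create(a);
    URI ub = URI.create(b);
    return String.valueOf(ua.getScheme()).equalsIgnoreCase(String.valueOf(ub.getScheme()))
        && String.valueOf(ua.getAuthority()).equalsIgnoreCase(String.valueOf(ub.getAuthority()));
  }
}

In the real Fetcher the second predicate would presumably consult the robots rules obtained for the redirect target host (Nutch uses crawler-commons for robots.txt parsing).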

These points raise two issues affecting the robots.txt dataset:

A. Because the robots.txt RFC draft requires that redirects are followed even if they lead to a different host/domain, and robots.txt requests cannot be checked in advance against the robots.txt of the target host, the robots.txt WARC files may include content disallowed by the robots.txt of the hosting server. For example, if http://a.example/robots.txt redirects to a path on b.example that is disallowed by b.example's own robots.txt, the fetched content would still be archived.

B. The data set also includes HTML pages and other MIME types not expected as a response for /robots.txt. Below are the counts from the June 2021 crawl (perc_p = percentage of pages, perc_w = percentage of WARC storage):

n_pages perc_p perc_w content_mime_detected
55502794 93.451 63.159 text/plain
3100743 5.221 32.762 text/html
380478 0.641 3.034 application/xhtml+xml
275014 0.463 0.370 message/rfc822
65928 0.111 0.087 text/txt
28735 0.048 0.193 application/javascript
6263 0.011 0.007 text/text
6035 0.010 0.005 text/*
5727 0.010 0.008 application/json
3543 0.006 0.005 application/rtf
2807 0.005 0.006 text/x-php
2789 0.005 0.012 application/xml
1961 0.003 0.002 text/plan
1619 0.003 0.002 text/css
1278 0.002 0.001 application/x-ms-owner

In terms of WARC storage the share is even higher: more than 30% is used by HTML pages.

The counts were obtained by the following query on the columnar index:

SELECT COUNT(*) as n_pages,
       COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as perc_p,
       SUM(warc_record_length) * 100.0 / SUM(SUM(warc_record_length)) OVER() as perc_w,
       content_mime_detected
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2021-25'
  AND subset = 'robotstxt'
  AND fetch_status = 200
GROUP BY content_mime_detected
ORDER BY n_pages DESC;
sebastian-nagel commented 3 years ago
Fixed with the July/August crawl (CC-MAIN-2021-31); the MIME type counts are now mostly as expected:

n_pages perc_p perc_w content_mime_detected
58173609 99.148 99.033 text/plain
412787 0.704 0.813 message/rfc822
64969 0.111 0.121 text/txt
11089 0.019 0.013 text/*
7315 0.012 0.012 text/text
2067 0.004 0.004 text/plan
1064 0.002 0.001 text/plain%0a
439 0.001 0.001 text/plane
240 0.000 0.000 text/pain
50 0.000 0.000 text/ascii
29 0.000 0.000 text/plaintxt
25 0.000 0.000 text/plai
17 0.000 0.000 text/txt-format
7 0.000 0.000 text/ansi
7 0.000 0.000 text/plant
6 0.000 0.000 text/plaintext
1 0.000 0.000 text/plains
1 0.000 0.000 text/robotstxt+raw
1 0.000 0.000 text/textplane
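
Most of the remaining long tail (text/txt, text/plan, text/pain, ...) are misspelled but still text-like MIME types. As a minimal illustration of the lenient acceptance described in item 3 above (the actual check in the Fetcher may differ, and the detected MIME type shown here can differ from the Content-Type sent by the server):

import java.util.Locale;

/** Illustrative only: lenient check whether a MIME type indicates a text file. */
public final class TextMimeTypes {
  public static boolean isTextLike(String mimeType) {
    if (mimeType == null) {
      return false;
    }
    // normalize and accept text/plain along with the long tail of text/* variants
    return mimeType.trim().toLowerCase(Locale.ROOT).startsWith("text/");
  }
}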