sebastian-nagel closed this issue 3 years ago
Fixed with the July/August crawl (CC-MAIN-2021-31). The counts of MIME types are now mostly as expected:

| n_pages | perc_p | perc_w | content_mime_detected |
|---|---|---|---|
| 58173609 | 99.148 | 99.033 | text/plain |
| 412787 | 0.704 | 0.813 | message/rfc822 |
| 64969 | 0.111 | 0.121 | text/txt |
| 11089 | 0.019 | 0.013 | text/* |
| 7315 | 0.012 | 0.012 | text/text |
| 2067 | 0.004 | 0.004 | text/plan |
| 1064 | 0.002 | 0.001 | text/plain%0a |
| 439 | 0.001 | 0.001 | text/plane |
| 240 | 0.000 | 0.000 | text/pain |
| 50 | 0.000 | 0.000 | text/ascii |
| 29 | 0.000 | 0.000 | text/plaintxt |
| 25 | 0.000 | 0.000 | text/plai |
| 17 | 0.000 | 0.000 | text/txt-format |
| 7 | 0.000 | 0.000 | text/ansi |
| 7 | 0.000 | 0.000 | text/plant |
| 6 | 0.000 | 0.000 | text/plaintext |
| 1 | 0.000 | 0.000 | text/plains |
| 1 | 0.000 | 0.000 | text/robotstxt+raw |
| 1 | 0.000 | 0.000 | text/textplane |

`message/rfc822` is erroneous and is addressed in TIKA-3489.
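
The misspelled and encoded variants in the table (`text/plan`, `text/plain%0a`, `text/txt`, ...) suggest that any filter on the MIME type needs to be tolerant. A minimal sketch in Python of what such a check might look like; this is not the Fetcher's actual code. The function names and the set of rejected subtypes are assumptions made for illustration only.

```python
from urllib.parse import unquote

# Assumption (hypothetical list): text/* subtypes that are clearly not
# a plain-text robots.txt and should be rejected.
REJECTED_SUBTYPES = {"html", "xml", "css", "javascript"}


def normalize_mime(content_type: str) -> str:
    """Drop parameters, percent-decode, trim whitespace, and lowercase."""
    base = content_type.split(";", 1)[0]
    return unquote(base).strip().lower()


def looks_like_text(content_type: str) -> bool:
    """Return True for text/* types except the rejected subtypes."""
    mime = normalize_mime(content_type)
    main, _, sub = mime.partition("/")
    return main == "text" and sub not in REJECTED_SUBTYPES
```

With this rule, `text/plain%0a` normalizes to `text/plain` and the misspelled `text/*` variants are kept, while `text/html` and `message/rfc822` are rejected. Whether a misdetected `message/rfc822` should be kept anyway is a separate decision that this sketch does not address.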
Until now, the Fetcher has archived robots.txt files unconditionally. It does not verify that

- the URL filters active in the Fetcher do not exclude the URL of the robots.txt,
- the robots.txt file is allowed by the robots.txt of the target host in case it is redirected across authorities (hosts),
- the response is actually a `robots.txt` file, or at least that the MIME type is `text/plain` or any other MIME type indicating a text file.

These points raise two issues affecting the robots.txt dataset:

A. Because the robots.txt RFC draft requires that redirects are followed even if they lead to a different host/domain, and because robots.txt requests cannot be checked in advance against the robots.txt of the target host, the robots.txt WARC files may include content disallowed by the robots.txt of the hosting server.

B. The dataset also includes HTML pages and other MIME types not expected as a response for `/robots.txt`; below are the counts for the June 2021 crawl:

In terms of WARC storage, more than 30% is used by HTML pages.
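
For issue A, one conceivable approach is a post-hoc check: once the robots.txt of the redirect target's host is known, the redirect target can be tested against it before the response is archived. A rough sketch using Python's standard `urllib.robotparser`; the helper names, the user-agent string, and the example URLs are hypothetical, and this is not how the Fetcher is implemented.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def crosses_authority(source_url: str, target_url: str) -> bool:
    """True if a redirect leads to a different authority (host[:port])."""
    source = urlsplit(source_url).netloc.lower()
    target = urlsplit(target_url).netloc.lower()
    return source != target


def allowed_by_target(target_url: str, target_robots_txt: str, agent: str) -> bool:
    """Test target_url against the robots.txt rules of its own host."""
    parser = RobotFileParser()
    parser.parse(target_robots_txt.splitlines())
    return parser.can_fetch(agent, target_url)


# Hypothetical example: a.example/robots.txt redirects to a path on b.example.
rules_of_b = "User-agent: *\nDisallow: /private/\n"
redirect_target = "https://b.example/private/rules.txt"
if crosses_authority("https://a.example/robots.txt", redirect_target):
    keep = allowed_by_target(redirect_target, rules_of_b, "ExampleBot")
```

In this example `keep` is `False`, so the redirected response would not be archived. A same-host redirect is not checked here, since the sketch only covers the cross-authority case described in A.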
The counts were obtained by the following query on the columnar index: