Open · sebastian-nagel opened this issue 1 year ago
For 1.b: did some detective work and found where the URL came from:
zstdgrep -a "https://rosecollection.brandeis.edu/objects-1/portfolio?records=12&query" wat_seeds/wat.part-r-00383.zst | more
It was a link coming from a WAT that did not get filtered out and ended up being selected.
There is currently no mechanism in Nutch to simply filter a URL by its length. @sebastian-nagel had planned to add one via configuration properties:
<property><name>urlfilter.fast.url.path.max.length</name><value>1024</value></property>
<property><name>urlfilter.fast.url.pathquery.max.length</name><value>2048</value></property>
but it hasn't been implemented in urlfilter.fast yet.
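For illustration, here is a minimal sketch of what such a length check could look like. The class name, constructor, and the `filter(String)` convention (return the URL to accept, `null` to reject) follow the general shape of Nutch URL filters, but this is not the actual plugin code:

```java
import java.net.MalformedURLException;
import java.net.URL;

/**
 * Minimal sketch of the planned length check for urlfilter.fast.
 * The class and how configuration is passed in are assumptions for
 * illustration, not the actual Nutch plugin API.
 */
public class UrlLengthFilter {
    private final int maxPathLength;       // urlfilter.fast.url.path.max.length
    private final int maxPathQueryLength;  // urlfilter.fast.url.pathquery.max.length

    public UrlLengthFilter(int maxPathLength, int maxPathQueryLength) {
        this.maxPathLength = maxPathLength;
        this.maxPathQueryLength = maxPathQueryLength;
    }

    /** Returns the URL if accepted, or null to filter it out (Nutch convention). */
    public String filter(String urlString) {
        try {
            URL url = new URL(urlString);
            String path = url.getPath();
            String query = url.getQuery();
            int pathQueryLength = path.length()
                    + (query == null ? 0 : 1 + query.length()); // +1 for '?'
            if (path.length() > maxPathLength
                    || pathQueryLength > maxPathQueryLength) {
                return null; // too long: reject
            }
            return urlString;
        } catch (MalformedURLException e) {
            return null; // unparseable URLs are rejected as well
        }
    }
}
```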
During fetching, URLs are filtered by urlfilter.fast alone, so it's best to implement the length check there.
I will create a new issue for this. We need this before starting the next crawl.
The culprit can be found below.
Should run a dummy crawl with it and investigate why 1.b happened (what truncated it); a quick scanner for such records is sketched below.
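A quick-and-dirty way to hunt for such records in the dummy crawl's WARC output: stream the decompressed text and flag GET lines that lack an HTTP version. This is only a sketch for eyeballing output, not a proper WARC parser, and it can produce false positives from GET lines embedded in response payloads:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

/**
 * Naive scanner for truncated request lines in a .warc.gz file.
 * It does not parse WARC record boundaries; it just streams the
 * decompressed text and flags GET lines without an HTTP version.
 */
public class TruncatedRequestLineScanner {
    public static void main(String[] args) throws Exception {
        // Java's GZIPInputStream reads the concatenated gzip members
        // that make up a WARC file.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0])),
                StandardCharsets.ISO_8859_1))) {
            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                // A well-formed request line ends in " HTTP/<version>".
                if (line.startsWith("GET ")
                        && !line.matches(".* HTTP/\\d(\\.\\d)?$")) {
                    System.out.println(lineNo + ": truncated? (" + line.length()
                            + " chars) "
                            + line.substring(0, Math.min(80, line.length())));
                }
            }
        }
    }
}
```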
@wumpus: check whether something needs fixing in some Python library as well.
If a WARC request record contains an overlong and truncated HTTP request line (normally `GET /path HTTP/1.1`), HttpRequestMessageParser throws an exception, which causes the request record not to be transformed into a WAT record. If the exception is not handled in the calling code, even the whole WAT/WET extractor job (commoncrawl/ia-hadoop-tools) may fail. The issue was observed on a couple of WARC files of CC-MAIN-2023-40:
`GET /path-truncated`
which caused the HttpRequestMessageParser to fail (no HTTP version). Investigate in separate issues a. why the truncation happened (commoncrawl/nutch: in the WARC writer or at the protocol level recording the HTTP communication between crawler and web server)? b. how these URLs stem from and whether the URL filters need to be tightened to avoid similar errors.Response message to long
Attachments:
CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request.warc.gz
CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request-only.warc.gz