chatnoir-eu / chatnoir-resiliparse

A robust web archive analytics toolkit
https://resiliparse.chatnoir.eu
Apache License 2.0

Fastwarc: CLI may index gzipped WARC records with erroneous length 0 #13

Closed by sebastian-nagel 2 years ago

sebastian-nagel commented 2 years ago

The fastwarc command-line subcommand "index" indexes some records of a gzipped WARC file with an erroneous record length of zero:

$> wget http://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2021/09/CC-NEWS-20210930113548-00741.warc.gz

$> fastwarc index -fwarc-type,warc-target-uri,offset,length CC-NEWS-20210930113548-00741.warc.gz \
    | grep -F '"length": "0"'
{"warc-type": "response", "warc-target-uri": "https://www.themarketsdaily.com/2021/09/30/ishares-sp-500-etf-nysearcaivv-sees-strong-trading-volume.html", "offset": "232757027", "length": "0"}
{"warc-type": "response", "warc-target-uri": "https://www.timeturk.com/yasam/baskan-buyukkilic-10-milyon-tl-yatirim-yapilan-yeralti-carsisi-nda-incelemede-bulundu-esnafi-ziyaret-etti/haber-1703634", "offset": "278528237", "length": "0"}
{"warc-type": "response", "warc-target-uri": "https://www.sondakika.com/haber/haber-yayinci-tevfik-rauf-baysal-vefat-etti-14429565/", "offset": "1044381471", "length": "0"}

See also the discussion in #11; however, fewer records are affected here. With the uncompressed WARC file, the error is not reproducible.
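For anyone who wants to post-process the index output rather than grep it, here is a minimal Python sketch that does the same filtering. It assumes only the JSON-lines shape shown in the output above (numeric fields encoded as strings). The function name `zero_length_records` and the sample URLs are made up for illustration.

```python
import json


def zero_length_records(index_lines):
    """Yield parsed index entries whose "length" field is zero."""
    for line in index_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # The index output above encodes numbers as strings, so convert first.
        if int(entry["length"]) == 0:
            yield entry


# Hypothetical sample lines in the same shape as the index output.
sample = [
    '{"warc-type": "response", "warc-target-uri": "https://example.com/a", "offset": "100", "length": "0"}',
    '{"warc-type": "response", "warc-target-uri": "https://example.com/b", "offset": "900", "length": "2048"}',
]
bad = list(zero_length_records(sample))
```

In practice the lines would come from the saved output of the `fastwarc index` call above, for example by iterating over an open file.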

phoerious commented 2 years ago

Thanks, I'll check what may be the reason for the remaining records.

> With the uncompressed WARC file, the error is not reproducible.

Yes, an error there would be very unexpected, since without compression everything is straightforward. With compressed records, offset calculation is more difficult.
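To illustrate why compressed offsets are harder: in a record-wise gzipped WARC, each record is its own gzip member, so a record's offset and length are positions in the compressed stream, not in the decompressed data. The following is a rough sketch using only Python's standard library, not FastWARC's actual code. The helper name `gzip_member_spans` is hypothetical, and it holds the whole file in memory, which is fine for a demonstration but not for large archives.

```python
import gzip
import zlib


def gzip_member_spans(data):
    """Return (offset, length) in compressed bytes for every gzip member."""
    spans = []
    pos = 0
    while pos < len(data):
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(31)
        decomp.decompress(data[pos:])
        if not decomp.eof:
            raise ValueError("truncated gzip member at offset %d" % pos)
        # Bytes after the end of this member are left in unused_data.
        consumed = len(data) - pos - len(decomp.unused_data)
        spans.append((pos, consumed))
        pos += consumed
    return spans


# Two independently compressed "records" concatenated into one stream.
members = [gzip.compress(b"record one"), gzip.compress(b"record two, a bit longer")]
blob = b"".join(members)
spans = gzip_member_spans(blob)
```

A streaming reader has to do the same bookkeeping while refilling its input buffer, which is where it becomes easy to lose track of exactly where one member ends and the next begins.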

phoerious commented 2 years ago

Turns out, consume() can already skip over to the next GZip member in some cases. I don't really know when this happens, but it probably has to do with buffer refills at member boundaries. The easiest way to fix this was to simply use the next/previous record for calculating the length, similar to what you did in your first draft.
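The idea of deriving a record's length from its neighbour can be sketched as follows. This is only an illustration of the principle, not the code that went into the fix. The function name `lengths_from_offsets` and the example numbers are invented.

```python
def lengths_from_offsets(offsets, total_size):
    """Compute record lengths as the gap to the next record's offset.

    offsets must be sorted ascending; total_size is the size of the
    (compressed) file and closes the last record.
    """
    ends = list(offsets[1:]) + [total_size]
    return [end - start for start, end in zip(offsets, ends)]


# Hypothetical offsets of three records in a 6000-byte file.
offsets = [0, 1200, 4500]
lengths = lengths_from_offsets(offsets, 6000)
```

Because each length depends only on where records start, it stays correct even if the reader has already advanced into the next member by the time the current record is finished.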

phoerious commented 2 years ago

v0.6.1 is underway: https://github.com/chatnoir-eu/chatnoir-resiliparse/actions/runs/1304376163

I also fixed an LZ4 buffer skip error that could result in incomplete WARC reads, so that warrants another patch release.