Closed sebastian-nagel closed 2 years ago
Thanks, I'll check what may be the reason for the remaining records.
With the uncompressed WARC file, the error is not reproducible.
Yes, that would be very unexpected, since without compression everything is very straightforward. With compressed records, offset calculation is more difficult.
Turns out, `consume()` can already skip over to the next GZip member in some cases. I don't really know when this happens, but it probably has to do with buffer refills at member boundaries. The easiest fix was to simply use the next/previous record for calculating the length, similar to what you did in your first draft.
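To illustrate the idea of deriving a record's length from its neighbour's offset, here is a minimal standalone sketch using only the Python standard library. It is not FastWARC's actual implementation; the function names `gzip_member_offsets` and `lengths_from_offsets` are made up for this example, and it assumes the whole file fits in memory and that each WARC record is stored as its own GZip member.

```python
import gzip
import zlib
from typing import List


def gzip_member_offsets(data: bytes) -> List[int]:
    """Return the byte offset at which each GZip member starts."""
    offsets = []
    pos = 0
    while pos < len(data):
        offsets.append(pos)
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        decomp.decompress(data[pos:])
        if not decomp.eof:
            raise ValueError("truncated GZip member at offset %d" % pos)
        # Everything after the end of this member is left in unused_data,
        # so the next member starts where unused_data begins.
        pos = len(data) - len(decomp.unused_data)
    return offsets


def lengths_from_offsets(offsets: List[int], total_size: int) -> List[int]:
    """Compressed length of each member: next offset minus own offset."""
    ends = offsets[1:] + [total_size]
    return [end - start for start, end in zip(offsets, ends)]


# Three independently compressed "records" concatenated into one stream.
members = [gzip.compress(b"record one"),
           gzip.compress(b"second record, a bit longer"),
           gzip.compress(b"3")]
blob = b"".join(members)

offsets = gzip_member_offsets(blob)
lengths = lengths_from_offsets(offsets, len(blob))
```

Because each length is the difference between two start offsets (or the file size for the last record), it does not depend on where the reader happens to be positioned after consuming a record, so a skip into the next member cannot produce a zero length.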
v0.6.1 is underway: https://github.com/chatnoir-eu/chatnoir-resiliparse/actions/runs/1304376163
I also fixed an LZ4 buffer skip error that could result in incomplete WARC reads, so that warrants another patch release.
The fastwarc command-line tool "index" indexes some records of a gzipped WARC file with an erroneous record length of zero:
See also the discussion in #11; however, fewer records are affected here. With the uncompressed WARC file, the error is not reproducible.