chatnoir-eu / chatnoir-resiliparse

A robust web archive analytics toolkit
https://resiliparse.chatnoir.eu
Apache License 2.0

FastWARC: BufferedReader may hang up on truncated gzipped WARC file #6

Closed sebastian-nagel closed 2 years ago

sebastian-nagel commented 2 years ago

The ArchiveIterator, resp. the underlying stream_io.BufferedReader, may hang up when reading a truncated gzipped WARC file (e.g. an incomplete download). The issue can be reproduced by reading clipped.warc.gz, see iipc/jwarc#17. The stack during the hangup (instead of ftell, I've also observed stream_io.FileStream.read() on top of _refill_working_buf()):

#3  0x00007f98a34f8705 in __GI__IO_ftell (fp=0x19b3790) at ioftell.c:38
#4  0x00007f98a2764766 in __pyx_f_8fastwarc_9stream_io_10GZipStream__refill_working_buf (__pyx_v_self=0x7f98a19fad60, __pyx_v_size=16384)
    at fastwarc/stream_io.cpp:4944
#5  0x00007f98a276d500 in __pyx_f_8fastwarc_9stream_io_10GZipStream_read (__pyx_v_self=0x7f98a19fad60, __pyx_v_out="", __pyx_v_size=16384)
    at fastwarc/stream_io.cpp:5191
#6  0x00007f98a27645bc in __pyx_f_8fastwarc_9stream_io_14BufferedReader__fill_buf (__pyx_v_self=0x7f98a19fb9a0) at fastwarc/stream_io.cpp:9201
#7  0x00007f98a276ce6b in __pyx_f_8fastwarc_9stream_io_14BufferedReader_read (__pyx_v_self=0x7f98a19fb9a0, __pyx_skip_dispatch=<optimized out>, 
    __pyx_optional_args=<optimized out>) at fastwarc/stream_io.cpp:9684
#8  0x00007f98a2765d75 in __pyx_pf_8fastwarc_9stream_io_14BufferedReader_4read (__pyx_v_size=16384, __pyx_v_self=0x7f98a19fb9a0)
    at fastwarc/stream_io.cpp:9840
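
For reference, a minimal reproduction sketch of the hang, assuming the FastWARC stream_io/warc API and a local copy of clipped.warc.gz from iipc/jwarc#17 (file name illustrative):

# Reproduction sketch: iterate a truncated gzipped WARC with FastWARC.
# Before the fix, this loop (or reading a record body) could hang instead
# of stopping at the truncation point.
from fastwarc.warc import ArchiveIterator
from fastwarc.stream_io import FileStream, GZipStream

stream = GZipStream(FileStream('clipped.warc.gz', 'rb'))  # truncated input
for record in ArchiveIterator(stream):
    print(record.record_id)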
phoerious commented 2 years ago

The question is: is this expected behaviour or not? It's invalid input, so you would want some sort of error thrown.

EDIT: ah, I see. The content reader hangs. That shouldn't happen.

phoerious commented 2 years ago

Fixed together with some other issues. New binaries should be on PyPI in a few minutes.

potthast commented 2 years ago

The goal, as I understood it, is not just resilience of large-scale data processing jobs with respect to, e.g., extreme or invalid HTML files, but also resilience against errors occurring in other parts of the processing pipeline. It would be wasteful if a processing job over a million WARC files failed because of a single corrupt WARC file.

At any rate, can recoverable errors be logged (on demand)?

EDIT: This comment relates to the previous one on whether this was expected behavior.

phoerious commented 2 years ago

Of course. But resilience also means that you should be able to react to errors. With the fix, the processing pipeline just continues without errors even if the GZip stream is truncated, which is fine, I believe (it shouldn't hang in any case; hangs were one of the major issues I've had with previous pipelines and the whole reason Resiliparse has TimeGuard and MemoryGuard). In fact, I wonder if this error should be logged at all or whether it should be up to the user to detect this kind of issue. As a user, you could compare the stream content length with the Content-Length header or verify the record digests if you worry about truncated records. So yes, throwing an unexpected exception wouldn't be desirable here, I would say.
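
For illustration, a sketch of such a user-side check that compares the bytes actually read with the record's Content-Length (assuming the FastWARC record attributes content_length, reader, and record_id; the file name is hypothetical):

# User-side truncation check, not built-in library behaviour.
from fastwarc.warc import ArchiveIterator

with open('example.warc.gz', 'rb') as stream:  # hypothetical file name
    # parse_http=False keeps the raw record body intact for the comparison
    for record in ArchiveIterator(stream, parse_http=False):
        body = record.reader.read()
        if len(body) < record.content_length:
            print(f'Possibly truncated record: {record.record_id}')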

potthast commented 2 years ago

Regarding logging, I guess we should not make assumptions about what goes on in the different parts of an operation that processes WARC files at scale. Rather, if we have knowledge of an error, it makes sense to tell the user about it, albeit perhaps only on demand.

So, what's the most common wish users have from their tools? Silent by default, and noisy on demand? Or the other way around?

If more extensive logging is introduced, it creates a lot of extra plumbing (e.g., where does the tool store the logs, can that be adjusted, logging server connections in case of distributed usage, etc.). But in the long run, such facilities might be asked for anyway, given the professional context of resilient large-scale processing that this tool targets.
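
For what it's worth, the conventional "silent by default, noisy on demand" pattern needs little plumbing. A sketch using Python's standard logging module (this is not an existing FastWARC facility; the logger name and helper are hypothetical):

import logging

# Library side: attach a NullHandler so nothing is emitted unless the user opts in.
logger = logging.getLogger('fastwarc')      # hypothetical logger name
logger.addHandler(logging.NullHandler())

def report_recoverable(message):
    # Would be called wherever a recoverable error (e.g. a truncated gzip stream) is detected.
    logger.warning(message)

# User side: opting in to the output.
logging.basicConfig(level=logging.WARNING)
report_recoverable('gzip stream truncated, stopping iteration early')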

phoerious commented 2 years ago

For performance reasons, I would refrain from adding intensive logging at the moment.