chatnoir-eu / chatnoir-resiliparse

A robust web archive analytics toolkit
https://resiliparse.chatnoir.eu
Apache License 2.0

CLI index: fix exception when determining the length of the last WARC record #26

Closed sebastian-nagel closed 1 year ago

sebastian-nagel commented 1 year ago

Use tell() on the underlying stream (instead of the fastwarc stream_io reader) to determine the file offset after the last WARC record has been read. This fixes the following exception, which also prevents the last WARC record from being indexed:

Traceback (most recent call last):
  File "/usr/local/bin/fastwarc", line 8, in <module>
    sys.exit(main())
  ...
  File "/usr/local/lib/python3.10/dist-packages/fastwarc/cli.py", line 283, in index
    _index_record(output, fields, preserve_multi_header, prev_record, prev_record.reader.tell(), infile)
  File "fastwarc/warc.pyx", line 704, in fastwarc.warc.WarcRecord.reader.__get__
  File "fastwarc/warc.pyx", line 491, in fastwarc.warc.WarcRecord._assert_not_stale
fastwarc.stream_io.ReaderStaleError: Reader is stale. Use freeze() if you want to retain record payload.
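
For illustration, here is a minimal, self-contained sketch of the idea behind the fix. It does not use the FastWARC API: the one-line-per-record format and the helper names `iter_record_offsets` and `build_index` are hypothetical stand-ins. It assumes an indexer that derives each record's length from the start offset of the record that follows it, so the last record's end must be read from the underlying stream once iteration has finished.

```python
import io


def iter_record_offsets(stream):
    """Toy stand-in for a record iterator: one line = one record.

    Yields the start offset of each record in the underlying stream.
    """
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            return
        yield offset


def build_index(stream):
    """Return a list of (offset, length) pairs, one per record."""
    index = []
    prev_offset = None
    for offset in iter_record_offsets(stream):
        if prev_offset is not None:
            # Length of the previous record = distance to the next record.
            index.append((prev_offset, offset - prev_offset))
        prev_offset = offset
    if prev_offset is not None:
        # No next record exists for the last one, so ask the underlying
        # stream (not a per-record reader) where reading ended.
        index.append((prev_offset, stream.tell() - prev_offset))
    return index


data = b"record-1\nrecord-22\nrecord-333\n"
index = build_index(io.BytesIO(data))
print(index)
```

In this toy model the summed lengths equal the size of the input, which is a quick check that the last record has not been dropped from the index.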

phoerious commented 1 year ago

Thanks. There is a test failure in the Windows build, but it seems unrelated. I triggered the build again, let's see.

phoerious commented 1 year ago

Seems like there were some changes in the uchardet library on Windows.

codecov[bot] commented 1 year ago

Codecov Report

Merging #26 (61c4a6f) into develop (11bfbde) will increase coverage by 90.54%. The diff coverage is n/a.

@@             Coverage Diff              @@
##           develop      #26       +/-   ##
============================================
+ Coverage         0   90.54%   +90.54%     
============================================
  Files            0       21       +21     
  Lines            0     3110     +3110     
============================================
+ Hits             0     2816     +2816     
- Misses           0      294      +294     
Impacted Files                                 Coverage Δ
resiliparse/resiliparse/beam/elasticsearch.py  92.85% <0.00%> (ø)
resiliparse/resiliparse/itertools.pyx          90.16% <0.00%> (ø)
resiliparse/resiliparse/parse/lang.pxd         80.95% <0.00%> (ø)
fastwarc/fastwarc/stream_io.pyx                92.62% <0.00%> (ø)
fastwarc/fastwarc/stream_io.pxd                57.89% <0.00%> (ø)
resiliparse/resiliparse/parse/html.pxd         69.23% <0.00%> (ø)
resiliparse/resiliparse/beam/textio.py         96.59% <0.00%> (ø)
resiliparse/resiliparse/process_guard.pyx      93.80% <0.00%> (ø)
resiliparse/resiliparse/extract/html2text.pyx  91.82% <0.00%> (ø)
resiliparse/resiliparse/parse/__init__.py      100.00% <0.00%> (ø)
... and 11 more