chatnoir-eu / chatnoir-resiliparse

A robust web archive analytics toolkit
https://resiliparse.chatnoir.eu
Apache License 2.0

pipx run fastwarc check failed: binascii.Error: Non-base32 digit found #19

MaxPeal closed this issue 2 years ago

MaxPeal commented 2 years ago
$ pipx run --verbose fastwarc check /tmp/warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
pipx >(setup:729): pipx version is 1.0.0
pipx >(setup:730): Default python interpreter is '/home/user/.local/pipx/venvs/pipx/bin/python'
pipx >(needs_upgrade:69): Time since last upgrade of shared libs, in seconds: 1561898. Upgrade will be run by pipx if greater than 2592000.
pipx >(run_subprocess:172): running /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/python -c import sysconfig; print(sysconfig.get_path('purelib'))
pipx >(run:103): Reusing cached venv /home/user/.local/pipx/.cache/7a73b1e86637c39
pipx >(run_subprocess:172): running /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/python -c import sysconfig; print(sysconfig.get_path('purelib'))
pipx >(exec_app:387): exec_app: /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc check /tmp/warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
0 records were verified successfully.                           
1 records were skipped without digest.
Error in sys.excepthook:
Traceback (most recent call last):
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/fastwarc/cli.py", line 138, in check
    for v in pbar:
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "fastwarc/tools.pyx", line 178, in verify_digests
  File "fastwarc/warc.pyx", line 922, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 934, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 872, in fastwarc.warc.WarcRecord._verify_digest
  File "/usr/lib/python3.9/base64.py", line 231, in b32decode
    raise binascii.Error('Non-base32 digit found') from None
binascii.Error: Non-base32 digit found

Original exception was:
Traceback (most recent call last):
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/fastwarc/cli.py", line 138, in check
    for v in pbar:
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "fastwarc/tools.pyx", line 178, in verify_digests
  File "fastwarc/warc.pyx", line 922, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 934, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 872, in fastwarc.warc.WarcRecord._verify_digest
  File "/usr/lib/python3.9/base64.py", line 231, in b32decode
    raise binascii.Error('Non-base32 digit found') from None
binascii.Error: Non-base32 digit found
$
phoerious commented 2 years ago

This looks like a non-standard WARC record digest. Can you post an example of where this happens?

MaxPeal commented 2 years ago

I created the WARC file with warcprox: https://github.com/internetarchive/warcprox

MaxPeal commented 2 years ago
(venv) user@box:/tmp/warcs$ sha1sum WARCPROX-20220315191329244-00000-icvgw961.warc* | tee WARCPROX-20220315191329244-00000-icvgw961.warc.sha1
5cfa65c0cb6cf7aeed36be9a812dedbd7d2f7add  WARCPROX-20220315191329244-00000-icvgw961.warc
c220d5ea3067eadb3ae6caa39b3ac919eeccb23e  WARCPROX-20220315191329244-00000-icvgw961.warc.tar.gz
(venv) user@box:/tmp/warcs$ 

WARCPROX-20220315191329244-00000-icvgw961.warc.tar.gz

phoerious commented 2 years ago

The file you uploaded is not a valid GZip file (although its hash matches the one you posted), so I cannot open it.

phoerious commented 2 years ago

The file seems to be a mixture of text and binary, but I can see what your original problem is: the digest hash is stored as hex rather than as Base32, and Base32 is what the WARC spec requires.

I'll add support for that later, but it's non-standard and worth a bug report to warcprox.
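
As an illustration of the failure mode described above: a SHA-1 digest is 20 bytes, which is 40 characters in hex but 32 characters in Base32, and hex digits such as 0, 1, 8 and 9 (as well as lowercase letters) are outside the Base32 alphabet. The sketch below uses only the Python standard library. The sample value is the 40-character hex string from the `sha1sum` output earlier in this thread, used purely as an example of a hex-encoded SHA-1, and the helper names are hypothetical, not part of FastWARC.

```python
import base64
import binascii

# Sample hex-encoded SHA-1 value (taken from the sha1sum output in this thread).
HEX_DIGEST = "5cfa65c0cb6cf7aeed36be9a812dedbd7d2f7add"


def b32decode_error(value):
    """Return the error message raised by b32decode, or None if decoding works."""
    try:
        base64.b32decode(value)
    except binascii.Error as exc:
        return str(exc)
    return None


def hex_to_base32(hex_digest):
    """Re-encode a hex digest as Base32 (hypothetical helper, not a FastWARC API)."""
    return base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")


# Feeding the hex string to a Base32 decoder fails like the traceback above.
print(b32decode_error(HEX_DIGEST))

# The same 20 bytes re-encoded as Base32 decode without error.
print(hex_to_base32(HEX_DIGEST))
```

Both encodings describe the same bytes, so a reader can convert between them losslessly; the error only arises when the decoder assumes one encoding and receives the other.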

MaxPeal commented 2 years ago

I packed the WARC file with tar. Am I missing something? Jhove, installed via apt on Debian 11, says it is valid:

(venv) user@box:/tmp$ jhove -k warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
Jhove (Rel. 1.20.0, 2019-01-19)
 Date: 2022-03-16 18:41:20 CET
 RepresentationInformation: warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
  ReportingModule: BYTESTREAM, Rel. 1.3 (2007-04-10)
  LastModified: 2022-03-15 20:13:31 CET
  Size: 16927625
  Format: bytestream
  Status: Well-Formed and valid
  MIMEtype: application/octet-stream
  Checksum: 2371829d
   Type: CRC32
  Checksum: 297bd32582ca019fb5922efb8d74b1a4
   Type: MD5
  Checksum: 5cfa65c0cb6cf7aeed36be9a812dedbd7d2f7add
   Type: SHA-1
(venv) user@box:/tmp$ 
MaxPeal commented 2 years ago

Am I wrong with this interpretation? Feedback is welcome.

If I am not misreading the discussion about a clarification of the specification, storing the digest hash as hex rather than as Base32 is permitted by the WARC spec: https://github.com/iipc/warc-specifications/issues/29 https://github.com/webrecorder/warcio/issues/74#issuecomment-487816378

phoerious commented 2 years ago

> I packed the WARC file with tar.

Yeah, I figured. But no, wrapping a WARC in tar does not produce a valid compressed WARC file, and tar is not a compression algorithm either. A compressed WARC file is a series of records, each compressed individually with gzip. I do not recommend that you try to do that manually. An uncompressed .warc file is perfectly valid, although space-inefficient.
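
A rough sketch of the difference, using only the Python standard library: each record is compressed as its own gzip member and the members are concatenated, whereas a plain tar archive is not gzip data at all. The record contents below are placeholders rather than real WARC records, and the function names are made up for this illustration.

```python
import gzip
import io
import tarfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def compress_records(records):
    """Compress each record as a separate gzip member and concatenate them."""
    return b"".join(gzip.compress(record) for record in records)


def looks_like_gzip(data):
    """Cheap check: does the data start with the gzip magic bytes?"""
    return data[:2] == GZIP_MAGIC


def tar_bytes(name, content):
    """Build an uncompressed tar archive in memory containing one file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:  # mode "w": no compression
        info = tarfile.TarInfo(name=name)
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


# Placeholder payloads standing in for two WARC records.
records = [b"placeholder record one\r\n\r\n", b"placeholder record two\r\n\r\n"]

per_record = compress_records(records)
archive = tar_bytes("example.warc", b"".join(records))

print(looks_like_gzip(per_record))  # gzip members start with the magic bytes
print(looks_like_gzip(archive))     # a plain tar archive does not
# A multi-member gzip stream decompresses back to the concatenated records.
print(gzip.decompress(per_record) == b"".join(records))
```

Because every record is its own gzip member, a reader can seek to a record's offset and decompress just that record, which is the point of compressing records individually.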

> the digest hash stored as hex, not as Base32, is possible by the WARC spec.

The WARC specification makes no mention of hex-encoded digests. As per the specification, these should be Base32, although the specification only gives Base32 as an example and does not explicitly say that no other encoding is allowed: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/#warc-block-digest
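
Given that ambiguity, a tolerant reader can accept either encoding. The following is a minimal sketch under stated assumptions, not FastWARC's actual implementation: it assumes the header value has the form `algorithm:digest`, tries both hex and Base32, and keeps whichever decodes to the byte length expected for the algorithm. The function names `decode_digest` and `verify_digest` are hypothetical.

```python
import base64
import hashlib


def decode_digest(encoded, size):
    """Decode a digest that may be hex or Base32; return the raw bytes.

    A candidate is accepted only if it decodes to exactly `size` bytes.
    """
    candidates = []
    try:
        candidates.append(bytes.fromhex(encoded))
    except ValueError:
        pass
    try:
        # binascii.Error is a subclass of ValueError, so this covers both.
        candidates.append(base64.b32decode(encoded.upper()))
    except ValueError:
        pass
    for raw in candidates:
        if len(raw) == size:
            return raw
    raise ValueError("unrecognised digest encoding: %r" % encoded)


def verify_digest(header_value, block):
    """Check a labelled digest such as 'sha1:<digest>' against the block bytes."""
    algorithm, _, encoded = header_value.partition(":")
    hasher = hashlib.new(algorithm.strip().lower(), block)
    return decode_digest(encoded.strip(), hasher.digest_size) == hasher.digest()


block = b"example record block"
sha1 = hashlib.sha1(block)
hex_header = "sha1:" + sha1.hexdigest()
b32_header = "sha1:" + base64.b32encode(sha1.digest()).decode("ascii")

print(verify_digest(hex_header, block))   # hex-encoded digest is accepted
print(verify_digest(b32_header, block))   # Base32-encoded digest is accepted
print(verify_digest(b32_header, b"tampered block"))
```

Checking the decoded length rather than the string length matters for algorithms such as MD5, where the hex form and the padded Base32 form are both 32 characters long.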

phoerious commented 2 years ago

FastWARC now supports hex digests. The new wheels should be up on PyPI as soon as this run is done: https://github.com/chatnoir-eu/chatnoir-resiliparse/actions/runs/1995457297