Closed MaxPeal closed 2 years ago
This looks like a non-standard WARC record digest. Can you post an example of where this happens?
I created the WARC file with warcprox: https://github.com/internetarchive/warcprox
(venv) user@box:/tmp/warcs$ sha1sum WARCPROX-20220315191329244-00000-icvgw961.warc* | tee WARCPROX-20220315191329244-00000-icvgw961.warc.sha1
5cfa65c0cb6cf7aeed36be9a812dedbd7d2f7add WARCPROX-20220315191329244-00000-icvgw961.warc
c220d5ea3067eadb3ae6caa39b3ac919eeccb23e WARCPROX-20220315191329244-00000-icvgw961.warc.tar.gz
(venv) user@box:/tmp/warcs$
The file you uploaded is not a valid gzip file, so I cannot open it, even though its hash matches the one you posted.
The file seems to be a mixture of text and binary, but I can see what your original problem is: the digest hash is stored as hex, not as Base32, which is required by the WARC spec.
I'll add support for that later, but it's non-standard and worth a bug report to warcprox.
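For reference, both encodings carry the same raw digest bytes, so a reader can accept either one. Below is a minimal sketch using only the Python standard library. The function names `hex_to_base32` and `digest_to_bytes` are made up for illustration and are not part of FastWARC or warcprox, and the length-based detection of hex is an assumption that holds for fixed-size hashes such as SHA-1.

```python
import base64
import hashlib


def hex_to_base32(hex_digest: str) -> str:
    """Re-encode a hex digest string as Base32 (same bytes, different text)."""
    return base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")


def digest_to_bytes(header_value: str) -> bytes:
    """Parse '<algorithm>:<digest>' where the digest is hex or Base32.

    Hypothetical helper: hex is detected purely by length, which is twice
    the digest size in bytes (40 characters for SHA-1).
    """
    algo, sep, encoded = header_value.partition(":")
    if not sep:
        raise ValueError("expected '<algorithm>:<digest>'")
    size = hashlib.new(algo).digest_size  # 20 bytes for sha1
    if len(encoded) == 2 * size:
        return bytes.fromhex(encoded)
    # Base32 input: restore any stripped '=' padding before decoding.
    padded = encoded.upper() + "=" * (-len(encoded) % 8)
    return base64.b32decode(padded)
```

With this, a hex value and its Base32 equivalent for the same record compare equal once both have been decoded to bytes.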
I packed the WARC file with tar. Am I missing something? Jhove, installed via apt on Debian 11, says it is valid:
(venv) user@box:/tmp$ jhove -k warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
Jhove (Rel. 1.20.0, 2019-01-19)
Date: 2022-03-16 18:41:20 CET
RepresentationInformation: warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
ReportingModule: BYTESTREAM, Rel. 1.3 (2007-04-10)
LastModified: 2022-03-15 20:13:31 CET
Size: 16927625
Format: bytestream
Status: Well-Formed and valid
MIMEtype: application/octet-stream
Checksum: 2371829d
Type: CRC32
Checksum: 297bd32582ca019fb5922efb8d74b1a4
Type: MD5
Checksum: 5cfa65c0cb6cf7aeed36be9a812dedbd7d2f7add
Type: SHA-1
(venv) user@box:/tmp$
Is my interpretation wrong? Feedback is welcome.
If I am not misreading the discussion about a clarification of the specification, storing the digest hash as hex rather than Base32 is permitted by the WARC spec: https://github.com/iipc/warc-specifications/issues/29 https://github.com/webrecorder/warcio/issues/74#issuecomment-487816378
> i packed the WARC file with tar.
Yeah, I figured. But no, wrapping a file in tar does not make a valid WARC file, and tar is not a compression algorithm either. A compressed WARC file is a series of records, each compressed individually as its own gzip member. I do not recommend trying to do that manually. An uncompressed .warc file is perfectly valid, although space-inefficient.
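To illustrate what per-record compression means, here is a rough sketch with the Python standard library. The `records` values are placeholder byte strings, not real WARC records, and `compress_records` is a hypothetical helper. It only shows the principle that each record becomes its own gzip member and that the members are simply concatenated.

```python
import gzip


def compress_records(records):
    """Compress each record separately and concatenate the gzip members."""
    return b"".join(gzip.compress(record) for record in records)


# Placeholder payloads standing in for serialized WARC records.
records = [b"record one\r\n\r\n", b"record two\r\n\r\n"]
blob = compress_records(records)

# A multi-member gzip stream still decompresses to the concatenated input.
restored = gzip.decompress(blob)
```

Because every record starts a fresh gzip member, a reader can seek to a record's offset and decompress just that record. A single `.tar.gz` around the whole file does not allow this.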
> the digest hash stored as hex, not as Base32, is possible by the WARC spec.
The WARC specification makes no mention of hex-encoded digests. As per the specification, these should be Base32, although it only mentions it as an example and does not explicitly say that no other encoding is allowed: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/#warc-block-digest
FastWARC now supports hex digests. The new wheels should be up on PyPI as soon as this CI run has finished: https://github.com/chatnoir-eu/chatnoir-resiliparse/actions/runs/1995457297