chfoo / warcat

Tool and library for handling Web ARChive (WARC) files.
GNU General Public License v3.0

wpull WARCs cause "Content block length changed from X to Y" warnings on warcinfo record #18

Open JustAnotherArchivist opened 5 years ago

JustAnotherArchivist commented 5 years ago

WARCs written by wpull (at least version 1.2.3) trigger a "Content block length changed from X to Y" warning on their warcinfo records when checked with warcat's verify command. Example:

> wpull --version
1.2.3
> wpull https://example.org/ --warc-file example.org --warc-max-size 1234567890 --delete-after
<snip>
> python3 -m warcat verify example.org-meta.warc.gz --verbose --verbose
INFO:warcat.model.warc:Opened gziped file example.org-meta.warc.gz
DEBUG:warcat.util:Creating buffer block file. index=0
DEBUG:warcat.util:Buffer block file created. length=4838
DEBUG:warcat.model.record:Record start at 0 0x0
DEBUG:warcat.model.field:Version line=WARC/1.0
DEBUG:warcat.model.record:Block length=3665
DEBUG:warcat.model.block:Field length=3665
DEBUG:warcat.model.block:Payload length=0
WARNING:warcat.model.record:Content block length changed from 3665 to 3656
DEBUG:warcat.model.warc:Finished reading a record <urn:uuid:eb5182ba-c3fc-41a0-8be8-649c463c5c1d>
DEBUG:warcat.util:Creating buffer block file. index=0
DEBUG:warcat.util:Buffer block file created. length=4838
DEBUG:warcat.model.binary:Creating safe file of example.org-meta.warc.gz
DEBUG:warcat.tool:Block digest ok
DEBUG:warcat.model.record:Record start at 3986 0xf92
DEBUG:warcat.model.field:Version line=WARC/1.0
DEBUG:warcat.model.record:Block length=511
DEBUG:warcat.model.block:Binary content block length=511
DEBUG:warcat.model.warc:Finished reading a record <urn:uuid:3f2124c6-b9f8-4a12-9870-d93ea45335d8>
INFO:warcat.model.warc:Finished reading Warc
DEBUG:warcat.model.binary:Creating safe file of example.org-meta.warc.gz
DEBUG:warcat.tool:Block digest ok

The difference between the two numbers (3665 - 3656 = 9 here) is exactly the number of lines in that warcinfo record body. I doubt that's a coincidence, but I couldn't pin down the origin from a brief look at the source code. My guess is that the block length gets recalculated from the normalised field representation, which would then be one byte shorter per line than what wpull wrote. If that interpretation is correct, I think the warning should be suppressed in verify mode, since it's irrelevant when no new WARC is being written out.
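
To illustrate the kind of effect I suspect, here is a minimal sketch of how re-serialising parsed fields can shift the length by exactly one byte per line. This is not warcat's code: `parse_fields` and `serialise_fields` are hypothetical helpers, the sample body is made up, and the CRLF-to-LF difference is only one possible cause (an assumption). Any normalisation that drops one byte per field line would produce the same pattern.

```python
# Hypothetical illustration only; these helpers are NOT part of warcat or wpull.

def parse_fields(body):
    """Parse 'Name: value' lines into (name, value) pairs.

    The original line endings and surrounding whitespace are discarded,
    so the parsed form no longer remembers the exact input bytes.
    """
    fields = []
    for line in body.decode("utf-8").splitlines():
        if not line:
            continue
        name, _, value = line.partition(":")
        fields.append((name.strip(), value.strip()))
    return fields


def serialise_fields(fields, newline="\n"):
    """Write the pairs back out in one fixed, normalised style."""
    text = "".join(
        "{}: {}{}".format(name, value, newline) for name, value in fields
    )
    return text.encode("utf-8")


# Made-up warcinfo-style body using CRLF line endings (assumption).
original = (
    b"software: ExampleCrawler/1.0\r\n"
    b"format: WARC File Format 1.0\r\n"
    b"robots: off\r\n"
)

fields = parse_fields(original)
normalised = serialise_fields(fields)

line_count = len(fields)
delta = len(original) - len(normalised)

# One byte lost per line: the length difference equals the line count.
assert delta == line_count
print("original:", len(original), "normalised:", len(normalised), "delta:", delta)
```

If something like this is happening, the stored Content-Length is still correct for the bytes on disk, and only the in-memory re-serialisation disagrees with it. That would be consistent with the "Block digest ok" lines in the output above.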