Closed rwoodpecker closed 8 years ago
It looks like you're hitting https://github.com/chfoo/warcat/blob/master/warcat/tool.py#L344
I think WARC-Concurrent-To: on a response record is supposed to point to the request record, but for FTP WARCs, wpull writes a value that refers to something that isn't written to the WARC. It doesn't look like FTP "requests" are recorded at all. wpull says it follows the Heritrix spec for FTP WARCs, but this blog post suggests Heritrix writes blank WARC-Concurrent-To: values for FTP WARCs.
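To see which records carry a WARC-Concurrent-To: value that points at nothing in the file, something like the following rough sketch might help. It is not warcat's code: the function names are made up for illustration, it only parses the named header fields, it ignores folded header lines, and it checks against every record ID in the file rather than only those "seen yet", so it is order-independent where warcat's check apparently is not.

```python
import gzip
import io


def open_warc(path):
    """Open a WARC for binary reading; .gz files are decompressed on the fly."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def iter_warc_headers(stream: io.BufferedIOBase):
    """Yield {lowercased field name: [values]} for each record in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers.setdefault(name.strip().lower(), []).append(value.strip())
        # Skip the content block so body bytes are never parsed as headers.
        stream.read(int(headers["content-length"][0]))
        yield headers


def find_dangling_concurrent_to(stream):
    """Return (record_id, target) pairs whose target ID is absent from the file."""
    records = list(iter_warc_headers(stream))
    known = {h["warc-record-id"][0] for h in records if "warc-record-id" in h}
    dangling = []
    for h in records:
        for target in h.get("warc-concurrent-to", []):
            if target not in known:
                dangling.append((h.get("warc-record-id", [None])[0], target))
    return dangling
```

Passing the result of open_warc() on one of the FTP WARCs to find_dangling_concurrent_to() should, if the diagnosis above is right, list every response record. A blank WARC-Concurrent-To: value is reported as dangling as well.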
In any case, please file this on wpull, because grab-site doesn't have any control over how WARCs are written.
You can also try commenting out https://github.com/chfoo/warcat/blob/b97eed3b34b04133707074537331c71d66a412c4/warcat/tool.py#L271 in your copy of warcat and see if it works when skipping the verification step.
Well, it verified fine after commenting out line 271 of tool.py!
I'll file this on wpull at some stage today.
Thanks.
First off, apologies if this is intended behaviour (or inherent in WARCs or something), but it doesn't seem right to me.
I've just tried a few grabs tonight on FTPs. Latest version of grab-site, no special arguments. Everything finishes as expected. However, the WARCs just seem to be broken. If I try to extract the files with warcat or an unzip tool that handles WARCs, nothing comes out. I also tried a warcat verify and it reports thousands of 'problems', one for every record.
  File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py", line 345, in verify_concurrent_to
    record_id), major=False)
warcat.tool.VerifyProblem: ('Concurrent Record ID urn:uuid: not seen yet', None, False)
webarchiveplayer literally displays the WARC as empty - but the file size is proportional to the grab.
I can reproduce this with any FTP grab on any server. HTTP(S) grabs have no such problems: everything works as expected, and the WARCs verify and extract just fine.