ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

FTP grabs non-functional WARCs #87

Closed: rwoodpecker closed this issue 8 years ago

rwoodpecker commented 8 years ago

First off, apologies if this is intended behaviour or something inherent in WARCs, but it doesn't seem right to me.

I've just tried a few grabs tonight on FTP servers, using the latest version of grab-site with no special arguments. Every grab finishes as expected, but the WARCs seem to be broken. If I try to extract the files with warcat or an unzip tool that handles WARCs, nothing comes out. I also ran warcat verify and it reports thousands of 'problems', one for every record:

    File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py", line 345, in verify_concurrent_to
      record_id), major=False)
    warcat.tool.VerifyProblem: ('Concurrent Record ID urn:uuid: not seen yet', None, False)

webarchiveplayer displays the WARC as completely empty, even though the file size is proportional to the size of the grab.

I can reproduce this with any FTP site on any server. HTTP(S) grabs have no such problems: they verify and extract just fine.

ivan commented 8 years ago

It looks like you're hitting https://github.com/chfoo/warcat/blob/master/warcat/tool.py#L344

I think WARC-Concurrent-To: on a response record is supposed to point to the request record, but for FTP WARCs, wpull writes a value that refers to something that isn't written to the WARC. It doesn't look like FTP "requests" are recorded at all. wpull says it follows the Heritrix spec for FTP WARCs, but this blog post suggests Heritrix writes blank WARC-Concurrent-To: values for FTP WARCs.

In any case, please file this on wpull, because grab-site doesn't have any control over how WARCs are written.
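To check whether a WARC has this kind of dangling reference without going through warcat, something like the following may help. It is only a rough sketch using the Python standard library: the function names are made up for illustration, the header parsing is deliberately minimal (no continuation lines, and only the last value is kept if a header repeats), and it assumes every record carries a correct Content-Length. Unlike warcat, which complains when the target has not been seen yet, it looks for the target anywhere in the file.

```python
import gzip
import io


def iter_warc_headers(stream):
    """Yield a dict of lower-cased header fields for each WARC record."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # end of the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip the record body so the next read lands on the next record.
        stream.read(int(headers.get("content-length", "0")))
        yield headers


def dangling_concurrent_to(stream):
    """Return (record_id, target) pairs whose target ID is not in the file."""
    records = list(iter_warc_headers(stream))
    seen = {h.get("warc-record-id") for h in records}
    return [
        (h.get("warc-record-id"), h["warc-concurrent-to"])
        for h in records
        # A blank WARC-Concurrent-To value is not treated as a reference.
        if h.get("warc-concurrent-to") and h["warc-concurrent-to"] not in seen
    ]


def open_warc(path):
    """Open a .warc or .warc.gz file for binary reading."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


if __name__ == "__main__":
    # Hand-built record that points at a record ID never written to the file.
    demo = (
        b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"WARC-Record-ID: <urn:uuid:1>\r\n"
        b"WARC-Concurrent-To: <urn:uuid:2>\r\n"
        b"Content-Length: 5\r\n"
        b"\r\n"
        b"hello\r\n\r\n"
    )
    print(dangling_concurrent_to(io.BytesIO(demo)))
```

For a real file, `dangling_concurrent_to(open_warc(path))` would list each record whose WARC-Concurrent-To value has no matching WARC-Record-ID; a non-empty result for an FTP grab would be consistent with the behaviour described above.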

ivan commented 8 years ago

You can also try commenting out https://github.com/chfoo/warcat/blob/b97eed3b34b04133707074537331c71d66a412c4/warcat/tool.py#L271 in your copy of warcat and see if it works when skipping the verification step.

rwoodpecker commented 8 years ago

Well, it verified fine after commenting out line 271 of tool.py!

I'll file this on wpull at some stage today.

Thanks.