chfoo / warcat

Tool and library for handling Web ARChive (WARC) files.
GNU General Public License v3.0

Feature: extract WARCs specified with index/length #7

Open gwern opened 8 years ago

gwern commented 8 years ago

In some of the mega WARCs produced by Archive Team, extracting all the WARCs to save just a few is infeasible as it can take at least 2 days to extract them all using warcat.

One might have already checked the CDX files (to find which mega WARC to download) and so know the index (the byte offset into the megawarc) and the length. If you know these, it's possible to seek directly in the megawarc and extract the sequence of bytes which makes up a particular WARC. For example, using a CDX line like

[...] unk - HOYKQ63N2D6UJ4TOIXMOTUD4IY7MP5HM - - 1326824 19810951910 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz

I can handwrite the extraction using dd:

$ dd skip=19810951910 count=1326824 if=greader_20130604001315.megawarc.warc.gz of=2.warc.gz bs=1 && gunzip 2.warc.gz
1326824+0 records in
1326824+0 records out
1326824 bytes (1.3 MB) copied, 14.6218 s, 90.7 kB/s

Which is >11,200x faster than extracting everything in warcat and looking for the file I need.

The downside is needing to mess with dd, being totally inaccessible to non-programmers, being inconvenient in terms of scripting, etc.

It'd be great if warcat could include some additional arguments to the extract functionality, like a pair of --length=n and --index=i flags, to provide a nicer interface for pulling out a few WARCs.
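
To make the idea concrete, here is a minimal Python sketch of what such flags would need to do internally. It is not warcat's API: the function name `extract_record` and its parameters are made up for illustration, and it assumes (as the dd example does) that the bytes at the given index/length form one complete gzip member.

```python
import gzip


def extract_record(megawarc_path, offset, length, out_path):
    """Hypothetical helper: seek to `offset`, read `length` bytes, gunzip them.

    Returns the number of decompressed bytes written to `out_path`.
    """
    with open(megawarc_path, "rb") as src:
        src.seek(offset)          # jump straight to the record, no scanning
        chunk = src.read(length)  # one read instead of dd's bs=1 byte-by-byte copy
    if len(chunk) != length:
        raise ValueError("offset/length run past the end of the file")

    # Assumption: the slice is a self-contained gzip member.
    record = gzip.decompress(chunk)

    with open(out_path, "wb") as dst:
        dst.write(record)
    return len(record)
```

With the CDX line above, the call would look something like `extract_record("greader_20130604001315.megawarc.warc.gz", 19810951910, 1326824, "2.warc")`, where the offset is the second-to-last number before the filename and the length is the one before it.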

This would also go very well with HTTP Range support; then you could look up the index/length in a CDX file, seek right to the specific binary sequence on Archive.org, and download only the few MB you need instead of, say, a giant 52GB megawarc. (You could imagine doing an on-demand extraction service using this: store only the master index on your server, and when a user requests a particular file, extract the WARC index/length from the master index, call warcat to extract the specific WARC from the IA-hosted megawarc, and return that to the user. So you don't need to store all 9TB or whatever.)
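
A rough sketch of the Range idea in Python, using only the standard library. The names `range_header` and `fetch_record` are hypothetical, the URL is whatever the megawarc's download link happens to be, and it assumes the server honours Range requests (answering 206 Partial Content) and that the slice is a complete gzip member.

```python
import gzip
import urllib.request


def range_header(offset, length):
    """Build a Range header value; HTTP byte ranges are inclusive at both ends."""
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(url, offset, length):
    """Hypothetical helper: download only one record's bytes and gunzip them."""
    req = urllib.request.Request(url, headers={"Range": range_header(offset, length)})
    with urllib.request.urlopen(req) as resp:
        # 206 means the server sent just the requested slice; a plain 200
        # would mean it ignored the header and is sending the whole file.
        if resp.status != 206:
            raise RuntimeError(f"expected 206 Partial Content, got {resp.status}")
        return gzip.decompress(resp.read())
```

For the CDX line above, `range_header(19810951910, 1326824)` gives `bytes=19810951910-19812278733`, so the request transfers about 1.3 MB rather than the whole megawarc.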

chfoo commented 8 years ago

Also, to add: Warcat currently uses Python's built-in HTTP library, which does not handle the edge cases that web browsers do.