chfoo / warcat

Tool and library for handling Web ARChive (WARC) files.
GNU General Public License v3.0

Extract performance is extremely slow on megawarcs #9

Open gwern opened 8 years ago

gwern commented 8 years ago

I was recently working with a megawarc from the Google Reader crawl, about 25GB in size, on an Amazon EC2 server. This took a few hours to download, and from past experience with gunzip, I knew it would take a similar amount of time to decompress and write to disk.

I tried running warcat's extraction feature on it, reasoning that it would run at near-gunzip speed, since this is a stream-processing task and the IO to write each WARC record to a different file should add minimal overhead. Instead, it was extremely slow despite being the only thing running on that server, taking what seemed like multiple seconds to extract each file. In top, warcat was using 100% CPU, even though un-gzipping should be IO-bound, not CPU-bound (which suggests an algorithmic problem somewhere to me). After 3 days it still had not extracted all the files from the megawarc, and I believe it was less than three-quarters done; unfortunately, it crashed at some point on the third day, so I never found out how long it would take. (I also don't know why the crash happened, and at 3 days to reach another crash with more logging, I wasn't going to find out.)

This slowness makes warcat not very useful for working with a megawarc, so I wound up looking for a completely different approach: dd with the offset/length from the CDX metadata for the specific WARC records I needed.
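For anyone hitting the same problem, the trick looks roughly like this in Python rather than dd (a sketch, not exactly what I ran; the offset and length are the compressed byte range of a record taken from the CDX index, and the field names there vary by CDX format):

```python
import gzip
import io

def extract_record(megawarc_path, offset, length, out_path):
    """Slice one gzipped record out of a megawarc and decompress it.

    `offset` and `length` are the compressed byte offset/length from the
    CDX index (the equivalent of `dd skip=... count=... bs=1`).
    """
    with open(megawarc_path, 'rb') as f:
        f.seek(offset)
        compressed = f.read(length)
    # Each record in a .warc.gz is its own gzip member, so it can be
    # decompressed independently of the rest of the file.
    with gzip.open(io.BytesIO(compressed), 'rb') as g, open(out_path, 'wb') as out:
        out.write(g.read())

# Hypothetical example values:
# extract_record('greader.megawarc.warc.gz', 123456789, 54321, 'record-0001.warc')
```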

chfoo commented 8 years ago

Yeah, the thing is ridiculously slow. You shouldn't expect gunzip performance because it's not just gunzip: when it's extracting, it has to parse each WARC record, since the record headers are human-readable text, and string processing in Python is not that great.
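Very roughly, each record is an HTTP-style header block followed by a body, so extraction ends up doing something like this per record (a simplified sketch to show the shape of the work, not warcat's actual code):

```python
def read_record(stream):
    """Read one WARC record from a decompressed byte stream (simplified).

    Real parsing also has to handle field continuations, encodings, and the
    blank lines between records; this just shows why it is line-by-line
    string work rather than a straight byte copy.
    """
    version = stream.readline().strip()          # e.g. b'WARC/1.0'
    if not version:
        return None                              # end of stream
    fields = {}
    for line in iter(stream.readline, b'\r\n'):  # header lines until the blank line
        name, _, value = line.decode('utf-8', 'replace').partition(':')
        fields[name.strip()] = value.strip()
    length = int(fields['Content-Length'])       # body length is declared in the header
    body = stream.read(length)                   # only this part is a bulk read
    stream.readline(); stream.readline()         # consume the two blank lines after the body
    return fields, body
```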

I've been meaning to write a faster implementation, but I haven't had the incentive to do so. :disappointed:

I am open to any suggestions.