alard / wget-warc

This is an old version of the WARC patches. Wget v1.14 and higher has WARC support.
https://www.gnu.org/software/wget/

Should handle WARC max file size #1

Closed alard closed 13 years ago

alard commented 13 years ago

If the WARC file reaches its maximum size (which is set to 1GB at the moment), WFile_storeRecord returns an error. wget-warc should open a new WARC file if this happens.

db48x commented 13 years ago

The maximum size should be configurable as well.

db48x commented 13 years ago

I just pushed a change that adds an option for the maximum size, defaulting to 1GB. I see you've already implemented the failover to a new file, but it has a bug: the record that would have pushed it over the maximum doesn't get written to the new file.

db48x commented 13 years ago

Actually, it's only the records larger than the allowed size of the WARC file that don't get written. Testing with a 1k WARC file hits that fairly often :)
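To illustrate the failure mode described here, a minimal sketch of a writer that checks the size before writing. This is not the wget-warc code: `PreCheckWriter`, `store`, and the use of plain integers as record sizes are all hypothetical, and the sketch only shows one way a pre-write check can end up dropping oversized records.

```python
class PreCheckWriter:
    """Hypothetical writer that checks the size limit BEFORE writing a record."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.files = [[]]    # each "file" is a list of record sizes
        self.dropped = []    # records that were never written anywhere

    def _current_size(self):
        return sum(self.files[-1])

    def store(self, record_size):
        if self._current_size() + record_size > self.max_size:
            # The record does not fit: fail over to a fresh file.
            self.files.append([])
            if record_size > self.max_size:
                # It does not fit in an empty file either, so it is lost.
                self.dropped.append(record_size)
                return False
        self.files[-1].append(record_size)
        return True


writer = PreCheckWriter(max_size=1024)
for size in (300, 500, 4000, 200):
    writer.store(size)
```

With a 1024-byte limit, the 4000-byte record in this run ends up in `writer.dropped` rather than in any file, which is why a very small limit exposes the problem quickly.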

alard commented 13 years ago

I changed the way the file size is handled: it is now checked after the record has been written. This means that files can become (somewhat) larger than the limit, but I think it's better this way (and Heritrix does this too).
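A rough sketch of the check-after-write idea, assuming in-memory files for simplicity. The names `PostCheckWriter` and `store` are made up for illustration and do not correspond to the actual wget-warc or warctools functions.

```python
import io


class PostCheckWriter:
    """Hypothetical writer that checks the size limit AFTER writing a record."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.files = []        # every in-memory "file" opened so far
        self.current = None    # file currently being written, if any

    def store(self, record):
        # Open a new file lazily, only when there is something to write.
        if self.current is None:
            self.current = io.BytesIO()
            self.files.append(self.current)
        # Always write the whole record first, so nothing is ever dropped.
        self.current.write(record)
        # Only then check the size; the next record starts a new file.
        if self.current.tell() >= self.max_size:
            self.current = None


writer = PostCheckWriter(max_size=1024)
for size in (300, 500, 4000, 200):
    writer.store(b"x" * size)
sizes = [len(f.getvalue()) for f in writer.files]
```

In this run the first file grows to 4800 bytes, well past the 1024-byte limit, but every record is written exactly once and the 200-byte record lands in a second file.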

db48x commented 13 years ago

Ah, that looks good. I think that's about it for this one.