ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Document --no-warc-compression #131

Closed Dri0m closed 2 years ago

Dri0m commented 6 years ago

The compression is unwanted in some cases, e.g. when I'm scraping to a drive with filesystem compression, or when I want to apply a stronger compression algorithm after I'm done scraping.

ethus3h commented 6 years ago

You can use --wpull-args=--no-warc-compression to do this, by the way.

Dri0m commented 6 years ago

That's good to know, thanks! I guess this issue is solved then.

ivan commented 6 years ago

Thanks ethus3h, I'll probably just document that in the README.

Also note that request/response records in .warc.gz files are individually compressed, and if you ever plan to send them to the Internet Archive, I believe they expect them to be compressed that way. Running gzip on an uncompressed .warc compresses the whole file as a single stream rather than record by record, so random access will not work.
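To illustrate the point about per-record compression, here is a minimal Python sketch. It is not grab-site or wpull code: the records are toy byte strings rather than real WARC records, and the helper names `compress_per_record` and `read_record_at` are made up for this example. It only shows why a file built from one gzip member per record can be read at a known offset, while a file gzipped as a whole cannot.

```python
import gzip
import io
import zlib

# Toy stand-ins for WARC records (assumption: real records carry full WARC headers).
records = [
    b"WARC/1.0\r\nrecord one\r\n\r\n",
    b"WARC/1.0\r\nrecord two\r\n\r\n",
    b"WARC/1.0\r\nrecord three\r\n\r\n",
]


def compress_per_record(recs):
    """Write each record as its own gzip member; return the bytes and member offsets."""
    out = io.BytesIO()
    offsets = []
    for rec in recs:
        offsets.append(out.tell())      # compressed offset where this member starts
        out.write(gzip.compress(rec))   # one complete gzip member per record
    return out.getvalue(), offsets


def read_record_at(data, offset):
    """Decompress only the gzip member that starts at the given compressed offset."""
    decomp = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip container format
    return decomp.decompress(data[offset:])  # stops at the end of that one member


per_record, offsets = compress_per_record(records)

# Random access: jump straight to the second record without reading the first.
second = read_record_at(per_record, offsets[1])

# The concatenated members are still a valid gzip file as a whole.
everything = gzip.decompress(per_record)

# Gzipping the whole uncompressed file yields a single member, so there is no
# compressed offset at which the second record can be decompressed on its own.
whole_file = gzip.compress(b"".join(records))
single = zlib.decompressobj(wbits=31)
whole_output = single.decompress(whole_file)
```

In the per-record layout an index only needs to store each member's compressed offset. With the single-stream layout, reaching a later record means decompressing everything before it.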

ivan commented 2 years ago

Now documented in 6269289a2ca874bae52f116016ca54dc8887d0cc