bitextor / warc2text

Extracts plain text, language identification and more metadata from WARC records
MIT License

ZSTD compression and compression level support #51

Closed: ZJaume closed 7 months ago

ZJaume commented 7 months ago

Users can now choose the compression format, ZSTD or GZIP (GZIP remains the default), and the compression level. This change builds on #50, because it is a feature we may really need if we use HTML output, which takes considerably more space than plain text.

The code now uses boost::iostreams::filtering_streambuf to switch easily between algorithms. The old GzipWriter class, which handled compression more manually, has been removed.

To see the actual changes introduced by this PR, see https://github.com/bitextor/warc2text/pull/51/commits/53709d06456b8e357fe93ce6ebb12b078ecd3084