facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Use longer compression window size by default: otherwise compressing Acronis image more than twice the size compared to Winrar #3159

Closed JsBergbau closed 1 year ago

JsBergbau commented 2 years ago

Given: the Acronis TrueImage 2021 Boot-CD ISO image, file size 706 MB. Compressed with Winrar, the file size is 178 MB. ZSTD at maximum settings in Peazip produces 393 MB, and zstd --ultra -22 -T16 AcronisTrueImage2021BootCD.iso produces 383 MB. That is still more than twice the size of the Winrar result.

So Winrar must do some trick to get this massive compression.

pigz -v9 AcronisTrueImage2021BootCD.iso results in 675 MB, so ZSTD is already a massive improvement over gzip, but I still think this example is worth examining to see what is going on and how zstd could be improved.

You can download the iso here https://archive.org/details/acronis_2021 sha256sum a907788710997da7b413d49c8ab124019e836ca6552341e92a54b3d346472059

JsBergbau commented 2 years ago

Some more figures. I realized later that I had forgotten to use ultra compression for the MediaWiki dump. It is very impressive how much better this mode is. On the other hand, --ultra -22 is much slower than Winrar. --long=31 seems to do the trick.

used zstd: *** zstd command line interface 64-bits v1.5.2, by Yann Collet ***

Mediawiki mysql dump: 1870 MB plaintext
Winrar at maximum compression level: 11.3 MB (!)
ZSTD -T0 -12: 66.5 MB
ZSTD -T0 -19: 41.95 MB
ZSTD -T0 --ultra -22: 13.5 MB
ZSTD -T0 -15 --long=31: 12.44 MB
GZIP default compression level: 412.6 MB

So we really should figure out why Winrar is so much better.

On the other hand, there are also examples where ZSTD beats Winrar and where the different levels make little difference.

Matomo mysql dump: 2948 MB plaintext
Winrar at maximum compression level: 443.2 MB
ZSTD -T0 -12: 451.6 MB
ZSTD -T0 -15: 448.6 MB
ZSTD -T0 -19: 408.7 MB
ZSTD -T0 --ultra -22: 403.3 MB
GZIP default compression level: 554.4 MB

JsBergbau commented 2 years ago

Update: ./zstd -15 --long=31 AcronisTrueImage2021BootCD.iso compresses to 175.9 MB, so even better than Winrar.

2048 MB is not much memory. So we should consider using longer windows for default.

JsBergbau commented 2 years ago

Update: Another MySQL-Dump (typo3) 915.3 MB in size

zstd -T0 -15 --long=31 results in a 26.97 MB compressed file
./zstd -T0 -15 results in 25.78 MB
So in this case the file is 4.6 % larger when using the longer compression window. Very strange.

Cyan4973 commented 2 years ago

--long mode should always be positive for high compression modes (btopt and above), starting at level 16. Below that point, however, --long is more like a "bet", which tends to be fine in "general" cases, but can occasionally go wrong. In this case, for example, the compression factor is very high (more than x30!), which means regular matches are already very long and therefore competitive with the ones found by --long.
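The "more than x30" factor and the 4.6 % penalty can be checked against the typo3 figures reported earlier in the thread. This is only a small arithmetic sketch; the variable and function names are made up for illustration, and the sizes are the rounded MB values quoted above.

```python
# Sizes in MB, as reported in the thread for the typo3 MySQL dump.
original_mb = 915.3
without_long_mb = 25.78   # zstd -T0 -15
with_long_mb = 26.97      # zstd -T0 -15 --long=31


def compression_factor(original: float, compressed: float) -> float:
    """How many times smaller the compressed file is than the original."""
    return original / compressed


factor_plain = compression_factor(original_mb, without_long_mb)
factor_long = compression_factor(original_mb, with_long_mb)

# Relative size increase caused by adding --long=31 at level 15.
penalty_pct = (with_long_mb / without_long_mb - 1) * 100

print(f"factor without --long: x{factor_plain:.1f}")
print(f"factor with --long=31: x{factor_long:.1f}")
print(f"size penalty from --long=31: {penalty_pct:.1f} %")
```

Both factors come out above 30, which matches the explanation that regular matches are already doing most of the work on this file.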

terrelln commented 2 years ago

We can't default to window sizes larger than 8MB, because that is the max window size we say all decoders should support in our spec. So you must explicitly opt into a larger window size with --long.
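As a rough illustration of that limit: --long=N requests a window of 2^N bytes, so any N above 23 goes past 8 MB. The helper names below are hypothetical and not part of zstd, and "8 MB" is assumed to mean 8 * 1024 * 1024 bytes.

```python
# Assumption: the 8 MB limit mentioned above is 8 * 1024 * 1024 bytes (2**23).
RECOMMENDED_MAX_WINDOW_BYTES = 8 * 1024 * 1024


def window_size_bytes(window_log: int) -> int:
    """Window size requested by --long=<window_log>, in bytes."""
    return 1 << window_log


def needs_explicit_opt_in(window_log: int) -> bool:
    """True if the window is larger than the limit decoders are asked to support."""
    return window_size_bytes(window_log) > RECOMMENDED_MAX_WINDOW_BYTES


for log in (23, 27, 29, 31):
    size_mib = window_size_bytes(log) // (1024 * 1024)
    print(f"windowLog {log}: {size_mib} MiB, opt-in needed: {needs_explicit_opt_in(log)}")
```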

scottcarey commented 2 years ago

"2048 MB is not much memory. So we should consider using longer windows for default."

On the contrary that is a MASSIVE amount of memory. If you had a server compressing on the fly to 100 clients, that would eat up 200GB of RAM. And then every client reading from those servers would need to support a 2GB window. The per-stream memory overhead to decompress zstd is an important limit that allows it to be used in a broader set of use cases.

Not everyone uses zstd primarily for archival storage. On the contrary, because it is rather fast at both compression and decompression it is often used for dynamic data compression or use cases where the compression has to keep up with multiple continuous inbound data streams.

WinRar is nearly exclusively used for archives, usually no more than one task at a time, and can require a larger chunk of memory.

Would you want your phone to allocate 2GB of memory just to download compressed app updates from the web in the background? The 8MB default limit is what lets zstd be an option for use cases like this -- it's a compromise that lets any standard/generic implementation operate in a small footprint no matter who compressed the source data.

Just beware: if you use --long=31, you are requiring anyone who decompresses the data to potentially need 2GB of RAM to do so. --long=29 might be more than enough and would require 4x less RAM.
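The numbers in this comment can be reproduced with a back-of-the-envelope calculation. The sketch below counts only the window itself (2^N bytes per stream) and ignores any other per-stream overhead, so treat it as a lower bound; the function name is made up for illustration.

```python
GIB = 2 ** 30


def total_window_memory_gib(window_log: int, streams: int) -> float:
    """Memory taken by the windows alone, in GiB, for `streams` concurrent streams."""
    return streams * (1 << window_log) / GIB


# 100 concurrent streams, as in the server example above.
mem_long31 = total_window_memory_gib(31, 100)
mem_long29 = total_window_memory_gib(29, 100)

print(f"--long=31, 100 streams: {mem_long31:.0f} GiB")
print(f"--long=29, 100 streams: {mem_long29:.0f} GiB")
print(f"ratio: {mem_long31 / mem_long29:.0f}x")
```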

terrelln commented 1 year ago

Closing as there is no immediate action to be taken. We can't, and don't want to, default zstd to using more than 8MB of memory, unless you explicitly opt into it with --long or --ultra.

The only action we can reasonably take is to raise awareness of --long.