facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

zstd --patch-from very inefficient for files about 2GiB #4182

Closed reyqn closed 4 weeks ago

reyqn commented 4 weeks ago

Describe the bug
I'm using --patch-from to generate diffs for very big files, and since the files are over the 2 GiB limit, I truncate them before calling zstd. However, if I truncate them at exactly 2 GiB, patch-from is very inefficient: truncating at 2 GB (decimal) produces diffs of about 40 MiB, while truncating at 2 GiB produces a 1.8 GiB diff, which is more than 40x worse.

I noticed that if I truncate somewhere between 2 GB and 2 GiB, the closer I get to 2 GiB, the less efficient patch-from becomes.

To Reproduce
Steps to reproduce the behavior:

  1. Take two versions of a file, each about 2 GiB
  2. Call zstd --patch-from v1 v2 (see the command sketch below)
  3. The generated zst file is huge, even though zstd usually compresses far better
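
A minimal sketch of the workflow, with placeholder names v1.bin / v2.bin for the two versions; the truncation commands and the --long value are assumptions, not taken from the report:

```sh
# Placeholder names: v1.bin and v2.bin are two versions of the same >2 GiB file.
# Truncate both at 2 GB (decimal) and at 2 GiB (binary), then compare patch sizes.
head -c 2000000000  v1.bin > v1.2GB    # 2 GB  = 2,000,000,000 bytes
head -c 2000000000  v2.bin > v2.2GB
head -c $((2**31))  v1.bin > v1.2GiB   # 2 GiB = 2,147,483,648 bytes
head -c $((2**31))  v2.bin > v2.2GiB

# --long=31 raises the window log so the whole reference file fits in the match window.
zstd --long=31 --patch-from=v1.2GB  v2.2GB  -o patch_2GB.zst
zstd --long=31 --patch-from=v1.2GiB v2.2GiB -o patch_2GiB.zst
ls -l patch_2GB.zst patch_2GiB.zst   # the 2 GiB patch comes out ~40x larger, as reported
```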

Expected behavior
The generated zst file should be about the same size for two 1 GiB files as for one 2 GiB file.


Additional context
As an aside, I'm not sure how to use paramgrill to choose compression levels for zstd --patch-from. Can paramgrill help with this? Should I run it against the old file, the new one, a concatenation of both, or not at all?

Thanks!

Cyan4973 commented 4 weeks ago

  1. It is generally expected that, as the file size grows, finding the right reference back into the dictionary becomes more difficult. This can probably be improved over time, but it is a fuzzy problem: the exact impact depends on the scenario, i.e. the type of content and the compression parameters.
  2. Speaking of which, which compression level are you using in your tests?
  3. paramgrill is unfortunately not designed to support --patch-from.
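
On point 2, a quick sweep along these lines makes the comparison concrete (old.bin / new.bin are placeholder names and the chosen levels are arbitrary); per the zstd manual, --single-thread can further improve --patch-from ratio at the cost of speed:

```sh
# Placeholder names: old.bin is the reference version, new.bin is the new version.
# Try a few compression levels and record the resulting patch sizes.
for lvl in 1 3 9 19; do
  zstd -$lvl --long=31 --single-thread -f \
       --patch-from=old.bin new.bin -o patch_level$lvl.zst
done
ls -l patch_level*.zst
```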

reyqn commented 4 weeks ago

I tried compression levels from 1 to 19, and even negative ones; they don't make much of a difference.

I just found it quite surprising that truncating the files at a different point could have a 40x impact on the ratio. I'm curious whether you have recommendations for where to truncate files. Right now 2 GB seems like a good choice: it's not much below the hard limit, and I haven't seen it backfire yet.

It's too bad paramgrill's results cannot translate to --patch-from at all; it would have been nice.

Cyan4973 commented 4 weeks ago

The classical test we do to check --patch-from efficiency is the linux kernel tarball update.

Right now, using the latest two releases as an example, compressing linux-6.12-rc5.tar from linux-6.12-rc4.tar (both ~1.5 GB in size), I'm getting a patch of ~30 MB, which seems fine.
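
For anyone who wants to reproduce this benchmark, a sketch of the commands, assuming both tarballs are already downloaded and decompressed locally; the level and --long value shown here are assumptions, not necessarily what was used above:

```sh
# Assumes linux-6.12-rc4.tar and linux-6.12-rc5.tar are already present locally.
# --long=31 lets the match window cover the whole ~1.5 GB reference tarball.
zstd -19 --long=31 --patch-from=linux-6.12-rc4.tar linux-6.12-rc5.tar -o rc4_to_rc5.patch.zst

# Applying the patch requires the same reference file and window setting.
zstd -d --long=31 --patch-from=linux-6.12-rc4.tar rc4_to_rc5.patch.zst -o linux-6.12-rc5.tar.check
cmp linux-6.12-rc5.tar linux-6.12-rc5.tar.check
```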

> I noticed that if I truncate somewhere between 2 GB and 2 GiB, the closer I get to 2 GiB, the less efficient patch-from becomes.

Which sizes have you tried?

reyqn commented 4 weeks ago

If memory serves right, I tried:

I don't have access to many versioned files over 2 GiB, but I originally changed my truncation point from 2 GiB to 1 GiB after noticing the same issue with other files, so I figured this wasn't an isolated case and that there might be a soft limit to what patch-from can do.

Cyan4973 commented 4 weeks ago

OK, so it's really at the tail end that things worsen: after 0x7FF00000, which is still fine and already very close to the full 2 GiB (that's basically the last MiB).

Things get worse at 0x7FFF0000, which is just 64 KiB shy of the 2 GiB limit. Since a full block is 128 KiB, I wouldn't be surprised if there were some kind of limit there.
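
A quick way to probe exactly this boundary, with placeholder names v1.bin / v2.bin for the two versions; the offsets are the ones discussed above, everything else is assumed:

```sh
# Truncate the reference/target pair at the two offsets discussed above:
#   0x7FF00000 = 2 GiB - 1 MiB, 0x7FFF0000 = 2 GiB - 64 KiB.
# v1.bin / v2.bin are placeholder names for the two versions.
for off in 0x7FF00000 0x7FFF0000; do
  head -c $((off)) v1.bin > v1.$off
  head -c $((off)) v2.bin > v2.$off
  zstd --long=31 -f --patch-from=v1.$off v2.$off -o patch.$off.zst
done
ls -l patch.0x7FF00000.zst patch.0x7FFF0000.zst
```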

reyqn commented 4 weeks ago

This is exactly the kind of explanation I hoped for.

Zstd has been nice to work with. Thanks for your work and your quick and insightful answers.