facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

zstd --patch-from very inefficient for files about 2GiB #4182

Closed reyqn closed 4 weeks ago

reyqn commented 4 weeks ago

Describe the bug
I'm using --patch-from to generate diffs for very big files, and since the files are over the 2 GiB limit, I truncate them before calling zstd. However, if I truncate them at exactly 2 GiB, patch-from is very inefficient: truncating at 2 GB (decimal) produces diffs of about 40 MiB, while truncating at 2 GiB produces a 1.8 GiB diff, which is more than 40x worse.

I noticed that if I truncate somewhere between 2 GB and 2 GiB, the closer I get to 2 GiB, the less efficient patch-from becomes.

To Reproduce
Steps to reproduce the behavior:

  1. Take two versions of a file, each about 2 GiB
  2. Call zstd --patch-from v1 v2 (see the command sketch below)
  3. The generated zst file is huge, even though zstd usually compresses far better
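
A minimal sketch of the workflow, with placeholder names v1.bin / v2.bin for the two versions; the truncation commands and the --long value are assumptions, not taken from the report:

```sh
# Placeholder names: v1.bin and v2.bin are two versions of the same >2 GiB file.
# Truncate both at 2 GB (decimal) and at 2 GiB (binary), then compare patch sizes.
head -c 2000000000  v1.bin > v1.2GB    # 2 GB  = 2,000,000,000 bytes
head -c 2000000000  v2.bin > v2.2GB
head -c $((2**31))  v1.bin > v1.2GiB   # 2 GiB = 2,147,483,648 bytes
head -c $((2**31))  v2.bin > v2.2GiB

# --long=31 raises the window log so the whole reference file fits in the match window.
zstd --long=31 --patch-from=v1.2GB  v2.2GB  -o patch_2GB.zst
zstd --long=31 --patch-from=v1.2GiB v2.2GiB -o patch_2GiB.zst
ls -l patch_2GB.zst patch_2GiB.zst   # the 2 GiB patch comes out ~40x larger, as reported
```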

Expected behavior
The generated zst file should be about the same size for two 1 GiB files as for one 2 GiB file.


Additional context
As an aside, I'm not sure how to use paramgrill to choose compression levels for zstd --patch-from. Can paramgrill help with this? Should I run it against the old file, the new one, a concatenation of both, or not at all?

Thanks!

Cyan4973 commented 4 weeks ago

  1. It is generally expected that, as the file size grows, finding the right reference back into the dictionary becomes more difficult. This can probably be improved over time, but it is a fuzzy problem: the exact impact depends on the scenario, i.e. the type of content and the compression parameters.
  2. Speaking of which, which compression level are you using in your tests?
  3. paramgrill is unfortunately not designed to support --patch-from.
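
On point 2, a quick sweep along these lines makes the comparison concrete (old.bin / new.bin are placeholder names and the chosen levels are arbitrary); per the zstd manual, --single-thread can further improve --patch-from ratio at the cost of speed:

```sh
# Placeholder names: old.bin is the reference version, new.bin is the new version.
# Try a few compression levels and record the resulting patch sizes.
for lvl in 1 3 9 19; do
  zstd -$lvl --long=31 --single-thread -f \
       --patch-from=old.bin new.bin -o patch_level$lvl.zst
done
ls -l patch_level*.zst
```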

reyqn commented 4 weeks ago

I tried compression levels from 1 to 19, and even negative ones; they don't make much of a difference.

I just found it quite surprising that truncating the files at a different point could have a 40x impact on the ratio. I'm curious whether you have recommendations for where to truncate files. Right now 2 GB seems like a good choice: it's not much below the hard limit, and I haven't seen it backfire yet.

It's too bad paramgrill's results cannot translate to --patch-from at all; it would have been nice.

Cyan4973 commented 4 weeks ago

The classical test we do to check --patch-from efficiency is the linux kernel tarball update.

Right now, using the latest two releases as an example, compressing linux-6.12-rc5.tar from linux-6.12-rc4.tar (both ~1.5 GB in size), I'm getting a patch of ~30 MB, which seems fine.
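
For anyone who wants to reproduce this benchmark, a sketch of the commands, assuming both tarballs are already downloaded and decompressed locally; the level and --long value shown here are assumptions, not necessarily what was used above:

```sh
# Assumes linux-6.12-rc4.tar and linux-6.12-rc5.tar are already present locally.
# --long=31 lets the match window cover the whole ~1.5 GB reference tarball.
zstd -19 --long=31 --patch-from=linux-6.12-rc4.tar linux-6.12-rc5.tar -o rc4_to_rc5.patch.zst

# Applying the patch requires the same reference file and window setting.
zstd -d --long=31 --patch-from=linux-6.12-rc4.tar rc4_to_rc5.patch.zst -o linux-6.12-rc5.tar.check
cmp linux-6.12-rc5.tar linux-6.12-rc5.tar.check
```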

> I noticed that if I truncate somewhere between 2 GB and 2 GiB, the closer I get to 2 GiB, the less efficient patch-from becomes.

Which sizes have you tried?

reyqn commented 4 weeks ago

If memory serves right, I tried:

I don't have access to many versioned files over 2 GiB, but I originally changed my truncation point from 2 GiB to 1 GiB after noticing the same issue with other files, so I figured this wasn't an isolated case and that there might be a soft limit to what patch-from can do.

Cyan4973 commented 4 weeks ago

OK, so it's really at the tail end that things worsen: after 0x7FF00000, which is still fine and already very close to the full 2 GiB (that's basically the last MiB).

Things get worse at 0x7FFF0000, which is just 64 KiB shy of the 2 GiB limit. Since a full block is 128 KiB, I wouldn't be surprised if there were some kind of limit there.
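
A quick way to probe exactly this boundary, with placeholder names v1.bin / v2.bin for the two versions; the offsets are the ones discussed above, everything else is assumed:

```sh
# Truncate the reference/target pair at the two offsets discussed above:
#   0x7FF00000 = 2 GiB - 1 MiB, 0x7FFF0000 = 2 GiB - 64 KiB.
# v1.bin / v2.bin are placeholder names for the two versions.
for off in 0x7FF00000 0x7FFF0000; do
  head -c $((off)) v1.bin > v1.$off
  head -c $((off)) v2.bin > v2.$off
  zstd --long=31 -f --patch-from=v1.$off v2.$off -o patch.$off.zst
done
ls -l patch.0x7FF00000.zst patch.0x7FFF0000.zst
```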

reyqn commented 4 weeks ago

This is exactly the kind of explanation I hoped for.

Zstd has been nice to work with. Thanks for your work and your quick and insightful answers.