facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

--patch-from extensions and speed optimizations #2173

Open ghost opened 4 years ago

ghost commented 4 years ago

Is your feature request related to a problem? Please describe.

The --patch-from option only accepts files that are not larger than 4GB. Most of my files are >5GB and very similar; they differ only in language, so the patch would be about 200MB.

Describe the solution you'd like

I would like zstd to support files larger than 4GB for the --patch-from option.

Describe alternatives you've considered

No alternatives, because I couldn't find any solution for this error.

Additional context

PS Y:\zstd> .\zstd.exe --patch-from=.\en_windows_10_consumer_editions_version_2004_x64_dvd_8d28c5d7.iso .\de_windows_10_consumer_editions_version_2004_x64_dvd_7efdffc7.iso -o patch
zstd: error 42 : Can't handle files larger than 4 GB
bimbashrestha commented 4 years ago

Thanks for posting the issue.

The restriction is actually 2GB instead of 4GB. That was a mistake and I just merged a PR to fix it.

It would be nice to lift this restriction entirely at some point. But that will require quite a bit of work. I'm going to aim to have something like that for the next release if there is a lot of interest.

A good (easier) first step might be to just bump the restriction to 4GB. I'd have to look into how involved that would be but it would definitely be a smaller patch than allowing for arbitrarily large files.

ghost commented 4 years ago

What would be the main issue with supporting files larger than 4GB?

felixhandte commented 4 years ago

@luzeagithub, although the zstd codebase can compress arbitrarily large inputs, and although the Zstd format specification allows arbitrarily distant matches, this codebase internally uses 32-bit integers to represent match offsets. So in practice this implementation can only reference data within 4 GB of the current position in the stream.

--patch-from mode operates as if you're compressing the new version using the old one as a dictionary, which places them in sequence like this: [old file contents...][new file contents...]. If the old file is larger than 4 GB, this implementation will therefore be unable to internally represent matches from a given part of the new file to the corresponding part of the old file, which is the whole point of --patch-from.

This could be fixed by using 64-bit integers to represent offsets internally, but that would be a large, scary refactor...
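To make the arithmetic above concrete, here is a minimal Python sketch of the [old file contents...][new file contents...] layout. It is only a simplified model of the explanation in this comment, not zstd's actual internals (the limit enforced in practice is lower, as noted earlier in the thread). The names `match_distance` and `fits_in_u32` are hypothetical and exist only for this illustration.

```python
# Simplified model, NOT zstd's real code: shows why a 32-bit match offset
# cannot reach back across an old file that is larger than 4 GiB.

U32_MAX = 2**32 - 1  # largest value a 32-bit unsigned integer can hold
GIB = 2**30


def match_distance(old_size, pos_in_old, pos_in_new):
    """Distance back from a byte of the new file to a byte of the old file,
    assuming the conceptual layout [old file contents...][new file contents...]."""
    return (old_size + pos_in_new) - pos_in_old


def fits_in_u32(distance):
    """True if the distance can be stored in a 32-bit unsigned offset."""
    return 0 < distance <= U32_MAX


# For two very similar files, matching data sits at roughly the same position
# in both, so the distance is roughly the size of the old file.
print(fits_in_u32(match_distance(1 * GIB, 123, 123)))  # True
print(fits_in_u32(match_distance(5 * GIB, 123, 123)))  # False
```

In this model the distance for well-aligned files equals the old file's size, which is why a >5GB reference file cannot be addressed with 32-bit offsets.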

ghost commented 4 years ago

Ok, thanks for the explanation. It seems I have to use a 3rd-party program (smartversion) to do it. It also has zstd support (which I am using) and works with files >4GB.

bimbashrestha commented 4 years ago

I'm going to keep this issue open as a reminder that there is interest in this.

Cyan4973 commented 4 years ago

This item is present in our backlog, but no date has been set so far.

sergeevabc commented 3 years ago

Not sure if I should open a new issue for the following, so let me utter it here.

After playing with Xdelta, JojoDiff, Courgette, and Hdiff to compare 2 binaries as fast as possible and generate a patch as small as possible (that will be aware of target checksum not to be used blindly in the future), I have come to the conclusion that UX matters.

And… the --patch-from semantics look awkward to me, so I cannot remember them and always come back to the manual (alas, this topic is not in --help to date). My train of thought goes as follows: [appname] [do what and how] [sources to be processed (here: compared)] [output (here: difference)]. I would prefer more straightforward switches without = signs. Something like this:

$ zstd --patch -19 --chainLog 30 old.exe new.exe -o latest.zdiff
$ zstd --patch -d old.exe latest.zdiff -o new.exe
Cyan4973 commented 3 years ago

The = is optional. Both --patch-from reference and --patch-from=reference work the same way.

--patch-from requires a parameter (the file to patch from); it's not optional. The parameter is positioned right next to the command so that it's clear that it is its parameter, not an additional file to (de)compress.

zstd file1 file2 already means "compress file1 into file1.zst, then file2 into file2.zst". zstd file1 file2 -o dest means "compress file1 then file2 and concatenate their compressed streams into dest".
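As an illustration of the existing syntax described above, the following Python sketch builds the argument lists for creating and applying a patch with the current --patch-from flag. The helper names `patch_create_cmd` and `patch_apply_cmd` are hypothetical; only `--patch-from`, `-d` and `-o` come from the zstd command line discussed in this thread. The file names are placeholders.

```python
# Hypothetical helpers: they only build argument lists, they do not run zstd.


def patch_create_cmd(old, new, patch):
    """Arguments that write `patch`, a patch turning `old` into `new`.
    The reference file is attached to --patch-from, so it is not mistaken
    for an additional input file to compress."""
    return ["zstd", f"--patch-from={old}", new, "-o", patch]


def patch_apply_cmd(old, patch, new):
    """Arguments that rebuild `new` from `old` plus `patch`."""
    return ["zstd", "-d", f"--patch-from={old}", patch, "-o", new]


print(" ".join(patch_create_cmd("old.exe", "new.exe", "latest.zdiff")))
print(" ".join(patch_apply_cmd("old.exe", "latest.zdiff", "new.exe")))
```

Assuming a zstd binary is on the PATH, either list could be passed to `subprocess.run(cmd, check=True)` to actually run the command.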

tallzabby commented 2 years ago

Since interest seems to be important, thought I would mention I am very interested. Was using zstd patch-from for an upgrade program... file has just recently grown to 2GB ;). I can economize a bit to kick the can down the road (otherwise I'll need to do a more extensive revamp).

Any thoughts on the chances of this getting worked on soon? Just wondering.

PiotrSrebrny commented 11 months ago

Is there any chance to add support for patches made from files > 2GB?

Cyan4973 commented 11 months ago

We are considering it. This feature is part of our backlog list.

PiotrSrebrny commented 11 months ago

I noticed that this was on your backlog 3 years ago. Do you have an expected release for this feature?

sergeevabc commented 7 months ago

Aggrrhh, --patch-from, that from part, still drives me crazy. Relatives return home from abroad, a kitchen knife is taken from the drawer, but in the case of patching a file I struggle to understand what it has to do with from. This flag is immediately followed by a file that needs to be patched (what file, not from file).

Why introduce ambiguity when it's clearer this way:

# Create patch
$ zstd    --patch original.exe modified.exe -o patch.zdiff

# Apply patch
$ zstd -d --patch original.exe patch.zdiff  -o modified.exe
                  patch what   with what       into what

# Or even
$ zstd --patch-create ...
$ zstd --patch-apply  ...
Cyan4973 commented 7 months ago

One of the issues we had with naming this feature is to make sure users understand that the produced patch only works one way, from one source. Some users were expecting the patch to work both ways, i.e. to regenerate the source from the destination + the patch. So we want to be sure no one keeps such an expectation.

I'm fine with alternative command names that help clarity.
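The one-way property mentioned above can be illustrated with a toy delta format in Python. This is not how zstd encodes patches; it is a small sketch built on the standard library's `difflib`, with hypothetical names `make_patch` and `apply_patch`. The patch stores copy ranges into the old data plus literal bytes that appear only in the new data, so bytes that exist only in the old data are never recorded.

```python
from difflib import SequenceMatcher


def make_patch(old: bytes, new: bytes):
    """Toy patch: a list of ("copy", start, end) ranges into `old`
    and ("literal", data) chunks taken from `new`."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("literal", new[j1:j2]))
        # "delete": bytes present only in `old` are simply not recorded
    return ops


def apply_patch(old: bytes, ops) -> bytes:
    """Rebuild the new data from `old` plus the toy patch."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


old = b"hello cruel world"
new = b"hello kind world!"
patch = make_patch(old, new)
assert apply_patch(old, patch) == new
```

Because the toy patch never contains the bytes that were removed from the old data, the old data cannot be regenerated from the new data and the patch alone, which is the expectation the naming is meant to rule out.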