facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Add built-in support for bi-directional deltas in zstd --patch-from #2063

Closed RubenKelevra closed 4 years ago

RubenKelevra commented 4 years ago

Is your feature request related to a problem? Please describe.

I'd like to archive older versions of programs, much like patch does for text files. But there seems to be no good and fast solution for creating delta patches for binary files that can be applied in both directions.

Describe the solution you'd like

The ideal solution would be to specify two uncompressed files and a dictionary trained on the file type, and get back a small delta that can be applied forwards (to patch the old version into the new one) and backwards (to patch the new version back into the old one).

This way I can efficiently copy the new file to secondary storage and store it permanently: I generate the new version from the old file and then delete the old file.

Describe alternatives you've considered

There's bsdiff, which is quite slow and cannot be trained to be efficient on a specific file type, and there's Courgette, which is highly specialized and cannot be used on arbitrary binary files (as far as I understand).

Both are unable to patch in the reverse direction, which is necessary if you want to store the newest file completely rather than the oldest one. Otherwise, every access to the newest file would require applying all patches one after another.

bimbashrestha commented 4 years ago

Hi @RubenKelevra,

Thanks for submitting the issue.

We are currently in the process of optimizing zstd for producing patches by introducing a new CLI option called --patch-from=, which should be part of the next release! We have been benchmarking our results against bsdiff so far, and we are indeed going to be an order of magnitude faster, even with slower zstd levels. We don't have plans for this feature to be bidirectional (i.e. you will be able to produce the new file by applying the patch to the old file, but you won't be able to produce the old file by applying that SAME patch to the new file).

Supporting the reverse operation will be tricky since dictionary compression doesn't intrinsically support it, but I think we can flag this as a potential later refinement to the patch-from feature (maybe :)). I'll defer to @Cyan4973 on whether this is feasible/within the scope of a compression library.

In case you have not already seen it, here is the other related thread that sparked the development of this feature: https://github.com/facebook/zstd/issues/1935

I'd also like to ask you some questions about your use case. How are you planning on using this diff engine? How large are the files you are compressing? If it is possible to share, what files are they? (So far, we've been testing our patch-from on consecutive versions of popular repos on GitHub.) Is bidirectionality a core requirement for you?

bimbashrestha commented 4 years ago

Also a quick note about this:

Both are unable to patch in the reverse direction, which is necessary if you want to store the newest file completely rather than the oldest one. Otherwise, every access to the newest file would require applying all patches one after another.

The issue of having to apply consecutive patches starting from the first one can to some extent be mitigated by having patches for consecutive versions as well as patches that skip N versions (so the patch would take version-m to version-(m+N)). This can accelerate patching considerably but it is also slightly more complex.
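As a rough illustration of that (using the --patch-from syntax shown later in this thread; the file names are invented), a publisher could maintain both consecutive and skip patches:

```
# consecutive patches
zstd --patch-from=v1.bin v2.bin -o v1_to_v2.zst
zstd --patch-from=v2.bin v3.bin -o v2_to_v3.zst

# skip patch: v1 -> v3 in one step
zstd --patch-from=v1.bin v3.bin -o v1_to_v3.zst

# a client still on v1 applies the skip patch instead of walking the chain
zstd -d --patch-from=v1.bin v1_to_v3.zst -o v3.bin
```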

This article outlining zstd's usage for patching new releases of the video game League of Legends mentions this and some other techniques. Might be worth a look, @RubenKelevra: https://technology.riotgames.com/news/supercharging-data-delivery-new-league-patcher

RubenKelevra commented 4 years ago

We are currently in the process of optimizing zstd for producing patches by introducing a new CLI option called --patch-from=, which should be part of the next release!

That's great news!

Will this only work on zstd archives, or is this meant to work on two uncompressed files?

We don't have plans for this feature to be bidirectional (i.e. you will be able to produce the new file by applying the patch to the old file, but you won't be able to produce the old file by applying that SAME patch to the new file).

Well, a simple bidirectional version would probably be to concatenate each pair of uncompressed update commands (forward/backward) and select just the even ones for one direction and the odd ones for the other. Regular zstd compression should eliminate the redundancies pretty well without any complex code.

How we plan to use zstd

I'm currently discussing plans to get some compression into IPFS.

The idea is to be able to ship, for example, system updates efficiently, but since it's up to the users how they use this, we might end up with completely different binary data going through it.

How IPFS addresses and stores data (simplified)

The idea is to extend the data storage component with zstd compression. Each block will be an individual archive, which means there's a strong need for a dictionary because of the small block sizes. The plan is to ship pretrained dictionaries for MIME types, select the right dictionary, and compress all blocks at the data storage level with it.

The pretrained dictionaries will be shipped (or better, fetched on demand) from the network when a node receives a block as a zstd archive. This way all nodes agree on the same dictionaries, while we don't blow up the binary with a lot of dictionaries that are rarely used.
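As a rough sketch of that workflow with the existing zstd CLI (the sample paths and dictionary names here are invented for illustration):

```
# train a dictionary on a corpus of blocks/files of one MIME type
zstd --train samples/text-html/* -o dict-text-html --maxdict=112640

# compress an individual block with that dictionary
zstd -19 -D dict-text-html block.bin -o block.bin.zst

# decompress it again with the same dictionary
zstd -d -D dict-text-html block.bin.zst -o block.out
```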

How we plan to use diffs

How are you planning on using this diff engine?

The plans for diffs are much rougher, but it would be nice if we could update files in IPFS in a bandwidth- and storage-efficient way.

So at a high level, it would work best in IPFS like this:

We create a new object, containing the new file as a reference, the diff file as a reference, and the old file's metadata as a reference.

Doing one delta update

If a node gets a download request from the user for this new object, it can search for the old blocks in its data storage. If the old version is incomplete, it either has to receive the missing blocks from the network and then the patch, or fetch the whole new file (depending on the difference in size).

Getting an older version from the network as a delta

These version objects can be nested - so a node can move downwards from new files to older versions of a file just by applying multiple patches.

Doing delta updates over multiple versions

If a node gets an object which refers to an object that doesn't exist in its storage, but which also refers to older versions, the node can traverse the tree of references down to older versions in an effort to save some bandwidth. Since the metadata includes file sizes, the effort can be stopped either at a fixed limit or when the delta patches would be larger than downloading the new file.
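A possible shape for that stop rule, sketched as a shell snippet (the file names, patch chain, and sizes are hypothetical; in practice the sizes would come from the object metadata):

```
# sizes (in bytes) taken from the object metadata of the version tree
full_size=52428800                 # size of the wanted old version
patch_sizes="1048576 2097152"      # sizes of the backward patches on the path

total=0
for s in $patch_sizes; do
  total=$(( total + s ))
done

if [ "$total" -lt "$full_size" ]; then
  echo "fetch and apply the backward patch chain"
else
  echo "download the full old version instead"
fi
```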

How large are the files you are compressing?

I'm currently just looking at using this method for compressing software updates - so several MB, I guess. But this is just my personal use case. I think there will be people trying this on much larger files if it gets implemented.

Is zstd searching with a sliding window in both files at the same time, while compressing the data inside the window with the dictionary - or how is this going to work on larger files? :)

If it is possible to share, what files are they?

One application I'm looking into is archiving...

Here's an example of a large archive of binaries that could be pretty well delta compressed (after decompressing the tar files).

https://archive.archlinux.org/

Another example is this archive (with much much larger files).

https://planet.openstreetmap.org/planet/

Is bidirectionality a core requirement for you?

Yes.

There are two use cases which overlap: differential updates of files, and differential storage of older versions of files.

The issue of having to apply consecutive patches starting from the first one can to some extent be mitigated by having patches for consecutive versions as well as patches that skip N versions (so the patch would take version-m to version-(m+N)). This can accelerate patching considerably but it is also slightly more complex.

Yes, but the standard use case is just jumping one or two versions forward - so there's no real need to accelerate those couple of jumps, while jumping backward is necessary for archiving. The newest files are always available, and the diffs to older versions will naturally get sparser and sparser until there's no copy on the network anymore (unless somebody is holding a full copy of the history on purpose).

Thanks for the interest and the fast response! :)

RubenKelevra commented 4 years ago

Btw: my test data for the compression part was the Linux kernel source code.

I tested compression levels 3, 9, 15 and 22 with a standard dictionary size, no dictionary, and a 1 MB dictionary, against brotli, bzip2 and gzip.

I've created individual dictionaries for each compression level, since the man page explains this will optimize the dictionary for the specific compression level.

A 1 MB dictionary actually hurts the compression ratio at level 15, so using a standard-sized dictionary results in a smaller archive.
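For reference, the per-level experiment described above can be scripted roughly like this (a sketch only; the corpus path is invented, and passing a compression level to --train to tune the dictionary for that level follows the man-page note mentioned above):

```
#!/usr/bin/env bash
shopt -s globstar
for level in 3 9 15 22; do
  # train a 1 MB dictionary tuned for this compression level
  zstd --train -"$level" corpus/**/*.c -o dict-"$level" --maxdict=1048576

  # compress every file with the dictionary and add up the compressed sizes
  total=0
  for f in corpus/**/*.c; do
    total=$(( total + $(zstd -"$level" -D dict-"$level" -c "$f" | wc -c) ))
  done
  echo "level $level, 1 MB dictionary: $total compressed bytes"
done
```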

Results

The size of the dictionary was not factored in.

- zstd -22, 1 MB dictionary: 5.931x
- zstd -22, 110 KB dictionary: 5.628x
- zstd -22, no dictionary: 4.731x
- brotli: 5.249x
- bzip2: 4.595x
- gzip: 4.409x

bimbashrestha commented 4 years ago

Thanks for the response @RubenKelevra

Will this only work on zstd archives, or is this meant to work on two uncompressed files?

Maybe I misunderstand this question, but the usage of the new CLI will be:

```
# create the patch
zstd --patch-from=<old_uncompressed_file> <new_uncompressed_file> -o patch.zst

# apply the patch
zstd -d --patch-from=<old_uncompressed_file> patch.zst -o <new_uncompressed_file>
```

Well, a simple bidirectional version would probably be to concatenate each pair of uncompressed update commands (forward/backward) and select just the even ones for one direction and the odd ones for the other. Regular zstd compression should eliminate the redundancies pretty well without any complex code.

Yes, that would work. I think any approach we take here will involve concatenating two separate patches (one for the forward direction and one for the backward direction). We talked about this in our meeting last week, and the verdict we came to was that we would likely not support built-in bi-directionality, at least not for the next release, because:

  1. We're not convinced that the added convenience over just having the user of zstd maintain two separate patches on their own is worth the additional complexity.
  2. We currently have our hands full for the next release :/
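For what it's worth, the "maintain two separate patches yourself" route looks straightforward with the syntax above (file names are made up):

```
# forward patch: old -> new
zstd --patch-from=old.bin new.bin -o forward.zst

# backward patch: new -> old
zstd --patch-from=new.bin old.bin -o backward.zst

# keep both next to the newest file; restoring the previous version is then just
zstd -d --patch-from=new.bin backward.zst -o old.restored.bin
```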

I'm currently discussing plans to get some compression into IPFS.

Cool! Glad you're considering zstd:)

Thanks for the detailed report about your use case! Just some overall comments after I read through them:

Your usage of dictionary compression sounds similar to how zstd is used within Facebook so I suspect that the approach you have laid out will work very well.

As for your segment on delta patching, I see now that your use case is slightly different than the conventional one of incrementally updating existing software. You also plan to use deltas as a way to travel backwards in history to "stored" older versions of files. Unfortunately, as I mentioned, we don't expect built-in bi-directionality to be supported for 1.4.5 (you can of course still use zstd and maintain a concatenated version of a forward and backward patch yourself).

Perhaps a future release will include this feature? We'll have to gauge how much demand exists for it after we release 1.4.5. For the time being, I'm going to mark this feature request as long-term.

RubenKelevra commented 4 years ago

I removed my last comment here; I just reread it and it made no sense. I was obviously a bit tired while reading your response. I'm sorry! :)


Thanks for the response @RubenKelevra

Will this only work on zstd archives, or is this meant to work on two uncompressed files?

Maybe I misunderstand this question, but the usage of the new CLI will be:

```
# create the patch
zstd --patch-from=<old_uncompressed_file> <new_uncompressed_file> -o patch.zst

# apply the patch
zstd -d --patch-from=<old_uncompressed_file> patch.zst -o <new_uncompressed_file>
```

You understood the question right, thanks - that's actually exactly what I need! :)

As for your segment on delta patching, I see now that your use case is slightly different than the conventional one of incrementally updating existing software. You also plan to use deltas as a way to travel backward in history to "stored" older versions of files. Unfortunately, as I mentioned, we don't expect built-in bi-directionality to be supported for 1.4.5 (you can of course still use zstd and maintain a concatenated version of a forward and backward patch yourself).

Indeed, it's easy to store two separate patches, one for each direction, in one archive. But if each operation for a changed segment in a file is saved first forwards and then backwards, it's much easier to compress, which will end up saving a lot of space if there are a lot of versions. :)


And yes, traveling back in time is actually a bigger goal than having small updates, since bandwidth gets cheaper and cheaper, while storage prices seem to have pretty much plateaued over the last years.

Archiving and having a reliable way to get older versions is one of the main cost points, apart from having enough money to spend on computing power and storage - and it therefore often gets skipped.

I'd like to have this feature come to IPFS, to offer an easy solution for community-based cluster storage with a reliable, space-efficient archiving function built right into it.

So if I had to choose one direction to save delta patches, it's an easy choice: backward.

Storing forward patches means that one patch missing somewhere in between will cut you off from everything newer, while you often just want to jump some versions backward, not months or years. It also means you have to fully store the oldest version next to the newest, reducing space efficiency.

And for clearing out old versions, there's the need to apply patches - which will consume a lot of processing power - while if the patches face backward, you can just drop a few if you run out of storage space.

Perhaps a future release will include this feature? We'll have to gauge how much demand exists for it after we release 1.4.5. For the time being, I'm going to mark this feature request as long-term.

Thanks for considering!

Best regards

Ruben

bimbashrestha commented 4 years ago

I removed my last comment here; I just reread it and it made no sense. I was obviously a bit tired while reading your response. I'm sorry! :)

No need to apologize. Always appreciate feedback:)

RubenKelevra commented 4 years ago

A bit off-topic, but related to my project, so I thought I'd ask here:

I was wondering if there are any free dictionaries for different types of data that you're aware of. I couldn't find any. If not, I would go ahead and create some :)

Any recommendations on the settings? Or are the default settings ok when I create a general-purpose dictionary for a type of data?

I currently just plan to increase the size of the dictionary to something like 1 MB.

bimbashrestha commented 4 years ago

I was wondering if there are any free dictionaries for different types of data that you're aware of. I couldn't find any. If not, I would go ahead and create some :)

To my knowledge, there is not. I can see this being a very valuable thing to do though:)

Any recommendations on the settings? Or are the default settings ok when I create a general-purpose dictionary for a type of data?

I think the default dictionary trainer should do a good job. If you do go ahead and create this, it might also be a good idea to periodically (automatically, maybe?) train with new data of each type and replace the existing dictionary if the newly trained dictionary performs better on some sort of separate validation set. @felixhandte might be able to chip in here since he's done a lot of this type of work internally for us.
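One way that periodic retraining could be scripted with the zstd CLI (a sketch only; the directory layout, file names, and compression level are assumptions):

```
# train a candidate dictionary on fresh samples of one data type
zstd --train fresh-samples/* -o dict.candidate --maxdict=112640

# total compressed size of a held-out validation set under a given dictionary
size_with() {
  local dict=$1 total=0 f
  for f in validation-set/*; do
    total=$(( total + $(zstd -19 -D "$dict" -c "$f" | wc -c) ))
  done
  echo "$total"
}

current=$(size_with dict.current)
candidate=$(size_with dict.candidate)
echo "current: $current bytes, candidate: $candidate bytes"

# keep whichever dictionary compresses the validation set better
[ "$candidate" -lt "$current" ] && cp dict.candidate dict.current
```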

I currently just plan to increase the size of the dictionary to something like 1 MB.

Playing around with dictionary sizes is a good idea, although I believe for most data types the same compression ratio can be obtained using smaller dictionaries (110 KB, which is the default, for example).

bimbashrestha commented 4 years ago

Quick additional note:

Dictionaries are most useful when the data you're trying to optimize for has high redundancy across samples. In other words, the more "narrow" your data set, the better. Having general-purpose dictionaries might work if the "general" case is very similar in content (but then it might not be so general anymore).

RubenKelevra commented 4 years ago

Playing around with dictionary sizes is a good idea, although I believe for most data types the same compression ratio can be obtained using smaller dictionaries (110 KB, which is the default, for example).

I tried 1 MB and the default size on the Linux kernel corpus. With compression level 22 the 1 MB dictionary was beneficial, while levels 3, 9 and 15 were actually hurt by the larger dictionary - quite unexpected.

I created each dictionary explicitly for the compression level at which I used it afterwards - so it was somewhat strange that a larger dictionary would hurt. 🤔

The funniest thing was actually looking into the dictionary. There are extremely large snippets of source code in there. Sure, they might be copied and pasted into multiple source files by the developers, or the dictionary generator just copied large snippets into the dictionary to avoid having to store them in the archives :laughing:

The Linux archive dict: https://ipfs.io/ipfs/bafybeiblnqhkxiicwftnihtvrwwzzd5ba6g3ieejng5s2b5hkn3zhtqafi

I think the default dictionary trainer should do a good job. If you do go ahead and create this, it might also be a good idea to periodically (automatically, maybe?) train with new data of each type and replace the existing dictionary if the newly trained dictionary performs better on some sort of separate validation set. @felixhandte might be able to chip in here since he's done a lot of this type of work internally for us.

That's a good idea, but a thing I won't do.

The idea was to use small, byte-long integers to address those dictionaries and match them up against a list stored in the application. So if the user wants to store, say, some SQL files, the pretrained dictionary can be fetched from the list and used on those files.

Having different versions means we have to use larger integers for the dictionaries, and we may end up with blocks of the same format compressed with different dictionaries, which means we have to hold multiple versions on a node.

This can get pretty messy and hard to debug tbh.
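One thing that might help here: zstd dictionaries already carry a 4-byte dictionary ID, the CLI lets you pin it at training time with --dictID, and each compressed frame records the ID of the dictionary it was made with. A sketch (the MIME-type-to-ID mapping and file names are assumptions of this example):

```
# train each MIME-type dictionary with a fixed, well-known dictionary ID
zstd --train samples/application-sql/* -o dict-sql  --maxdict=112640 --dictID=17
zstd --train samples/text-html/*       -o dict-html --maxdict=112640 --dictID=18

# a block compressed with dict-sql records ID 17 in its frame header
zstd -19 -D dict-sql block.bin -o block.zst

# recent zstd versions report the dictionary ID in the verbose listing,
# so a node can tell which dictionary to fetch before decompressing
zstd -lv block.zst
```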

Quick additional note:

Dictionaries are most useful when the data you're trying to optimize for has high redundancy across samples. In other words, the more "narrow" your data set, the better. Having general-purpose dictionaries might work if the "general" case is very similar in content (but then it might not be so general anymore).

Well, I'm currently working on a dictionary for English text, which should optimize itself down (in my theory) to the most-used terms. I thought I'd push some email archives and some Wikipedia articles into the algorithm and see how the dictionary performs after that. :)

bimbashrestha commented 4 years ago

Hey @RubenKelevra,

Just wanted to inform you that we've merged --long range support for the patch-from feature into zstd dev today. Here is a brief summary: https://github.com/facebook/zstd/pull/1959#issue-364364610

Unless other developments arise, this is what will make it into the release. Would love it if you gave it a shot on your use case at some point. Hopefully the long mode will provide a nice boost in compression ratio compared to before.

Bimba

RubenKelevra commented 4 years ago

Just wanted to inform you that we've merged --long range support for the patch-from feature into zstd dev today. Here is a brief summary: https://github.com/facebook/zstd/pull/1959#issue-364364610

Unless other developments arise, this is what will make it into the release. Would love it if you gave it a shot on your use case at some point. Hopefully the long mode will provide a nice boost in compression ratio compared to before.

Hey Bimba,

thanks for the heads-up, sounds great!

I will check that on some sample data and report back, thanks!
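A minimal way to exercise the new long-mode patching might be (the window-log value and file names are just examples; for large files the --long window should cover at least the old file's size, and the same --long value has to be passed again when decompressing, or a large enough --memory):

```
# create a patch with long-distance matching enabled (window log 27 = 128 MB)
zstd -19 --long=27 --patch-from=app-v1.bin app-v2.bin -o v1_to_v2.zst

# apply it; the decoder needs the same window hint
zstd -d --long=27 --patch-from=app-v1.bin v1_to_v2.zst -o app-v2.restored.bin
```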

--

Is there any documentation about your delta patch algorithm?

I haven't looked at the code yet... but I gave the problem some thought. My guess is that a patch format for binaries would look something like this:

- offset in bits
- old length in bits
- drop flag (beginning or end)
- patch mode (replace/xor)
- patch

Since you are using the dictionary function for this, I suspect that your approach is more like a 1:1 search pattern that gets replaced by new data.

This means each new patch needs to make sure its pattern is either unique or the replacement is the same everywhere; on smaller patterns a direct comparison would be done, on larger ones a hash sum would be used.

How far off am I? :)

Best regards

Ruben

bimbashrestha commented 4 years ago

Hi @RubenKelevra,

Since you are using the dictionary function for this, I suspect that your approach is more like a 1:1 search pattern that gets replaced by new data.

If I understand this correctly, this is more akin to what is happening within patch-from.

There isn't a new algorithm introduced exactly. We've simply extended the existing LZ style match finding (which is primarily what sets different compression levels apart) a little bit to allow for better results when dealing with larger dictionaries.

I haven't looked at the code yet... but I gave the problem some thought. My guess is that a patch format for binaries would look something like this:

- offset in bits
- old length in bits
- drop flag (beginning or end)
- patch mode (replace/xor)
- patch

We are finding exact matches between the 'source' (which in this case is the new version of the file) and the 'dictionary' (the old version) and encoding them as before into the zstd compression format. The details of the representation are not unique to the patch-from feature.

Hope that answered your question. I'm happy to go into more detail about the actual match finding algorithms we have if you'd like but those are generic to regular (non patch-from) compression.

Bimba

RubenKelevra commented 4 years ago

Thanks!

I was wondering if there's a chance to speed things up even more:

IPFS has the ability to dedup blocks between different types of files; for this it uses a rolling hash algorithm.

You can select either rabin or buzhash for this task (in IPFS). Rabin is kind of slow, but buzhash is quite fast.

The rolling hash would allow 'prescanning' both files to get some cut marks, and then running some fast cryptographic hash algorithm like BLAKE2b over the chunks.

I think both operations are much cheaper than pattern matching. This way you can skip all pattern-matching attempts that fall, on both sides (A and B), inside blocks that are known to be equal.

The first layer of patching would just generate length+offset+move triples, which can copy the blocks from the original file into a sparse file as the first patching operation.

The pattern-matching rules could then be used on top of that, filling in the gaps of the output file.
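To make the prescan idea concrete, here is a very rough shell sketch using fixed-size chunks instead of a real rolling-hash (rabin/buzhash) cutpoint scan, so shifted data won't line up the way content-defined chunking would; file names are invented:

```
# cut both versions into 256 KiB chunks and hash them with BLAKE2 (b2sum)
mkdir -p chunks-old chunks-new
split -b 262144 old.bin chunks-old/c_
split -b 262144 new.bin chunks-new/c_

(cd chunks-old && b2sum c_* | awk '{print $1}' | sort -u) > old.hashes
(cd chunks-new && b2sum c_* | awk '{print $1}' | sort -u) > new.hashes

# chunk hashes that appear only in the new file: roughly the data a patch
# actually has to carry; everything else could be copied by offset
comm -13 old.hashes new.hashes | wc -l
```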

bimbashrestha commented 4 years ago

Hi @RubenKelevra, sorry for the super late response.

Yes! There are definitely ways to improve speed and ratio here, but implementing those will involve more discussion. I'm going to close this particular issue for now, as the feature has been merged.

Let's continue this discussion on https://github.com/facebook/zstd/issues/2173, which also requests an extension to --patch-from. Our issues are getting a bit cluttered, so I'm going to consolidate :)

RubenKelevra commented 4 years ago

Yeah, it got pretty off-topic, sorry about that. I've created a new ticket for it. :)