Saving/reusing zstd dictionary

colinxs commented 3 years ago

zstd supports creating a dictionary from a set of files, which can then be used to speed up compression and improve the compression ratio on subsequent compressions. Looking at the code in token.c and match.c (and please correct me if I'm wrong here), it appears that a new dictionary is used for each file. Naively, it seems like that dictionary could be shared across all files in the tree. Based on some benchmarking with a large set of TOML files, the performance increase from using a dictionary is significant.
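To make the idea concrete, here is a rough sketch of why a shared dictionary helps with many small, similar files. It is not rsync or zstd code: it uses zlib's preset-dictionary support (`zdict`) from the Python standard library as a stand-in for a trained zstd dictionary, and the TOML contents, the file names, and the choice of "first file as dictionary" are all made up for illustration.

```python
import zlib

def compress(data: bytes, zdict: bytes = b"") -> bytes:
    """Deflate `data`, optionally priming the compressor with a preset dictionary."""
    comp = zlib.compressobj(level=6, zdict=zdict) if zdict else zlib.compressobj(level=6)
    return comp.compress(data) + comp.flush()

def decompress(blob: bytes, zdict: bytes = b"") -> bytes:
    """Inverse of compress(); the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()

# Hypothetical tree of small, similar TOML files.
files = {
    f"service{i}.toml": (
        f'[package]\nname = "service{i}"\nversion = "0.{i}.0"\nedition = "2021"\n\n'
        '[dependencies]\nserde = { version = "1.0", features = ["derive"] }\n'
        'tokio = { version = "1", features = ["full"] }\n'
    ).encode()
    for i in range(20)
}

# Stand-in for dictionary training: reuse one representative file as the dictionary.
names = sorted(files)
shared_dict = files[names[0]]
rest = names[1:]

# Per-file compression with no shared state vs. every file primed with the same dictionary.
total_independent = sum(len(compress(files[n])) for n in rest)
total_shared = sum(len(compress(files[n], shared_dict)) for n in rest)

# Round-trip check: the receiver needs the identical dictionary.
roundtrip_ok = all(
    decompress(compress(files[n], shared_dict), shared_dict) == files[n] for n in rest
)

print("independent:", total_independent, "bytes")
print("shared dict:", total_shared, "bytes")
```

With real zstd the dictionary would be trained from samples (for example with the `zstd --train` CLI mode) rather than picked by hand, but the effect being measured is the same: each small file no longer has to re-learn the common structure from scratch.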

Here is a discussion of this idea: https://unix.stackexchange.com/questions/553111/is-the-rsync-block-compression-dictionary-reset-for-each-file

As an extension to reusing the dictionary across files within a single call to rsync, a user could (optionally) provide an external dictionary or reuse one that rsync generates (similar to Batch Mode and --write-batch/--read-batch).

Both of these things would help significantly for something like real-time sync (as lsyncd, which is basically inotify + rsync, does) where I'm continuously rsyncing a file tree across a network using compression.

chadbrewbaker commented 2 years ago

@Cyan4973 anything you are using internally that could be upstreamed to rsync?

Even better than small dictionaries: rsync has entire files you could use as references, like reference frames in MPEG-2. https://en.wikipedia.org/wiki/Reference_frame_(video)

An encoder that is aware of the Postgres/MySQL database block format would definitely be useful.

Cyan4973 commented 2 years ago

There is the --patch-from mode that could be used for that, but it's currently limited to < 2 GB for both the reference and the data to compress.
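For readers unfamiliar with the idea, the sketch below illustrates the general "compress against a reference file" approach in Python. It is only an analogy, not how zstd implements `--patch-from`: it passes the old version of a file to zlib as a preset dictionary, and zlib can only look back about 32 KiB, so the sample data is kept deliberately small. The function names and the sample data are hypothetical.

```python
import hashlib
import zlib

def make_patch(reference: bytes, new: bytes) -> bytes:
    """Compress `new` using `reference` as a preset dictionary (a crude patch)."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(new) + comp.flush()

def apply_patch(reference: bytes, patch: bytes) -> bytes:
    """Rebuild the new version; requires the exact same reference bytes."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(patch) + decomp.flush()

# Hypothetical "old file": 300 lines of hex digests, which compress poorly on their own.
lines = [hashlib.sha256(str(i).encode()).hexdigest().encode() + b"\n" for i in range(300)]
old_version = b"".join(lines)

# "New file": the same content with a single line changed.
lines[150] = hashlib.sha256(b"changed").hexdigest().encode() + b"\n"
new_version = b"".join(lines)

patch = make_patch(old_version, new_version)
plain = zlib.compress(new_version, 9)
restored = apply_patch(old_version, patch)

print("new file:        ", len(new_version), "bytes")
print("plain compressed:", len(plain), "bytes")
print("patch vs old:    ", len(patch), "bytes")
```

The receiver already holding the old version is what makes this attractive for rsync-like workloads: only the patch has to cross the network, and the reference never does.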

RsyncProject / rsync

Saving/reusing zstd dictionary #187