sophie-h opened this issue 3 years ago
17 mask bits mean a 128 kiB target chunk size.
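For a quick sanity check (assuming the usual rolling-hash cut condition, where a chunk boundary is emitted when the low `mask_bits` bits of the hash are all zero, so the expected chunk size is `2**mask_bits` bytes):

```python
# Expected chunk size for mask_bits = 17, assuming a boundary is cut
# whenever the low `mask_bits` bits of the rolling hash are all zero
# (assumption about the cut condition; the arithmetic is the point here).
mask_bits = 17
target_size = 1 << mask_bits        # 2**17 = 131072 bytes
print(target_size // 1024, "kiB")   # -> 128 kiB
```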
This is a specific case (SQLite files) of a more general idea: "content-type specific chunkers".
A slight issue with content-based file type detection:
That's because the chunker is written in C?
The buzhash chunker is C, yes (with a Cython wrapper); the fixed-size chunker is Cython or Python. But what I meant is more of a code-structure issue than an implementation-language one.
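A minimal sketch of what such a dispatch could look like (hypothetical; the mapping, the `detected_type` argument, and the function name are illustrative, not borg's actual design):

```python
# Hypothetical per-content-type chunker dispatch. Parameter tuples follow
# borg's CLI order: (algo, min_exp, max_exp, mask_bits, window_size).
CHUNKER_PARAMS = {
    "sqlite": ("buzhash", 15, 19, 17, 4095),  # finer chunks for databases
}
DEFAULT_PARAMS = ("buzhash", 19, 23, 21, 4095)  # borg's documented defaults

def pick_chunker_params(detected_type: str):
    # Fall back to the defaults for any content type we have no entry for.
    return CHUNKER_PARAMS.get(detected_type, DEFAULT_PARAMS)
```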
Discovered this report via Google and just wanted to add a datapoint: I'm managing a friend's server backups for the game "Vintage Story," which are (bafflingly, imo) written as a SQLite3 database. I tested briefly with two server backups taken an hour apart. The bulk of the data is "chunks" of the world that are spawned when a player "discovers" them for the first time, and there was no in-game activity in this period.
With default chunker params and `-C auto,zstd,10`:

| Backup # | Original | Compressed | Deduplicated |
|---|---|---|---|
| 1 | 12.45 GB | 4.02 GB | 4.02 GB |
| 2 | 12.45 GB | 4.02 GB | 518.34 MB |
With `-C auto,zstd,10 --chunker-params buzhash,15,19,17,4095`:

| Backup # | Original | Compressed | Deduplicated |
|---|---|---|---|
| 1 | 12.45 GB | 4.15 GB | 4.15 GB |
| 2 | 12.45 GB | 4.15 GB | 62.15 MB |
And finally with `-C auto,zstd,10 --chunker-params buzhash,10,23,16,4095` (from the docs for "fine-grained" deduplication):

| Backup # | Original | Compressed | Deduplicated |
|---|---|---|---|
| 1 | 12.45 GB | 4.17 GB | 4.17 GB |
| 2 | 12.45 GB | 4.17 GB | 37.26 MB |
So: a huge improvement. Nearly an order of magnitude with the params from this thread, and another factor of roughly two with the fine-grained params.
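In numbers (simple ratios computed from the tables above; my arithmetic):

```python
# Deduplicated size of the second backup under each chunker setting,
# taken from the tables above (values in MB).
default_params = 518.34   # default chunker params
thread_params  = 62.15    # buzhash,15,19,17,4095
fine_params    = 37.26    # buzhash,10,23,16,4095

print(default_params / thread_params)  # ~8.3x, "nearly an order of magnitude"
print(thread_params / fine_params)     # ~1.7x, "another factor of two"
```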
If nothing else, I just wanted to say thank you for these optimized parameters. I had considered dumping to SQL files, but the process is prohibitively slow for files of this size, and I didn't expect I could make such a huge difference until I saw this discussion.
Last edit to add: dedup with the default chunker on SQL dumps of these files was down to a few hundred kB, but at 3-4 hours to dump each file, it's not really practical for the space savings.
As already mentioned in #2412, SQLite files seem to be the main pain point for frequent backups in terms of subpar deduplication.
The alternative setting proposed by @edanaher (`--chunker-params buzhash,15,19,17,4095`, used in the tests above) seems to cut incremental backup size to roughly a third for SQLite files with few changes. Compression algorithms do not seem to come close to this.
SQLite files can experience frequent small changes on a typical desktop setup because they are used by many common applications (browsers, email clients, chat clients, and the like).
SQLite files are relatively easy to detect, since the file begins with the magic string `SQLite format 3`. An exception is SQLCipher databases (used by Signal's desktop chat client), which could instead be detected by their `.sqlite` file extension.
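A minimal detection sketch along those lines (hypothetical helper; the `.sqlite` extension fallback for SQLCipher comes from this discussion and is not an implemented borg check):

```python
import os

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of a SQLite 3 file

def looks_like_sqlite(path: str) -> bool:
    # Content-based detection via the magic header.
    with open(path, "rb") as f:
        if f.read(16).startswith(SQLITE_MAGIC):
            return True
    # SQLCipher databases are encrypted, so the magic is not readable;
    # fall back to the file extension as suggested above.
    return os.path.splitext(path)[1].lower() == ".sqlite"
```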
**Pros**

**Cons**