sophie-h opened this issue 3 years ago
17 mask bits mean a 128 kiB target chunk size.
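For a quick sanity check (assuming the usual rolling-hash cut condition, where a chunk boundary is emitted when the low `mask_bits` bits of the hash are all zero, so the expected chunk size is `2**mask_bits` bytes):

```python
# Expected chunk size for mask_bits = 17, assuming a boundary is cut
# whenever the low `mask_bits` bits of the rolling hash are all zero
# (assumption about the cut condition; the arithmetic is the point here).
mask_bits = 17
target_size = 1 << mask_bits        # 2**17 = 131072 bytes
print(target_size // 1024, "kiB")   # -> 128 kiB
```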
This is a specific case (SQLite files) of a more general idea: "content-type specific chunkers".
A slight issue with content-based file type detection:
That's because the chunker is written in C?
The buzhash chunker is C, yes (with a Cython wrapper); the fixed-size chunker is Cython or Python. But what I meant is more of a code-structure issue than an implementation-language one.
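A minimal sketch of what such a dispatch could look like (hypothetical; the mapping, the `detected_type` argument, and the function name are illustrative, not borg's actual design):

```python
# Hypothetical per-content-type chunker dispatch. Parameter tuples follow
# borg's CLI order: (algo, min_exp, max_exp, mask_bits, window_size).
CHUNKER_PARAMS = {
    "sqlite": ("buzhash", 15, 19, 17, 4095),  # finer chunks for databases
}
DEFAULT_PARAMS = ("buzhash", 19, 23, 21, 4095)  # borg's documented defaults

def pick_chunker_params(detected_type: str):
    # Fall back to the defaults for any content type we have no entry for.
    return CHUNKER_PARAMS.get(detected_type, DEFAULT_PARAMS)
```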
Discovered this report via Google and just wanted to add a datapoint: I'm managing a friend's server backups for the game "Vintage Story," which are (bafflingly, imo) written as a SQLite3 database. I tested briefly with two server backups taken an hour apart. The bulk of the data is "chunks" of the world that are spawned when a player "discovers" them for the first time, and there was no in-game activity in this period.
With default chunker params and `-C auto,zstd,10`:

| Backup # | Original | Compressed | Deduplicated |
|---|---|---|---|
| 1 | 12.45 GB | 4.02 GB | 4.02 GB |
| 2 | 12.45 GB | 4.02 GB | 518.34 MB |
With `-C auto,zstd,10 --chunker-params buzhash,15,19,17,4095`:

| Backup # | Original | Compressed | Deduplicated |
|---|---|---|---|
| 1 | 12.45 GB | 4.15 GB | 4.15 GB |
| 2 | 12.45 GB | 4.15 GB | 62.15 MB |
And finally with `-C auto,zstd,10 --chunker-params buzhash,10,23,16,4095` (from the docs for "fine-grained" deduplication):

| Backup # | Original | Compressed | Deduplicated |
|---|---|---|---|
| 1 | 12.45 GB | 4.17 GB | 4.17 GB |
| 2 | 12.45 GB | 4.17 GB | 37.26 MB |
So: a huge improvement. Nearly an order of magnitude with the params from this thread, and another factor of roughly two with the fine-grained params.
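In numbers (simple ratios computed from the tables above; my arithmetic):

```python
# Deduplicated size of the second backup under each chunker setting,
# taken from the tables above (values in MB).
default_params = 518.34   # default chunker params
thread_params  = 62.15    # buzhash,15,19,17,4095
fine_params    = 37.26    # buzhash,10,23,16,4095

print(default_params / thread_params)  # ~8.3x, "nearly an order of magnitude"
print(thread_params / fine_params)     # ~1.7x, "another factor of two"
```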
If nothing else, I just wanted to say thank you for these optimized parameters. I had considered dumping to SQL files, but the process is prohibitively slow for files of this size, and I didn't expect I could make such a huge difference until I saw this discussion.
Last edit to add: dedup with the default chunker on SQL dumps of these files was down to a few hundred kB, but at 3-4 hours to dump each file, it's not really practical for the space savings.
As already mentioned in #2412, SQLite files seem to be the main pain point for frequent backups in terms of subpar deduplication.
The alternative setting proposed by @edanaher (`--chunker-params buzhash,15,19,17,4095`, used in the tests above) seems to cut incremental backup size to roughly a third for SQLite files with few changes. Compression algorithms do not seem to come close to this.
SQLite files can experience frequent small changes on a typical desktop setup because they are used by many common applications (browsers, email clients, chat clients, and the like).
SQLite files are relatively easy to detect, since the file begins with the magic string `SQLite format 3`. An exception is SQLCipher databases (used by Signal's desktop chat client), which could instead be detected by their `.sqlite` file extension.
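A minimal detection sketch along those lines (hypothetical helper; the `.sqlite` extension fallback for SQLCipher comes from this discussion and is not an implemented borg check):

```python
import os

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of a SQLite 3 file

def looks_like_sqlite(path: str) -> bool:
    # Content-based detection via the magic header.
    with open(path, "rb") as f:
        if f.read(16).startswith(SQLITE_MAGIC):
            return True
    # SQLCipher databases are encrypted, so the magic is not readable;
    # fall back to the file extension as suggested above.
    return os.path.splitext(path)[1].lower() == ".sqlite"
```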
**Pros**

**Cons**