borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

Deduping with prefix/suffix of chunks #2412

Open edanaher opened 7 years ago

edanaher commented 7 years ago

Summary

In a perfect world, a backup solution would be able to dedup arbitrarily sized similar segments, but that's not practical. However, it occurs to me that it should be possible to store a hash of the first and last segment (~1k?) of each chunk, and if one of those matches a chunk that isn't otherwise deduplicated, we may be able to share the prefix or suffix with existing chunks.

Motivation

I'm looking at using borg to save all my hourly snapshots of my home directory from btrfs (since btrfs doesn't handle very large numbers of snapshots well, and borg provides much better compression). I believe that a significant fraction of the diffs are changes to browser database files (history, cookies, etc.), which I would guess are mostly small updates in the middle of large files. So storing the entire chunk from scratch will be fairly wasteful, since only a small part of it changed.

Even for larger changes, if they only change an isolated subset of a file (e.g., VM images), on average half of a chunk will be unchanged yet re-stored for each change. This seems like an opportunity to save some space.

Quick testing on my dataset bears this out; I tried chunker-params 15,19,17,4095 (so ~512k blocks instead of ~2M blocks). I get a ~3% larger initial backup (~65GB), presumably since lzma compression works much better with larger chunk sizes. However, most of the diffs are significantly smaller, roughly 30MB vs 40MB, presumably because most of the changed chunks were mostly unchanged, so I only had to re-backup 512k instead of 2MB for each change. (For a fairly ideal comparison, btrfs send | xz is giving compressed deltas of roughly 20MB, so there's still room for improvement.)

So I'd like to keep the benefits of larger chunks (better compression, less overhead) while not paying quite so high a penalty for sparse changes.

Potential implementation

I've only been playing with borg for a few days and haven't fully thought through the implications, so this may be completely impractical. But here's my first pass at how to do this (hopefully) without breaking the properties that make borg work so well.

For each chunk, in addition to its hash, store its prefix and suffix hashes (fairly small segments, but big enough to avoid spurious collisions; I'd guess ~1k). Now if we add a chunk that doesn't match an existing chunk, check if its prefix or suffix matches the prefix or suffix of an existing chunk. If so, compare the chunks to see how long the match is, and split the new chunk into portions pointing at the pre-existing prefix and/or suffix as well as the new data in the middle.

For example, assume we only hash four-byte prefixes and suffixes (and that chunk splitting also only depends on four-byte rolling hashes), and we already have the following chunks:

"Hello world"
"I'm sorry, Dave"
"ithub.com lastvisit=2017-04-10 goog"

Then adding "Hello Dave" would prefix-match "Hello world" and suffix-match "I'm sorry, Dave", not requiring any new chunks to be written.

Adding "Hello Evan" would prefix-match "Hello " and add a new chunk "Evan"

Adding "ithub.com lastvisit=2017-04-11 goog" would prefix and suffix match on the existing github entry, only requiring writing "1". This is the sort of case where I hope to get big wins on my dataset, and I'd expect to also do well on other common cases like VM images.

Discussion

AFAICT, the biggest obstacle to implementing this is the notion of pointing to a subset of a chunk; it looks like borg currently assumes that chunks are atomic, so data structures would have to be changed to add pointers within the chunk. I don't know if this could be backwards compatible.

This would triple the space required for hashes; I don't know how reasonable that is. Though since we're going to be comparing data to see how long the match is, collisions are only a performance issue, not a correctness issue, so smaller hashes could be used, reducing the space requirements.

It could also result in fragmentation, as the new matching chunks could be smaller than the smallest normal size, and could actually result in worse compression over time as larger chunks aren't matched. Though if the chunks don't change at all, they shouldn't be rewritten, and if they do change, I suspect the deltas would still end up a net win.

This fragmentation could also end up hurting compression, though it seems like only pathological cases would end up hurting more than the extra deduplication saves.

Suffix matches could also hurt performance by up to a factor of 2, since we'd have to decompress the entire chunk to read the suffix, in addition to the work to read the rest of the chunk. Prefix matches should be fairly performance-neutral, though.

It would also slow down backups, since there would be extra hash checks and comparisons to find match lengths. But given that my incremental backups are about 2 minutes (most of which is scanning the filesystem for changes), I think that's a reasonable price.

Thoughts?

I'm curious whether this seems remotely feasible/practical; if so, I'd likely be willing to take a shot at implementing it and seeing how much of a win it actually is.

I'm especially worried given that I can't find any existing system doing this, and I'm guessing there's a good reason. But I didn't come up with any dealbreakers in the process of writing this up, so I'm hoping that it's a possibility...

ThomasWaldmann commented 7 years ago

Interesting idea, but I think this is maybe adding a lot of complexity for little win.

Also, it is somewhat specialized to the case of a shared prefix/suffix (hoping the change is in the middle of a chunk and not at the chunk boundaries), while just using smaller chunks would address the general case without needing extra code.

About smaller chunks: we gave up using them by default (attic had 64kiB target chunk size, but regularly suffered from excessive memory usage by the chunks index, so we increased the default to 2MiB).

As your solution would triple the chunk index size, there is a similar memory usage problem (there's a formula about resource usage in the docs). You have to consider that some people are using this with many terabytes of data, and not all of them have a huge amount of RAM for borg (potentially many GBs). The worst case is often people using borg on NAS devices, which often have little RAM but many TBs of disk.
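
As a rough back-of-the-envelope illustration of that concern (the per-entry byte counts here are assumptions for the sake of the estimate; the real figures are in the resource usage formula in the docs):

```python
# Assumed, illustrative numbers -- not borg's exact index entry sizes.
data_size = 4 * 2**40        # 4 TiB of unique data
target_chunk = 2 * 2**20     # ~2 MiB default target chunk size
entry_bytes = 40             # assumed bytes per chunks-index entry
affix_entry_bytes = 40       # assumed bytes per prefix/suffix index entry

n_chunks = data_size // target_chunk
base_index = n_chunks * entry_bytes
with_affixes = n_chunks * (entry_bytes + 2 * affix_entry_bytes)

print(f"chunks:              {n_chunks:,}")
print(f"index today:         ~{base_index / 2**20:.0f} MiB")
print(f"with prefix/suffix:  ~{with_affixes / 2**20:.0f} MiB")
# With a 64 KiB target chunk size (the old attic default), the same data
# would need ~32x as many index entries, which is the memory problem that
# led to the 2 MiB default.
```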

enkore commented 7 years ago

I don't think it's worth it, because it essentially boils down to a sort of diff/patch algorithm (that only considers prefixes/suffixes and stores differing segments, not deltas). Most of the memory usage of Borg goes into places where hashes are stored, so doubling/tripling that isn't really an option.

Finding prefixes/suffixes does not sound efficient (algorithm-wise). [This is likely the reason why there is nothing commercial using this approach; global dedup using CAS-over-hashes means that chunks are context-free, which clashes heavily with the context-sensitive notion of generating or even just checking any sort of delta].

There is a commercial backup program that is mainly delta-based (librsync iirc), but can also use deduplication over a limited hash table and retains entries in the table by a MFA (most frequently accessed/referenced) policy.

There is also the big question of whether that's actually worth the effort. Common offenders here are SQLite databases that are frequently vacuumed by Firefox & friends, but these are small, a few dozen MB. Not worth sinking hundreds of developer hours into, IMHO.

edanaher commented 7 years ago

Thanks for the feedback!

First, on the more philosophical description of this as a diff/patch algorithm, yes, it kind of is, but arguably so is the whole buzhash/CAS scheme, and I see this as a relatively minor extension of it. The rolling hash works great for localizing changes in large files to a single chunk, so that data before and after the change can be deduplicated. However, with large chunk sizes and small changes, the localization gets fuzzy. My proposal is the minimum change I can think of to get some of the localization benefit back.

And it doesn't necessarily depend on changes being in the middle of files, just being sparse - if a contiguous 1kB segment changes in a 2MB chunk, then the remaining 1.9MB will all be split between the prefix and suffix; it may be 1MB and 0.9MB or 0.1MB and 1.8MB, but either way it all gets deduped. Yes, if there are two 1kB changes at the start and end of the chunk, the middle section won't get deduped. But on average, with two changes, about half the chunk will still be deduped, which may or may not be helpful.

And I think this is probably most useful (only useful?) for frequent backups - I'm doing hourly. If I wait longer between backups, there would presumably be more changes that would overall be denser. But I'm not positive here; going back to VM snapshots, I can imagine things like temp files ending up being fairly sparse even over long periods of time, depending on the filesystem used.

To address some narrower points:

This may well be "a lot of complexity for little win", and if so, it's not worth having. I suspect the tradeoff may come out better than you think, but then again, I tend to be overly optimistic about this sort of thing, and my use case is basically optimal for it. The only way to be sure is to implement at least a prototype.

Yes, using smaller chunks gives smaller deltas, but at the cost of worse overall compression (at least with lzma) and more memory usage. Maybe (at least for some use cases), this would give similar deduping with larger chunk sizes, reducing overall memory usage (assuming files are generally large enough to split).

I don't see how finding prefixes/suffixes sounds inefficient; it's another hash comparison. Storing them certainly takes more memory, but aside from that it's hashing a small amount of data and doing a table lookup.

And while SQLite databases may be "a few dozen MB", I have a couple of Firefox profiles for various purposes and am doing hourly snapshots, so 40MB diffs (what I'm seeing) turn into over 300GB/year. Disk is cheap enough that that's not actually a problem, but the idealist optimizer in me is sad about the potentially wasted space, and the engineer in me is intrigued enough to think that this might be worthwhile.

And finally, I would assume this would be optional unless it's clearly a win; I see this as similar to the old notail option on reiserfs: a flag to get that last bit of space utilization at a reasonable cost in runtime. If you don't have the RAM or it's not good for your dataset, don't use it. But if it does help in some cases (and isn't too complex), I'd like to see it as an option for when it is appropriate.

And now I'm wondering how to benchmark this; obviously if it helps for me, that's useful, but how generally useful is it? It seems like backup testing tends to be very ad hoc. I know some sites use real-world traces to benchmark hard drives; I wonder if one of those could be adapted for backups (e.g., take a snapshot every 10k/1M/100M operations and see what the backup sizes are).

edanaher commented 7 years ago

It took a bit, but I finally got around to implementing and testing a proof-of-concept for this at https://github.com/edanaher/borg/tree/poc-prefix-suffix-matching. On my particular dataset, 41 days' worth of hourly snapshots (984 total), this dropped the total usage (according to du) from 102G to 97G; roughly 5%.

That's in that awkward middle ground where I'm not going to push hard for this, but it does certainly provide some value. Also, this is a first PoC with some low-hanging fruit remaining; e.g., there's no lower bound on matches, so things like the magic bytes that identify filetypes will be deduped, despite the fact that the pointer block (containing hash, start, and end) is probably longer than the match. Also, it's not recursive, so if three blocks share a prefix, and two of them also share a longer prefix, the longer prefix may not be used.
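
For illustration, a sub-chunk reference of the kind described above might look roughly like this; a hypothetical sketch, not the actual structures in the linked branch:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubChunkRef:
    """Points at a byte range inside an already-stored chunk."""
    chunk_id: bytes  # id/hash of the chunk that already holds the data
    start: int       # offset of the shared bytes within that chunk
    end: int         # end offset (exclusive)

    def __len__(self) -> int:
        return self.end - self.start


# A file would then reference a mix of whole chunks and SubChunkRefs.
# A match shorter than the reference itself (an id plus two offsets) is a
# net loss, which is why a lower bound on match length would help.
```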

And this is just my dataset with default chunking parameters, which as I noted is probably a better-than-average use for this.

However, before I do more experiments, I'd like to speed up the process; right now, just statting all 1M files to see if they changed takes >2min for me, while btrfs send/recv can give the list basically instantaneously, and the actual changes are <100MB among ~3k files. So I'll probably take a break from this, and if I have time for borg, spend it trying to figure out a reasonable way to tie that in (I've seen some other related issues, but don't have them handy now.)

enkore commented 7 years ago

However, before I do more experiments, [...] right now, just statting all 1M files to see if they changed takes >2min for me

While Borg is (from a purely technical perspective) of course rather slow when it comes to file/s throughput compared to what would technically be possible on the same hardware, no userspace tool can approach the speed of the built-in tools of a hash-tree FS:

Conceptually, computing the difference between two snapshots only has to compare branch pointers in the FS, which tends to be ~log(n_files) or less, while anything using POSIX APIs has to check individual files. Also notice how the branch ptr compare already includes everything about a tree, including file metadata. Tools based on public APIs need to do much more work to capture these.

Thus, zfs/btrfs send/recv are algorithmically more efficient (~O(log n) vs ~O(n)) and can further optimize their operation based on FS internals. When these advantages matter and their disadvantages do not apply, then they are a very good choice.
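
A toy illustration of that asymptotic difference (not btrfs/zfs internals, just the pruning idea): identical subtrees are skipped by comparing a single hash, while a POSIX-level tool has to stat every file.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    hash: str                 # hash covering this file's or subtree's data + metadata
    children: dict = field(default_factory=dict)  # name -> Node; empty for files


def changed_paths(old, new, path=""):
    """Yield paths that differ between two snapshots of a hash tree."""
    if old is not None and old.hash == new.hash:
        return                # whole subtree identical: prune it with one compare
    if not new.children:
        yield path            # changed or new file
        return
    for name, child in new.children.items():
        old_child = old.children.get(name) if old is not None else None
        yield from changed_paths(old_child, child, f"{path}/{name}")
```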

klausenbusk commented 7 years ago

And while SQLite databases may be "a few dozen MB", I have a couple of Firefox profiles for various purposes and am doing hourly snapshots, so 40MB diffs (what I'm seeing) turn into over 300GB/year. Disk is cheap enough that that's not actually a problem, but the idealist optimizer in me is sad about the potentially wasted space, and the engineer in me is intrigued enough to think that this might be worthwhile.

I did an experiment where I dumped the sqlite file to sql files; the result isn't that bad:

Deduplicated size:

23.62 MB (with --compression none)
5.10 MB (with --compression auto,lzma)

50.94 MB (with --compression none)
13.03 MB (with --compression auto,lzma)

For the first two lines, the sqlite file was dumped with sqlite3 "foo.sqlite" .dump > "foo.sqlite.sql"

Test script is here: https://gist.github.com/klausenbusk/1b3d93ef617d9a9df3bc7707aff10b31
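
For reference, the dump step itself can also be done from Python's standard library; this is just an illustration of the preprocessing idea, not the linked test script:

```python
import sqlite3


def dump_to_sql(db_path: str, sql_path: str) -> None:
    """Write an SQL text dump of a SQLite database, like `sqlite3 db .dump`."""
    con = sqlite3.connect(db_path)
    try:
        with open(sql_path, "w", encoding="utf-8") as out:
            for statement in con.iterdump():  # yields the dump's SQL statements
                out.write(statement + "\n")
    finally:
        con.close()


# e.g. dump_to_sql("cookies.sqlite", "cookies.sqlite.sql") before running
# borg create, so the chunker sees stable SQL text rather than the binary
# page layout.
```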

sophie-h commented 3 years ago

Is there a chance that borg could use different chunker params for some files (which files, and whether auto or manual, to be discussed)? If so, is it worth opening a different issue to discuss it there?

The parameters mentioned by @edanaher seem like a sweet spot for SQLite files. The size reduction is in the range of 2.5x to 3.5x for me. Other settings don't give me significantly more deduplication.

ThomasWaldmann commented 3 years ago

We once had a manual compression setup in master, but later removed it in favour of the much simpler to use "auto" compression mode.

So I guess we don't want some difficult-to-configure chunker setup either. It is already possible to give the chunker / chunker params per borg invocation; the master branch has the buzhash chunker and a fixed chunker to choose from.

@sophie-h if you have an idea how to solve this in an easy-to-use way and the existing methods do not suffice, please open a new ticket.