As an additional note, storing chunks post-compression leaks some information on large files too. For example, imagine I store a compressible file such as a VM image. After it gets chunked and compressed, an attacker can see a sequence of blocks, each of which was compressible by a certain amount. If the attacker has a copy of the file, this ought to be sufficient to show that borg stored that large file.
The small-file fingerprinting is a somewhat known problem.
If one uses compression, the sizes no longer match the originals exactly, but of course an attacker could also try matching against all compression algorithms / levels if they have the original files.
The large file fingerprinting would not work like that.
The chunker is seeded with a random value that is part of borg's key material and different for each repo. So the attacker's and the victim's chunker will cut chunks at different places, so the sequence of chunk sizes does not give a fingerprint. Also, the overall size of all chunks of such a large file after compression is influenced by that, so that's not easy to match either.
The random chunk seed is only useful if an attacker cannot observe how a known (big) file is chunked. Additionally, chunk boundaries are low in entropy due to their small variance.
Broadly speaking, this sort of fingerprinting is difficult to defeat in an online backup system, because there are more channels to consider (like I/O timing of the source read operations leaking into the I/O timing of the network side).
Some other tools (duplicacy and casync, to name two) chunk an aggregate (basically a tar); similar issues regarding compression apply. These are additionally vulnerable to CRIME/BREACH-like attacks between files co-located in the same aggregate, if compression is used (as it is with .tar.gz & friends).
https://en.wikipedia.org/wiki/BREACH https://en.wikipedia.org/wiki/CRIME
Regarding directory order - borg traverses in inode order (since 1.1), not in native (usually hashed) directory order, nor sorted by name/mtime/whatever. 1.0 uses hashed order, which of course results in poor performance.
Off-hand I'd say neither hashed dent order nor inode order recovery are really that valuable. Inode order tells you something about the allocation order of the inodes (depending on the FS), hashed order tells you... something?
See PR ^^^.
While thinking about attacks against the random chunker seed: we could move from a static seed stored in the key to a per-backup (or even per-file) generated seed.
Update: no, that does not work. We always need to chunk in the same way (per repo) or new chunks made from same input file won't dedup against old chunks made from that file.
Documented by #3710 and #3720.
Actually, I think it's a bit worse for large files than the compression leak described above. Even without compression, if an attacker has a copy of a large file that may be in your repo, then they can test for the presence of that file with a high degree of certainty, and if the file is present they can extract information equivalent to having the chunker seed.
This is because, as commented in src/borg/_chunker.c, "XORing the entire table by a constant is equivalent to XORing the hash output with a different constant", so when we XOR the entire table by the chunker seed, it's the same as having a chunker seed of 0 but XORing the hash value with some other constant each time we test it. As by default we only test the lowest 21 bits of the hash value, only the lowest 21 bits of this other constant matter. That's important because, although there are 2^32 possible chunker seeds, many of them are equivalent to each other and there are really only 2^21 different chunking patterns.
To verify this, chunk some input using a random chunker seed and record the cutpoints, and then look at the buzhash values at those cutpoints with a chunker seed of 0. You'll see that there's a particular value that the lowest 21 bits of that 0-seed buzhash take at every cutpoint.
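Here is a minimal pure-Python sketch of that verification (not borg's actual C chunker: the table is a random stand-in, there is no min/max chunk size handling, and the mask is set below 21 bits only so a small buffer yields several cutpoints; the names are made up). It runs a rolling buzhash with a seeded and an unseeded table side by side and collects the low bits of the unseeded hash at every seeded-table cutpoint - they all come out as one constant value.

```python
import os
import random

WINDOW = 4095             # buzhash window size, as in borg
MASK_BITS = 16            # borg defaults to 21; smaller here so ~1 MiB of input yields several cutpoints
MASK = (1 << MASK_BITS) - 1

rng = random.Random(0)
BASE_TABLE = [rng.getrandbits(32) for _ in range(256)]   # stand-in for borg's fixed table
seed = random.getrandbits(32)
SEEDED_TABLE = [t ^ seed for t in BASE_TABLE]            # borg XORs the whole table with the chunker seed

def rol32(x, n):
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def cutpoint_residues(data):
    h_seeded = h_base = 0
    residues = set()
    for i, b in enumerate(data):
        if i >= WINDOW:
            out = data[i - WINDOW]
            # drop the byte leaving the window, rotate, then mix in the new byte
            h_seeded = rol32(h_seeded ^ rol32(SEEDED_TABLE[out], WINDOW - 1), 1) ^ SEEDED_TABLE[b]
            h_base = rol32(h_base ^ rol32(BASE_TABLE[out], WINDOW - 1), 1) ^ BASE_TABLE[b]
        else:
            h_seeded = rol32(h_seeded, 1) ^ SEEDED_TABLE[b]
            h_base = rol32(h_base, 1) ^ BASE_TABLE[b]
        if i >= WINDOW - 1 and (h_seeded & MASK) == 0:   # a cutpoint of the seeded chunker
            residues.add(h_base & MASK)                   # low bits of the 0-seed hash at that point
    return residues

# Expect a set with exactly one element: the constant the seed collapses to.
print(cutpoint_residues(os.urandom(1 << 20)))
```

With the full 21-bit mask the argument is identical; you just need a few more MiB of input to hit enough cutpoints.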
2^21 is a small enough number that it's feasible to build a database of how a particular file would map to compressed chunk sizes for each of the 2^21 possible values. Armed with that database, it's very quick to inspect a borg repo to determine whether that file is present, and to get the lowest 21 bits of the transformed chunker seed if so. I've tested this with a couple of ~10M files; it works quite well.
Given that the small files information leak can't be fixed, maybe there's not much point worrying about this leak too much. Perhaps it isn't worthwhile to improve the situation at the cost of making the chunker slower.
What about jumbling up the buzhash table when applying the seed, using some other key material that's kept alongside the chunker seed to determine which entry moves to which slot? This would add some code complexity but has no chunker speed penalty.
Because buzhash never uses an input byte as anything other than an index into the table, and it never uses anything other than an input byte as an index into the table, shuffling the table is equivalent to applying a byte-to-byte translation to the input before buzhashing. The buzhash is still a buzhash. I can't think of any way this would make any attack easier, even if the attacker somehow determines the mapping.
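A minimal sketch of what such a keyed table permutation could look like (purely illustrative - the key name, the HMAC-based derivation and the use of random.Random are assumptions on my part, not what an actual implementation would necessarily do):

```python
import hashlib
import hmac
import random

def permuted_table(base_table, perm_key: bytes):
    # Derive a byte-to-slot permutation from key material kept alongside the chunker
    # seed and reorder the 256-entry buzhash table accordingly. This is equivalent to
    # translating each input byte before hashing, so the hash itself stays a buzhash.
    digest = hmac.new(perm_key, b"buzhash-table-permutation", hashlib.sha256).digest()
    perm = list(range(256))
    random.Random(digest).shuffle(perm)   # a real implementation would use a keyed CSPRNG here
    return [base_table[perm[b]] for b in range(256)]
```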
Well, while that would be not a breaking change, it could still blow up things for some people, by doubling the repo size when that change in chunking is introduced. And that big size would stay as long as old archives made with the old chunking method are present.
I was thinking that the change should only apply to newly created repos, when a new chunker seed is being generated anyway. As this change doesn't address the small files info leak, it's hard to justify the disruption of imposing it on existing repos.
Anyone with an existing repo who's very concerned about information leaks in large files in particular can choose to init a new repo and make future backups to that, and either delete the old repo after a while or bundle it up using tar and gnupg.
I had a go at implementing a permutation of the buzhash table: https://github.com/borgbackup/borg/pull/4921
However, there's a limit to how resistant buzhash can be no matter what changes are made to the table constants. The buzhash operations are all linear, so an attacker could treat all 8192 bits of the table as unknowns for Gaussian elimination, with each cutpoint on known data giving 21 equations. They'd need about 390 known cutpoints to be able to solve for the entire table.
Without this change: an attacker needs 1 known cutpoint to recover the chunker parameters.
With this change: the attacker can definitely still recover the chunker parameters with 390 known cutpoints. There may be another attack that needs fewer cutpoints, but if it's binary data that's being chunked, then cracking it with 80 known cutpoints is a bound on what's possible, because the chunker permutation carries only about 80*21 bits of information. On the other hand, if it's ASCII data using only 64 distinct byte values, then the attacker only needs a quarter of the buzhash table and the bound is more like 25 cutpoints.
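A quick back-of-the-envelope check of those numbers (assuming a 21-bit mask and a 256-entry table of 32-bit words, as above):

```python
import math

EQUATIONS_PER_CUTPOINT = 21                      # bits of the hash fixed by each known cutpoint

table_bits = 256 * 32                            # unknowns if the whole table is treated as free bits
perm_bits = math.log2(math.factorial(256))       # entropy of a permutation of a known table
ascii_bits = perm_bits - math.log2(math.factorial(256 - 64))  # only 64 distinct byte values occur

print(table_bits / EQUATIONS_PER_CUTPOINT)       # ~390 cutpoints to solve for the full table
print(perm_bits / EQUATIONS_PER_CUTPOINT)        # ~80 cutpoints: bound for recovering the permutation
print(ascii_bits / EQUATIONS_PER_CUTPOINT)       # ~24 cutpoints when only a quarter of the table is touched
```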
There is another change that would help: rather than always cutting the chunk at the start of the 4095 byte buzhash window, compute a keyed digest of the window after every buzhash match, and use the low 12 bits of the digest to determine the offset into the window of the cutpoint.
That effectively takes back 12 of the 21 bits of information that an attacker gets with each known cutpoint. I think it would have fairly low cost because borg already computes a digest of each chunk and it's only an extra digest of 4095 bytes out of every 2M.
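A hypothetical sketch of that idea (nothing borg implements; the key and function names are made up):

```python
import hashlib
import hmac

def cut_offset(window_bytes: bytes, offset_key: bytes) -> int:
    # After a buzhash match, pick the actual cutpoint offset inside the 4095-byte window
    # from a keyed digest of that window, so an observed cut position no longer hands an
    # attacker who knows the data the full 21 bits of hash state.
    digest = hmac.new(offset_key, window_bytes, hashlib.sha256).digest()
    low12 = int.from_bytes(digest[:2], "big") & 0x0FFF   # the low 12 bits mentioned above
    return low12 % len(window_bytes)                     # map into a valid offset within the window
```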
just had an idea to solve this with a size-obfuscating meta-compressor, please review #5510.
note that #5510 (obfuscate all chunk sizes, even for small files if there is only 1 chunk) solves a slightly different issue than #5070 (make chunk sizes less predictable for bigger files resulting in multiple chunks).
note: older borg could not "decompress" chunks made with this new compressor. But as it is opt-in via -C obfuscate,..., that should be rather obvious. As the change isn't very intrusive, it could even be backported to 1.1-maint.
If an attacker could cause the same set of small files to be encoded repeatedly by messing with the repo between runs, then they could narrow down the actual compressed size by doing that a few times. A repeatable pseudorandom padding seeded with something the attacker doesn't know would resist that.
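For illustration, repeatable padding could look roughly like this (a sketch only, not what #5510 specifies; the chunk id / key parameters are assumptions):

```python
import hashlib
import hmac

def padded_size(chunk_id: bytes, pad_key: bytes, compressed_size: int) -> int:
    # Derive the pad length from a keyed PRF over the chunk id: re-storing the same
    # chunk always yields the same padded size, so uploads repeated at an attacker's
    # prompting reveal nothing beyond what a single upload already reveals.
    prf = hmac.new(pad_key, chunk_id, hashlib.sha256).digest()
    r = int.from_bytes(prf[:8], "big") / 2**64           # deterministic value in [0, 1)
    return compressed_size + int(r * compressed_size)    # e.g. pad by up to 100% of the size
```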
I suggest also processing each directory in a pseudorandom order when in obfuscate mode, if that isn't already done.
Even with this obfuscation, a large set of small files is going to have a fairly distinctive fingerprint. Making the max padding level exponential in a pseudorandom function might help to complicate any attack, for example a padding of up to 50% of file size for 20% of files, up to 100% of file size for 10% of files, up to 200% for 5% of files and so on.
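Just to illustrate the shape of such a scheme (using the example percentages above plus an assumed ~60% of files that get no extra padding; not a proposed spec):

```python
def pad_ceiling_fraction(r: float) -> float:
    # r is a keyed-PRF value, uniform in [0, 1), as in the padding sketch further up.
    # ~60% of files: no extra padding; 20%: padded by up to 0.5x the file size;
    # 10%: up to 1x; 5%: up to 2x; and so on, doubling the ceiling as the share halves.
    if r < 0.6:
        return 0.0
    ceiling, share, upper = 0.5, 0.2, 0.8
    while r >= upper and ceiling < 64:   # cap keeps the loop finite for r very close to 1
        ceiling *= 2.0
        share /= 2.0
        upper += share
    return ceiling
```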
I think with #5510 this issue is fixed.
The (optional) obfuscation now has 2 modes:
If used, it can obfuscate to various degrees:
If there is a need, more modes/levels could be added later via a different SPEC value.
So, concerning the issues found here, I think this PR is enough to solve them IF one is willing to accept quite some space overhead for it (so that the original [compressed] chunk size basically disappears in random size noise).
It is clear that:
Addressing directory order would need a change at a different place in the code (and soon likely even at 2 places, once borg can also be fed a list of filenames to back up instead of just doing its builtin recursion) and would make borg slower. And, as we've found above, it is maybe not an issue anyway, because we use inode order (not alphabetical order). Even if it somehow were an issue, the same fingerprinting would also work via the size leak, and that can be addressed just by adding enough size noise.
There is still #5070 improving the chunker, not sure how to proceed with that.
I guess it would be useful:
But:
@ncleaton I appreciate all the work you've put into this, just thinking about what makes most sense.
@ncleaton @nadalle @enkore any feedback?
guess this is fixed, see above.
After talking to Thomas Waldmann on IRC, I'm opening a ticket mostly to document this issue.
Currently, borg's model for storing chunks in the repository leaks information about the size and (potentially) directory structure of small files. Size and directory structure information can often be used to determine information about the content of the files themselves, through various fingerprint / watermark attacks.
Files in borg are chunked, compressed, and then the chunks are stored in the repository using PUT commands. These PUT commands necessarily contain the size of the chunk being stored, and additionally the segment files in the repository contain these sizes in easily parsed form.
For small files, the size of the chunks prior to compression will typically just be the size of the file itself. Additionally, a great many modern file formats are not compressible, and in any event an attacker can simply compress the data too. In this way, the chunk sizes can leak information about the size of small files.
Additionally, chunks will tend to be stored in temporal order in their associated segment files. Since the order in which borg processes the files will be some directory ordering, the temporal order in the repository should typically match a directory sort order, which leaks additional information.
As a hypothetical example of how size and directory structure information can be used as a data fingerprint, consider the common method of fingerprinting CDs based on their track lengths. Even if a CD was encoded to mp3, it's typically possible to convert file size back to track length by assuming a standard bitrate.
It should then be evident that a database of CD track lengths is sufficient to show that a directory of encrypted files contains the CD tracks with high confidence, just by comparing the ratios of the file sizes to the ratios of track lengths. This can be done without reading any of the data, just the file sizes.
Many other more advanced sorts of watermarking attacks are possible.
There are various potential ways to solve this information leak. A few ideas:
Regardless of whether this is fixed any time soon, it seemed worth documenting better. The current documentation states "[...] all your files/dirs data and metadata are stored in their encrypted form into the repository" so a leak of file size and directory order information seems notable.