X-Ryl669 / Frost

A backup program that does deduplication, compression, encryption
MIT License

Dedupe appears to not work as expected? #14

Closed: bdb closed this issue 6 years ago

bdb commented 6 years ago

I might be missing something, but Frost appears to not use a rolling hash.

I set up the following test on macOS 10.12 with Frost reporting 'Current version: 2 build 568': I backed up a file of about 64KB as revision 1, then inserted 11 bytes ('hello world') near the start of the file and backed it up again as revision 2.

This resulted in the following chunk list for revision 1 (according to --dump):

  Chunk 0 with UID: 1 and offset 0
  Chunk 1 with UID: 2 and offset 6182
  Chunk 2 with UID: 3 and offset 10932
  Chunk 3 with UID: 4 and offset 17940
  Chunk 4 with UID: 5 and offset 25022
  Chunk 5 with UID: 6 and offset 27012
  Chunk 6 with UID: 7 and offset 28985
  Chunk 7 with UID: 8 and offset 33570
  Chunk 8 with UID: 9 and offset 36464
  Chunk 9 with UID: 10 and offset 39457
  Chunk 10 with UID: 11 and offset 44362
  Chunk 11 with UID: 12 and offset 46341
  Chunk 12 with UID: 13 and offset 48572
  Chunk 13 with UID: 14 and offset 51622
  Chunk 14 with UID: 15 and offset 54318
  Chunk 15 with UID: 16 and offset 60003
  Chunk 16 with UID: 17 and offset 66212

For revision 2, instead of creating one or two new chunks, Frost created new chunks for the whole file again (see the second chunk list below).

Notice the offsets are completely different. I expected the first few to differ, but after that they should have settled to offsets exactly 11 bytes away from the originals, and the original chunk UIDs should have been reused.

Frost also reports both revisions at basically the same backup size, instead of revision 2 being much smaller:

Revision 2 happened on Sun, 05 Nov 2017 23:43:26 GMT, linked 1 files and 1 directories, cumulative size 64.8KB (backup is 25.8KB, saved 61%)
...
Revision 1 happened on Sun, 05 Nov 2017 23:34:55 GMT, linked 1 files and 1 directories, cumulative size 64.8KB (backup is 25.7KB, saved 61%)

As I said, I might be missing something when interpreting the dump output, so could you explain what's going on? For reference, here is the chunk list for revision 2:

  Chunk 0 with UID: 18 and offset 0
  Chunk 1 with UID: 19 and offset 3015
  Chunk 2 with UID: 20 and offset 5133
  Chunk 3 with UID: 21 and offset 7616
  Chunk 4 with UID: 22 and offset 10086
  Chunk 5 with UID: 23 and offset 16141
  Chunk 6 with UID: 24 and offset 18143
  Chunk 7 with UID: 25 and offset 22442
  Chunk 8 with UID: 26 and offset 24875
  Chunk 9 with UID: 27 and offset 27383
  Chunk 10 with UID: 28 and offset 30116
  Chunk 11 with UID: 29 and offset 36142
  Chunk 12 with UID: 30 and offset 38412
  Chunk 13 with UID: 31 and offset 40672
  Chunk 14 with UID: 32 and offset 45051
  Chunk 15 with UID: 33 and offset 49120
  Chunk 16 with UID: 34 and offset 51485
  Chunk 17 with UID: 35 and offset 55326
  Chunk 18 with UID: 36 and offset 59357
  Chunk 19 with UID: 37 and offset 63352
  Chunk 20 with UID: 38 and offset 66506

Thanks

X-Ryl669 commented 6 years ago

Thank you for the detailed report.

Frost uses a rolling checksum based on Adler32 to split data into chunks, with the following restrictions:

  1. The minimum chunk size is fixed (to avoid chunks of 11 bytes, as in your example).
  2. The maximum chunk size is also fixed (to avoid huge chunks that never satisfy the division rule for chunking, see below).
  3. The chunker uses two divisors to decide whether it is time to split off a chunk or not.

I've implemented the algorithm described in this document: http://www.philippheckel.com/files/syncany-heckel-thesis.pdf. The chunker is called Two Thresholds, Two Divisors (TTTD).

It runs a rolling checksum over the input data with two hash algorithms, one being Adler32 and the other SHA1 (which is only used once a chunk's split position has been found). Once it has received enough data (1856 bytes by default), it starts testing whether the Adler32 rolling checksum is divisible by the low-threshold divisor (that is, a divisor that matches many values). If that division succeeds, it remembers the position as a backup split point. It also tests divisibility by the high-threshold divisor (which matches far less often). If that succeeds while still within the allowed chunk size range, a chunk is emitted; otherwise it falls back to the low divisor's position.

You can have a look at TTTDChunker.cpp for the algorithm; it's very easy to follow.
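
To illustrate the idea, here is a rough sketch of such a boundary search (illustration only, not the actual TTTDChunker.cpp code; the divisor values, the maximum size and the checksum update are simplified placeholders):

  #include <cstdint>
  #include <cstddef>

  // Sketch of a TTTD-style boundary search: returns the length of the next
  // chunk starting at data[0]. Divisors and sizes are illustrative values.
  size_t findChunkBoundary(const uint8_t* data, size_t len)
  {
      const size_t minSize = 1856;       // minimum chunk size (default quoted above)
      const size_t maxSize = 8 * 1856;   // maximum chunk size (placeholder)
      const uint32_t lowDiv  = 1024;     // "easy" divisor: matches often
      const uint32_t highDiv = 4096;     // "hard" divisor: matches rarely
      const uint32_t MOD = 65521;

      uint32_t a = 1, b = 0;             // running Adler-32 state
      size_t backup = 0;                 // last position where the low divisor matched

      for (size_t i = 0; i < len; ++i)
      {
          a = (a + data[i]) % MOD;       // fold in the new byte
          b = (b + a) % MOD;
          uint32_t sum = (b << 16) | a;

          if (i + 1 < minSize) continue;             // never split below the minimum size
          if (sum % lowDiv == 0) backup = i + 1;     // remember a fallback split point
          if (sum % highDiv == 0) return i + 1;      // preferred split point: emit the chunk
          if (i + 1 >= maxSize)                      // forced split: use the fallback if any
              return backup ? backup : maxSize;
      }
      return len;                        // end of input: emit whatever remains
  }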

Now, to answer why it did not find a "common pattern" in your example: I guess it's because the file is too small for the algorithm to find a common split point. It only accepts a minimum chunk of 1856 bytes, so it cannot give you an 11-byte chunk and split right after the insertion. Even after 1856 bytes, the algorithm must "forget" the 11 bytes of inserted "pollution" before it can recover to a viable divisor match. Remember that you are folding a huge input into a 32-bit number; depending on the starting point, it might never find a match (in the worst case).

Try with a huge file (at least a few MB) and you'll see it does recover after some thousands of bytes. This is one of the application's test suite cases: it generates a big random file, backs it up, inserts and modifies some bytes (but none past the midpoint of the file), and backs it up again. It usually recovers and avoids backing up the complete file twice.

The reason I made it this way is the compromise between deduplication performance and index size. Allowing small chunks really explodes the index's size, and then performance degrades quickly when backing up big folders.

If you change the minimum size (in the TTTDChunker options) to, let's say, 10 bytes, then it will split right after "hello world" and work as you expect, but it will not perform well on a large file base, since a huge number of chunks will be generated and the index will be searched for each new chunk found. Typically, dividing the minimum chunk size by 10 multiplies the number of chunks in the index by 10, and multiplies the per-lookup processing time by about log(10) ≈ 2.3.

So in your example, with a TTTD minimum size of 10 instead of 1856 (roughly 185x smaller), you'd have ~200x more chunks in your index and a ~6x longer backup procedure. I'm not sure that makes sense.

bdb commented 6 years ago

But if you are rolling a checksum window of, say, 64 bytes across the whole input, then the chunks should eventually match again even with a small change in the file, because that small window decides the break points, not the chunk size. (You would just not attempt to break until your minimum chunk size is reached.)

Plus I don't see where your Adler window is. It appears to be a straight Adler32 implementation which is not rolling.

This was from a quick look at the file mentioned, so again, maybe I'm missing something.

Thanks for your time.

X-Ryl669 commented 6 years ago

I'm not sure we are speaking of the same thing. A rolling checksum is an algorithm that adds one byte at a time to the checksum, updating it as it goes, but it does not remove the previously summed bytes.

That's what is used here; it's well defined for Adler32 (check Adler32.cpp for the implementation).

It's used because you don't need to recompute the complete checksum of a fixed window for every byte, as you seem to suggest, so it's very fast.

In this scheme, the rolling checksum of a string at position p is the same as the plain checksum of midstring(0, p). It is not the checksum of midstring(p - 64, p) or of any other window size.

Adding one byte to the (rolling) checksum is fast (unlike a typical hash function), and you get the checksum's value immediately (also unlike a typical hash function). Remember that this must stay O(n), since it runs over the entire backup data set.
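
In rough pseudo-C++, the update described above looks like this (a simplified sketch, not the actual Adler32.cpp code):

  #include <cstdint>

  // Running (non-windowed) Adler-32 update: every byte is folded into the
  // state and nothing is ever removed, so the value at position p is the
  // checksum of all bytes from 0 to p.
  struct RunningAdler32
  {
      static constexpr uint32_t MOD = 65521;
      uint32_t a = 1, b = 0;

      void append(uint8_t byte)
      {
          a = (a + byte) % MOD;   // add the new byte
          b = (b + a) % MOD;      // accumulate; no byte is ever subtracted
      }
      uint32_t value() const { return (b << 16) | a; }
  };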

If you had a 64-byte window, then any repeated 64-byte sequence would produce the same checksum. However, that would trigger computing SHA1 over the preceding data, and (most of the time) the chunk's hash would fail to match a previous one, so you'd have lost a lot of time rejecting the candidate. Computing SHA1 is expensive; you can't do that for every 64 bytes of the backup data set.

That being said, your idea makes sense for a different algorithm that would use a bank of rolling checksum instances (each shifted by some fixed number of bytes) to find out whether any of them hits the division threshold, stopping at the first one that does. That would handle a "one byte" insertion case (up to the size of the bank), but it would increase memory consumption considerably (at least the number of instances required times Adler32's state size). It seems unrealistic to keep 1856 instances, for example. I'll try to figure out whether it makes sense and how to test an implementation of your idea.

X-Ryl669 commented 6 years ago

OK, I can try changing this in a new branch with some (smaller) window. The sum being easy to compute, this would only add about four additions (actually subtractions) per byte. I've found this: https://software.intel.com/en-us/articles/fast-computation-of-adler32-checksums; I might implement it if the simple version turns out to be too slow.
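
As a rough sketch (illustrative only, not the branch's actual code; the window handling and names are placeholders), the windowed update would look like this:

  #include <cstdint>
  #include <cstddef>
  #include <vector>

  // Windowed, truly rolling Adler-style checksum: when a new byte enters the
  // window, the byte that falls out of it is subtracted, so the value always
  // covers exactly the last `window` bytes (the first `window` bytes just
  // warm up the zero-initialized buffer).
  class RollingAdler32
  {
  public:
      explicit RollingAdler32(size_t window) : win(window, 0), size(window) {}

      void roll(uint8_t in)
      {
          uint8_t out = win[pos];      // byte leaving the window
          win[pos] = in;               // byte entering the window
          pos = (pos + 1) % size;

          a = (a + in + MOD - out) % MOD;                     // a += in - out
          b = (b + a + MOD - (1 + (size * out) % MOD)) % MOD; // b += a - size*out - 1
      }
      uint32_t value() const { return (b << 16) | a; }

  private:
      static constexpr uint32_t MOD = 65521;
      std::vector<uint8_t> win;
      size_t size, pos = 0;
      uint32_t a = 1, b = 0;
  };

Each roll() costs only a handful of additions and subtractions per byte, in line with the estimate above.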

That will change the interface of TTTDChunker (to add an option for the rolling checksum's memory/window size) and the implementation behind Adler32's interface. It will break compatibility with previous backup sets that did not use the same option (a new, complete backup set will be required).

I'll let you know when I'm done.

bdb commented 6 years ago

OK, I think we are talking about two separate things with the same term. Plain Adler is a running hash, not a rolling one. To make it rolling you need to add a sliding window; Bup does this.

And this explains the poor dedupe I was seeing. With a true rolling hash, only one (at most two) chunks would have been new instead of all 17.

Also, the rolling hash doesn't change what you SHA; it just determines your chunk break points. So there's essentially no speed penalty.

As an aside, SHA1 is pretty damn fast on modern machines, and so is SHA2; both can be hardware accelerated.

Thanks for looking into this.

X-Ryl669 commented 6 years ago

OK, I've fixed this locally, and I'm also integrating new features for an upcoming release.