borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/
Other
11k stars 739 forks source link

new chunking algorithms #3026

Open ThomasWaldmann opened 7 years ago

ThomasWaldmann commented 7 years ago

fastCDC https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf

fastCDC are some optimizations on top of the very simple GEAR rolling hash.

old ticket about performance: https://github.com/borgbackup/borg/issues/1021 benchmark from there: https://github.com/borgbackup/borg/files/276308/chunker_bench.py.txt

ThomasWaldmann commented 7 years ago

Current master branch buzhash chunker benchmark results (i5-4200u cpu, tw's laptop):

$ python chunker_bench.py 
    length   count      dt throughput chunks_count
1000000000       1 2.586 386.728    120
 100000000      10 2.633 379.811     12
  10000000     100 2.509 398.494      2
   1000000    1000 1.261 793.168      1
    100000   10000 0.109 9203.020      1
     10000  100000 0.230 4352.816      1
      1000 1000000 1.835 545.060      1
       100 1000000 1.798 55.614      1
        10 1000000 1.793 5.576      1
         1 1000000 1.803 0.555      1
ThomasWaldmann commented 7 years ago

Current master + quick and dirty 32bit GEAR chunker benchmark results (tw's laptop):

    length   count      dt throughput chunks_count
1000000000       1 1.450 689.640    120
 100000000      10 1.499 667.227     12
  10000000     100 1.412 708.214      2
   1000000    1000 0.693 1442.994      1
    100000   10000 0.108 9257.274      1
     10000  100000 0.225 4441.152      1
      1000 1000000 1.785 560.084      1
       100 1000000 1.744 57.351      1
        10 1000000 1.751 5.711      1
         1 1000000 1.750 0.571      1
rumpelsepp commented 7 years ago

There is also rabin:

https://restic.github.io/blog/2015-09-12/restic-foundation1-cdc

Would be interesting to compare those three.

Glandos commented 11 months ago

While reading some news about plakar, it talks about UltraCDC. Unfortunately, it's behind the closed doors of IEEE (https://ieeexplore.ieee.org/document/9894295), but maybe the author of plakar can be contacted for that. He made an implementation in Go available at https://github.com/PlakarLabs/go-cdc-chunkers/tree/main/chunkers/ultracdc