Open ThomasWaldmann opened 7 years ago
Current master branch buzhash chunker benchmark results (i5-4200u cpu, tw's laptop):
$ python chunker_bench.py
length count dt throughput chunks_count
1000000000 1 2.586 386.728 120
100000000 10 2.633 379.811 12
10000000 100 2.509 398.494 2
1000000 1000 1.261 793.168 1
100000 10000 0.109 9203.020 1
10000 100000 0.230 4352.816 1
1000 1000000 1.835 545.060 1
100 1000000 1.798 55.614 1
10 1000000 1.793 5.576 1
1 1000000 1.803 0.555 1
Current master + quick and dirty 32bit GEAR chunker benchmark results (tw's laptop):
length count dt throughput chunks_count
1000000000 1 1.450 689.640 120
100000000 10 1.499 667.227 12
10000000 100 1.412 708.214 2
1000000 1000 0.693 1442.994 1
100000 10000 0.108 9257.274 1
10000 100000 0.225 4441.152 1
1000 1000000 1.785 560.084 1
100 1000000 1.744 57.351 1
10 1000000 1.751 5.711 1
1 1000000 1.750 0.571 1
There is also rabin:
https://restic.github.io/blog/2015-09-12/restic-foundation1-cdc
Would be interesting to compare those three.
While reading some news about plakar, it talks about UltraCDC. Unfortunately, it's behind the closed doors of IEEE (https://ieeexplore.ieee.org/document/9894295), but maybe the author of plakar can be contacted for that. He made an implementation in Go available at https://github.com/PlakarLabs/go-cdc-chunkers/tree/main/chunkers/ultracdc
fastCDC https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf
fastCDC are some optimizations on top of the very simple GEAR rolling hash.
old ticket about performance: https://github.com/borgbackup/borg/issues/1021 benchmark from there: https://github.com/borgbackup/borg/files/276308/chunker_bench.py.txt