arv closed this 8 years ago
Other contenders (according to Wikipedia) are Rabin-Karp, Cyclic Polynomial, and moving average.
I think we can pick this objectively, by measuring the stddev and the distribution of chunk sizes.
Buzhash = Cyclic Polynomial
oh
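For readers who have not met the name, here is a minimal sketch of a cyclic polynomial (Buzhash-style) rolling hash. It is not the noms implementation: the window size, the seed, and the names `window`, `table`, `hashWindow` and `roll` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math/bits"
	"math/rand"
)

// window is a hypothetical window size in bytes, chosen for illustration.
const window = 16

// table maps each byte value to a pseudo-random 32-bit word.
// A fixed seed (an assumption) keeps the sketch reproducible.
var table = func() [256]uint32 {
	var t [256]uint32
	r := rand.New(rand.NewSource(1))
	for i := range t {
		t[i] = r.Uint32()
	}
	return t
}()

// hashWindow computes the cyclic polynomial hash of w from scratch:
// each step rotates the running hash left by one bit and XORs in
// the table word for the next byte.
func hashWindow(w []byte) uint32 {
	var h uint32
	for _, b := range w {
		h = bits.RotateLeft32(h, 1) ^ table[b]
	}
	return h
}

// roll slides the window forward by one byte in constant time:
// out is the byte leaving the window, in is the byte entering it.
func roll(h uint32, out, in byte) uint32 {
	return bits.RotateLeft32(h, 1) ^ bits.RotateLeft32(table[out], window) ^ table[in]
}

func main() {
	data := []byte("the quick brown fox jumps over the lazy dog, twice over")
	h := hashWindow(data[:window])
	for i := window; i < len(data); i++ {
		h = roll(h, data[i-window], data[i])
		// The rolled value must match a hash recomputed from scratch.
		if h != hashWindow(data[i-window+1:i+1]) {
			panic("rolling hash diverged from direct hash")
		}
	}
	fmt.Println("rolling and direct hashes agree")
}
```

The point of the rotate-and-XOR form is that removing the oldest byte needs only one extra rotation, so each step costs the same regardless of window size.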
I implemented Rabin-Karp.
https://github.com/attic-labs/noms/compare/master...arv:chunk-demo
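For comparison, a Rabin-Karp rolling hash can be sketched roughly as below. This is not the code in the linked branch: the base, the window size, and the names `rkWindow`, `base`, `topPower`, `hashWindow` and `roll` are assumptions, and the arithmetic simply wraps modulo 2^32 through `uint32` overflow rather than using a prime modulus.

```go
package main

import "fmt"

const (
	rkWindow        = 16  // hypothetical window size in bytes
	base     uint32 = 257 // hypothetical polynomial base
)

// topPower returns base^(n-1) modulo 2^32, the weight of the oldest
// byte in a window of n bytes.
func topPower(n int) uint32 {
	p := uint32(1)
	for i := 0; i < n-1; i++ {
		p *= base
	}
	return p
}

// hashWindow evaluates the window as a polynomial in base (Horner's rule).
func hashWindow(w []byte) uint32 {
	var h uint32
	for _, b := range w {
		h = h*base + uint32(b)
	}
	return h
}

// roll removes the oldest byte, shifts the rest up by one power of
// base, and appends the new byte.
func roll(h uint32, out, in byte, top uint32) uint32 {
	return (h-uint32(out)*top)*base + uint32(in)
}

func main() {
	data := []byte("the quick brown fox jumps over the lazy dog, twice over")
	top := topPower(rkWindow)
	h := hashWindow(data[:rkWindow])
	for i := rkWindow; i < len(data); i++ {
		h = roll(h, data[i-rkWindow], data[i], top)
		// The rolled value must match a hash recomputed from scratch.
		if h != hashWindow(data[i-rkWindow+1:i+1]) {
			panic("rolling hash diverged from direct hash")
		}
	}
	fmt.Println("rolling and direct hashes agree")
}
```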
I'll post some numbers
Size: 10000000, Count: 1234, Avg: 8104, StdDev: 6868
Size: 988546604, Count: 120167, Avg: 8226, StdDev: 7053
Size: 6704232, Count: 813, Avg: 8236, StdDev: 6645
Size: 5700295, Count: 664, Avg: 8591, StdDev: 7240
Size: 66201, Count: 11, Avg: 5092, StdDev: 4562
Size: 510232, Count: 74, Avg: 6947, StdDev: 5437
Size: 1576498, Count: 177, Avg: 8945, StdDev: 7139
Size: 154628, Count: 20, Avg: 8108, StdDev: 4818
Size: 776142, Count: 92, Avg: 8500, StdDev: 7522
Size: 10000000, Count: 1152, Avg: 8683, StdDev: 7318
Size: 988546604, Count: 120797, Avg: 8183, StdDev: 6988
Size: 6704232, Count: 812, Avg: 8262, StdDev: 6634
Size: 5700295, Count: 658, Avg: 8666, StdDev: 7700
Size: 66201, Count: 9, Avg: 7746, StdDev: 5914
Size: 510232, Count: 61, Avg: 7953, StdDev: 6592
Size: 1576498, Count: 203, Avg: 7776, StdDev: 7396
Size: 154628, Count: 11, Avg: 14473, StdDev: 11474
Size: 776142, Count: 98, Avg: 7999, StdDev: 6846
Size: 10000000, Count: 1023, Avg: 9783, StdDev: 8916
Size: 988546604, Count: 108134, Avg: 9141, StdDev: 7862
Size: 6704232, Count: 734, Avg: 9123, StdDev: 7706
Size: 5700295, Count: 598, Avg: 9538, StdDev: 8291
Size: 66201, Count: 8, Avg: 9397, StdDev: 8303
Size: 510232, Count: 67, Avg: 7566, StdDev: 6569
Size: 1576498, Count: 182, Avg: 8670, StdDev: 7402
Size: 154628, Count: 15, Avg: 10866, StdDev: 8289
Size: 776142, Count: 92, Avg: 8379, StdDev: 6634
Added Adler32 data too
I updated the numbers above. I realized that it is not fair to include the last chunk in the average or the std deviation.
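To make the "skip the last chunk" point concrete, here is a rough sketch of how the average and standard deviation could be computed. It is not taken from the branch: the function name `stats` is hypothetical, and using the population standard deviation (dividing by N rather than N-1) is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// stats returns the mean and population standard deviation of the
// chunk sizes, ignoring the final chunk. The final chunk ends because
// the input ran out, not because the hash found a boundary, so its
// size would skew both figures.
func stats(sizes []int) (avg, stddev float64) {
	if len(sizes) < 2 {
		return 0, 0 // nothing left once the last chunk is dropped
	}
	full := sizes[:len(sizes)-1]

	var sum float64
	for _, s := range full {
		sum += float64(s)
	}
	avg = sum / float64(len(full))

	var sq float64
	for _, s := range full {
		d := float64(s) - avg
		sq += d * d
	}
	stddev = math.Sqrt(sq / float64(len(full)))
	return avg, stddev
}

func main() {
	// Made-up sizes: the trailing 100-byte chunk is the truncated one.
	sizes := []int{8000, 12000, 4000, 100}
	avg, sd := stats(sizes)
	fmt.Printf("Count: %d, Avg: %.0f, StdDev: %.0f\n", len(sizes), avg, sd)
}
```

With the made-up sizes above, the mean is taken over the first three chunks only, so the short trailing chunk does not drag it down.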
Oh bother. I made this with the old numbers: https://docs.google.com/spreadsheets/d/10xAm19qPFC1yGkZg6GADvDubdgmtQg3ykUp4I1gS8pI/edit#gid=0
I'll update the spreadsheet later. Or if you output the data in CSV, I might do something crazy with noms to generate the chart :-)
I updated my branch to generate a csv file
go build chunking.go && ./chunking ~/Downloads/ > chunking.csv
The random stream is the same for all hash functions now.
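One way to get an identical random stream for every hash function is to generate it from a fixed seed. The sketch below shows the idea only; the helper name `randomStream`, the seed, and the length are assumptions, not what the branch actually does.

```go
package main

import (
	"bytes"
	"fmt"
	"math/rand"
)

// randomStream returns n pseudo-random bytes derived from seed.
// The same seed always yields the same bytes, so every hash function
// under test can be fed identical input.
func randomStream(seed int64, n int) []byte {
	r := rand.New(rand.NewSource(seed))
	buf := make([]byte, n)
	for i := range buf {
		buf[i] = byte(r.Intn(256))
	}
	return buf
}

func main() {
	const seed, n = 42, 1 << 16 // hypothetical seed and length
	a := randomStream(seed, n)  // input for the first hash function
	b := randomStream(seed, n)  // input for the second hash function
	fmt.Println("streams identical:", bytes.Equal(a, b))
}
```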
We have effectively chosen BuzHash. I'm not sure there's anything actionable in this bug anymore. Presumably at some point this question will come up again and we can refer back to this bug, but for now, I'm closing it.
Right now we are using buzhash. The common contender is adler32, which is used by rsync and bup.