attic-labs / noms

The versioned, forkable, syncable database
Apache License 2.0
7.44k stars 266 forks source link

Pick a hash function for blob chunking #158

Closed arv closed 8 years ago

arv commented 9 years ago

Right now we are using buzhash. The common contender is adler32 which is used by rsync and bup.

aboodman commented 9 years ago

Other contenders (according to wikipedia) are Rabin-Karp, Cyclic Polynomial, and moving average.

I think we can pick this objectively, by measuring stddev and distribution.

arv commented 9 years ago

Buzhash = Cyclic Polynomial

aboodman commented 9 years ago

oh

arv commented 9 years ago

I implemented Rabin-Karp.

https://github.com/attic-labs/noms/compare/master...arv:chunk-demo

I'll post some numbers

arv commented 9 years ago

Rabin-Karp

Random Stream

Size: 10000000, Count: 1234, Avg: 8104, StdDev: 6868

Large Movie

Size: 988546604, Count: 120167, Avg: 8226, StdDev: 7053

Jpegs

Size: 6704232, Count: 813, Avg: 8236, StdDev: 6645 Size: 5700295, Count: 664, Avg: 8591, StdDev: 7240 Size: 66201, Count: 11, Avg: 5092, StdDev: 4562

Pdfs

Size: 510232, Count: 74, Avg: 6947, StdDev: 5437 Size: 1576498, Count: 177, Avg: 8945, StdDev: 7139 Size: 154628, Count: 20, Avg: 8108, StdDev: 4818 Size: 776142, Count: 92, Avg: 8500, StdDev: 7522

BuzHash

Random stream

Size: 10000000, Count: 1152, Avg: 8683, StdDev: 7318

Large Movie

Size: 988546604, Count: 120797, Avg: 8183, StdDev: 6988

Jpegs

Size: 6704232, Count: 812, Avg: 8262, StdDev: 6634 Size: 5700295, Count: 658, Avg: 8666, StdDev: 7700 Size: 66201, Count: 9, Avg: 7746, StdDev: 5914

Pdfs

Size: 510232, Count: 61, Avg: 7953, StdDev: 6592 Size: 1576498, Count: 203, Avg: 7776, StdDev: 7396 Size: 154628, Count: 11, Avg: 14473, StdDev: 11474 Size: 776142, Count: 98, Avg: 7999, StdDev: 6846

Adler32

Random Stream

Size: 10000000, Count: 1023, Avg: 9783, StdDev: 8916

Large Movie

Size: 988546604, Count: 108134, Avg: 9141, StdDev: 7862

Jpegs

Size: 6704232, Count: 734, Avg: 9123, StdDev: 7706 Size: 5700295, Count: 598, Avg: 9538, StdDev: 8291 Size: 66201, Count: 8, Avg: 9397, StdDev: 8303

Pdfs

Size: 510232, Count: 67, Avg: 7566, StdDev: 6569 Size: 1576498, Count: 182, Avg: 8670, StdDev: 7402 Size: 154628, Count: 15, Avg: 10866, StdDev: 8289 Size: 776142, Count: 92, Avg: 8379, StdDev: 6634

arv commented 9 years ago

Added Adler32 data too

arv commented 9 years ago

I updated the numbers above. I realized that it is not fair to include the last chunk in the average or the std deviation.

aboodman commented 9 years ago

Oh bother. I made this with the old number: https://docs.google.com/spreadsheets/d/10xAm19qPFC1yGkZg6GADvDubdgmtQg3ykUp4I1gS8pI/edit#gid=0

aboodman commented 9 years ago

I'll update the spreadsheet later. Or if you output the data in CSV, I might do something crazy with noms to generate the chart :-)

arv commented 9 years ago

I updated my branch to generate a csv file

go build chunking.go && ./chunking ~/Downloads/ > chunking.csv

https://github.com/attic-labs/noms/compare/master...arv:chunk-demo#diff-4bf97fbc03196b08152fed643bb965fc

arv commented 9 years ago

The random stream is the same for all hash functions now.

ghost commented 8 years ago

We have effectively chosen BuzHash. I'm not sure there's anything actionable in this bug anymore. Presumably at some point this question will come up again and we can refer back to this bug, but for now, I'm closing it.