efficient / cuckoofilter


Cuckoo filter constructors overestimate insert success probability #9

Open jbapple-cloudera opened 7 years ago

jbapple-cloudera commented 7 years ago

Making tables sized to hold one item per 12.5 bits often fails. For instance, the constructor builds a 3 MiB filter when asked to hold 2,013,265 items, but this is too ambitious and inserts often fail. In my experience, 16,106,127 is an even more error-prone constructor argument if I expect to actually call Add() that many times.
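For reference, the 3 MiB figure follows from the table geometry. A minimal sketch of the arithmetic (assuming the usual layout of 12-bit fingerprints, 4 slots per bucket, and a bucket count rounded up to a power of two; names here are illustrative, not the library's):

```python
import math

def filter_geometry(n_items, slots_per_bucket=4, fingerprint_bits=12):
    """Sketch of the sizing arithmetic (illustrative, not the library's code)."""
    # Round the bucket count up to the next power of two.
    buckets = 1 << math.ceil(math.log2(n_items / slots_per_bucket))
    slots = buckets * slots_per_bucket
    table_bytes = slots * fingerprint_bits // 8
    load_factor = n_items / slots
    return table_bytes, load_factor

size, load = filter_geometry(2_013_265)
# size == 3 * 1024 * 1024 (3 MiB) and load is ~0.96, right at the edge
# where random-walk insertion starts to fail.
```

Both 2,013,265 and 16,106,127 land almost exactly at a 0.96 load factor after the power-of-two rounding, which is likely why these particular sizes fail so often.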

I suspect the limit on Add()'s kickout count should not be constant. "An Analysis of Random-Walk Cuckoo Hashing" shows a polylogarithmic upper bound on the length of kickout sequences in a different but related cuckoo hash table.
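To make the failure mode concrete, here is a minimal random-walk insert sketch with a fixed kickout bound (hypothetical code, not the library's; `hash(fp)` stands in for the fingerprint hash used by partial-key cuckoo hashing):

```python
import random

def random_walk_insert(table, slots_per_bucket, fp, i1, i2, max_kicks=500):
    """Minimal random-walk cuckoo insert (sketch, not the library's code).
    max_kicks is the constant bound under discussion: the suggestion above
    is that it should grow polylogarithmically with table size instead."""
    for i in (i1, i2):
        if len(table[i]) < slots_per_bucket:
            table[i].append(fp)
            return True
    i = random.choice((i1, i2))
    for _ in range(max_kicks):
        # Evict a random resident fingerprint and take its slot.
        j = random.randrange(slots_per_bucket)
        fp, table[i][j] = table[i][j], fp
        # Partial-key cuckoo hashing: derive the alternate bucket from the
        # evicted fingerprint (hash() is a stand-in for the real tag hash).
        i = (i ^ hash(fp)) % len(table)
        if len(table[i]) < slots_per_bucket:
            table[i].append(fp)
            return True
    return False  # the insert failure this issue reports
```

Near a 0.96 load factor almost every bucket along the walk is full, so the chance of exhausting a constant `max_kicks` grows with the number of inserts attempted.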

dave-andersen commented 7 years ago

I suggest a quick hack: reduce 0.96 (which is close to the theoretical bound) to 0.94. This is a fairly insignificant increase in memory consumption for nearly all workloads, but it substantially reduces the tail failure probability.
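The effect of that change can be sketched as follows (an approximation of the constructor's sizing rule, assuming it rounds the bucket count up to a power of two and doubles it once more whenever the resulting load factor would exceed the threshold):

```python
import math

def num_buckets(n_items, max_load=0.96, slots_per_bucket=4):
    """Approximate sizing rule (assumption, not the library's exact code)."""
    # Next power of two of n_items / slots_per_bucket.
    buckets = 1 << math.ceil(math.log2(max(1, n_items // slots_per_bucket)))
    # Double once more if the table would be loaded past max_load.
    if n_items > buckets * slots_per_bucket * max_load:
        buckets <<= 1
    return buckets

# At 0.96 the failing example from this issue stays at 2^19 buckets
# (~0.96 load); at 0.94 it doubles to 2^20, trading memory for
# reliable inserts.
print(num_buckets(2_013_265, 0.96), num_buckets(2_013_265, 0.94))
```

Only sizes that land in the narrow band between 0.94 and 0.96 of a power-of-two capacity pay the doubling; every other size is unaffected, which is why the memory cost is small for nearly all workloads.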

If we switch to a true BFS-based insert as I suggested in the other issue, we can increase the maximum effective cuckoo count a bit without incurring as much of a slowdown. It'd then be worth doing a bit of empirical analysis of what the constants should be with polylog scaling -- or we could just go big and feel safe for almost all workloads.
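A BFS-based insert along those lines might look like the sketch below (hypothetical code, with `alt()` standing in for the partial-key alternate-bucket computation). It searches for the shortest chain of evictions ending in a free slot, then replays that chain, so raising the search depth grows the explored neighborhood exponentially rather than linearly:

```python
from collections import deque

def bfs_insert(table, slots_per_bucket, fp, i1, i2, max_depth=4):
    """Breadth-first cuckoo insert (sketch): find the shortest eviction
    chain that ends at a free slot, then apply it."""
    def alt(i, f):
        # Stand-in for partial-key cuckoo hashing's alternate bucket.
        return (i ^ hash(f)) % len(table)

    queue = deque([(i1, ()), (i2, ())])  # (bucket, eviction path so far)
    while queue:
        i, path = queue.popleft()
        if len(table[i]) < slots_per_bucket:
            # Replay the path forwards: each displaced fingerprint moves
            # one step along the chain; the last lands in i's free slot.
            cur = fp
            for b, s in path:
                cur, table[b][s] = table[b][s], cur
            table[i].append(cur)
            return True
        if len(path) < max_depth:
            for s in range(slots_per_bucket):
                queue.append((alt(i, table[i][s]), path + ((i, s),)))
    return False
```

With 4 slots per bucket, a depth-d search reaches on the order of 2 * 4^d candidate slots, versus the d buckets a random walk of the same length visits — which is why the effective cuckoo count can go up without a proportional slowdown.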