Callidon / bloom-filters

JS implementation of probabilistic data structures: Bloom Filter (and its derived), HyperLogLog, Count-Min Sketch, Top-K and MinHash
https://callidon.github.io/bloom-filters/
MIT License
378 stars 43 forks source link

Bug: Endless recursion in getDistinctIndices with certain parameters. #34

Closed kblauhut closed 3 years ago

kblauhut commented 3 years ago

Explanation The getDistinctIndicesBis function within getDistinctIndices will fail with a Maximum call stack size exceeded error when there are too few distinct indizes being returned while n < size.

How to reproduce

const filter = new BloomFilter(39, 28);
filter.add("da5e21f8a67c4163f1a53ef43515bd027967da305ecfc741b2c3f40f832b7f82");

This code will trigger the issue.

Why does this happen?

// This returns the same hashes after n is larger than size
const hashes = hashTwice(elem, true, seed! + (size % n));
// This return from the doubleHashing function will also return the same values when called with the same hashes and different n values
return Math.abs((hashA + n * hashB) % size);

This combined with the filtering of double indizes in getDistinctIndicesBis will lead to the issue. Its important to note that this will only happen in certain cases with high hash count and a low bit count.

danielaugustin commented 3 years ago

For every n >= size the function hashTwice does produce the same hashes (hashA and hashB) for every n.

If n >= size and hashB is dividable by size the function doubleHashing returns the same index for every n. This will always produce the same parameters for the recursive calls (getDistinctIndicesBis), resulting in a stack overflow.

Example (hashA=23, hashB=21, size=7 => ind always will become 23 % 7 = 2):

(23 + 7 * 21) mod 7 = 2
(23 + 8 * 21) mod 7 = 2
(23 + 9 * 21) mod 7 = 2
(23 + 10 * 21) mod 7 = 2
...

Possible fixes:

  1. (dirty) Within getDistinctIndicesBis put a bracket around seed! + size when calling hashTwice. This delays the point where every recursive call produces the same hashes. This fix does only delay
  2. Within doubleHashing replace n * hashB with n + hashB. The term (hashA + n + hashB) % size will always produce different indizes.
  3. Before calling doubleHashing check if hashB is dividable by size. In this case add a number (!= size) to hashB.

I checked fix (2) and it seems to work. Because the link in the comment (http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=4060353E67A356EF9528D2C57C064F5A?doi=10.1.1.152.579&rep=rep1&type=pdf) does not work I was not able to check if doubleHashing was implemented correctly .

Hope this helps.

folkvir commented 3 years ago

Thank you for sharing this, I will investigate this and post news as soon as possible.

folkvir commented 3 years ago

Back, Fix pushed in branch fix-34 https://github.com/Callidon/bloom-filters/tree/fix-34 Is it possible to test it with a production test?

The difference is the hashTwice(elem, true, seed! + (size % n)) which is now hashTwice(elem, true, seed! + size + n) and is now a classic while loop instead of a recursive function.

You have to know that generating distinct indexes using this technique is very time consuming. In fact if you want to generate N distinct indexes in a set of N elements, the probability to find the latest will be infinitely small in a very large set. But for a BloomFilter, we get a number of indexes equal to the number of hash functions used, so practically this is very fast.