Closed · kblauhut closed this issue 3 years ago

### Explanation

The `getDistinctIndicesBis` function within `getDistinctIndices` fails with a *Maximum call stack size exceeded* error when too few distinct indices are produced while `n < size`.

### How to reproduce

This code will trigger the issue.
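(The original snippet is not preserved in this excerpt; the following is a minimal sketch that should reproduce the error on affected versions, assuming the constructor `new BloomFilter(size, nbHashes)` takes the filter size in bits and the number of hash functions.)

```ts
import { BloomFilter } from 'bloom-filters'

// A very small filter whose hash count is close to its bit count makes it
// likely that some element hits the degenerate case described below.
const filter = new BloomFilter(7, 6)

// On affected versions, some insertion eventually throws:
// RangeError: Maximum call stack size exceeded
for (let i = 0; i < 1000; i++) {
  filter.add(`element-${i}`)
}
```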
### Why does this happen?

For every `n >= size`, the function `hashTwice` produces the same hashes (`hashA` and `hashB`).
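The reason, as the maintainer's comment below shows, is that the old call was `hashTwice(elem, true, seed! + (size % n))`, and for every `n > size` the term `size % n` is just `size`, so the seed stops changing. A quick check:

```ts
const size = 7

// Once n exceeds size, size % n is simply size, so the seed
// seed + (size % n) is the same constant for every further n.
for (let n = 8; n <= 12; n++) {
  console.log(`size % ${n} = ${size % n}`) // always 7
}
```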
If `n >= size` and `hashB` is divisible by `size`, the function `doubleHashing` returns the same index for every `n`. This always produces the same parameters for the recursive calls (`getDistinctIndicesBis`), resulting in a stack overflow.
Example (`hashA = 23`, `hashB = 21`, `size = 7`), where `ind` always becomes `23 % 7 = 2`:

```
(23 +  7 * 21) mod 7 = 2
(23 +  8 * 21) mod 7 = 2
(23 +  9 * 21) mod 7 = 2
(23 + 10 * 21) mod 7 = 2
...
```
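Concretely, with the double-hashing formula `(hashA + n * hashB) % size` implied by the example (a sketch, not the library's exact code):

```ts
// Double hashing: index = (hashA + n * hashB) % size.
function doubleHashing(n: number, hashA: number, hashB: number, size: number): number {
  return (hashA + n * hashB) % size
}

// When hashB is divisible by size, n * hashB ≡ 0 (mod size),
// so the result collapses to hashA % size for every n.
for (let n = 7; n <= 10; n++) {
  console.log(doubleHashing(n, 23, 21, 7)) // always prints 2
}
```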
This, combined with the filtering of duplicate indices in `getDistinctIndicesBis`, leads to the issue. It is important to note that this only happens in certain cases, with a high hash count and a low bit count.

Possible fixes:

1. `getDistinctIndicesBis`: put brackets around `seed! + size` when calling `hashTwice` (i.e. `(seed! + size) % n` instead of `seed! + (size % n)`). This delays the point where every recursive call produces the same hashes, but it only postpones the issue.
2. `doubleHashing`: replace `n * hashB` with `n + hashB`. The term `(hashA + n + hashB) % size` will always produce different indices (see the sketch after this list).
3. `doubleHashing`: check whether `hashB` is divisible by `size`; in that case, add a number (`!= size`) to `hashB`.

I checked fix (2) and it seems to work. Because the link in the comment (http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=4060353E67A356EF9528D2C57C064F5A?doi=10.1.1.152.579&rep=rep1&type=pdf) does not work, I was not able to check whether `doubleHashing` was implemented correctly.
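A minimal sketch of fix (2), reusing the example numbers from above:

```ts
// Fix (2): n + hashB instead of n * hashB. Incrementing n by 1 now shifts
// the index by 1 (mod size), so consecutive n values cycle through every index.
function doubleHashingFixed(n: number, hashA: number, hashB: number, size: number): number {
  return (hashA + n + hashB) % size
}

for (let n = 7; n <= 10; n++) {
  console.log(doubleHashingFixed(n, 23, 21, 7)) // prints 2, 3, 4, 5
}
```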
Hope this helps.
Thank you for sharing this, I will investigate this and post news as soon as possible.
Back. Fix pushed in branch fix-34: https://github.com/Callidon/bloom-filters/tree/fix-34. Is it possible to test it with a production test?
The difference is that `hashTwice(elem, true, seed! + (size % n))` is now `hashTwice(elem, true, seed! + size + n)`, and the function is now a classic while loop instead of a recursive one.
You have to know that generating distinct indexes using this technique is very time consuming: if you want to generate N distinct indexes from a set of N elements, the probability of finding the last one becomes vanishingly small for a very large set. But for a BloomFilter, the number of indexes we need equals the number of hash functions used, so in practice this is very fast.
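For reference, a sketch of what the iterative version described above could look like (the exact code in branch fix-34 may differ; `hashTwice` is assumed to return a pair of hashes for a given seed):

```ts
// Assumed shape of the library's helper: two hash values for a given seed.
declare function hashTwice(elem: string, asInt: boolean, seed: number): { first: number; second: number }

// Iterative getDistinctIndices: a loop instead of recursion, so no stack
// overflow, and a seed (seed + size + n) that changes on every iteration,
// so the hashes can no longer get stuck producing a single index.
// Note: count must be <= size, otherwise no set of distinct indices exists.
function getDistinctIndices(elem: string, size: number, count: number, seed: number): number[] {
  const indices = new Set<number>()
  let n = 0
  while (indices.size < count) {
    const { first: hashA, second: hashB } = hashTwice(elem, true, seed + size + n)
    indices.add((hashA + n * hashB) % size)
    n++
  }
  return [...indices]
}
```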