Open b4hand opened 7 years ago
The tests pass for me locally on Java 7, and the build failure in Travis looks to be completely unrelated to this code. I'm not sure what to do about it.
Ultimately, I think it would be nice to make BloomFilter implement ICardinality as well, but that will require a larger refactor since currently BloomFilter doesn't accept hashes of items directly. I think it is possible to support that functionality though with some bigger changes.
Looking at the build history #108 has the same build failure as well, so it seems highly unlikely that the build failure is due to this code.
I rebased to the latest master which drops the openjdk7 build, so the build now passes.
Based on my read of the paper cited there, I'd want something that helps check whether A is sufficiently far from N for the estimate to be valid.
(Sorry for the unintentional close there)
I believe when A approaches N, that means fractionOfBits
approaches 1 (ie. the filter is saturated). According to the documentation for Math.log1p
, as -fractionOfBits
approaches -1, Math.log1p(-fractionOfBits)
approaches negative infinity which means the expression -m * Math.log1p(-fractionOfBits) / hashCount
will become positive infinity. Math.round
says that it returns Long.MAX_VALUE
in this case, so that seems reasonable behavior to me. I can play around with a few values and see if it produces reasonable numbers at these outliers, but I specifically chose Math.log1p
instead of Math.log
for this behavior.
The alternative given in the paper is to use equation 5 under these circumstances for an alternative estimate. What threshold do you think is appropriate for "sufficiently far"?
can you add more tests of larger, random data sets to show this works for non-trivial use cases?
This is based on the formula provided by the Wikipedia article:
https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter