addthis / stream-lib

Stream summarizer and cardinality estimator.
Apache License 2.0
2.26k stars 556 forks source link

Add cardinality support to BloomFilter. #133

Open b4hand opened 7 years ago

b4hand commented 7 years ago

This is based on the formula provided by the Wikipedia article:

https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter

b4hand commented 7 years ago

The tests pass for me locally on Java 7, and the build failure in Travis looks to be completely unrelated to this code. I'm not sure what to do about it.

b4hand commented 7 years ago

Ultimately, I think it would be nice to make BloomFilter implement ICardinality as well, but that will require a larger refactor since currently BloomFilter doesn't accept hashes of items directly. I think it is possible to support that functionality though with some bigger changes.

b4hand commented 7 years ago

Looking at the build history #108 has the same build failure as well, so it seems highly unlikely that the build failure is due to this code.

b4hand commented 7 years ago

I rebased to the latest master which drops the openjdk7 build, so the build now passes.

mythguided commented 7 years ago

Based on my read of the paper cited there, I'd want something that helps check whether A is sufficiently far from N for the estimate to be valid.

(Sorry for the unintentional close there)

b4hand commented 7 years ago

I believe when A approaches N, that means fractionOfBits approaches 1 (ie. the filter is saturated). According to the documentation for Math.log1p, as -fractionOfBits approaches -1, Math.log1p(-fractionOfBits) approaches negative infinity which means the expression -m * Math.log1p(-fractionOfBits) / hashCount will become positive infinity. Math.round says that it returns Long.MAX_VALUE in this case, so that seems reasonable behavior to me. I can play around with a few values and see if it produces reasonable numbers at these outliers, but I specifically chose Math.log1p instead of Math.log for this behavior.

The alternative given in the paper is to use equation 5 under these circumstances for an alternative estimate. What threshold do you think is appropriate for "sufficiently far"?

abramsm commented 7 years ago

can you add more tests of larger, random data sets to show this works for non-trivial use cases?