Baqend / Orestes-Bloomfilter

Library of different Bloom filters in Java with optional Redis-backing, counting and many hashing options.
Other
839 stars 245 forks source link

Adding Population Count, looking for implementation advice #5

Closed ChrisCurtin closed 9 years ago

ChrisCurtin commented 10 years ago

Hi,

We have added a population count to the regular and Redis-backed bloom filters, but before submitting a pull request wanted to get direction on where this feature should be added.

The changes add a field or Redis key for the population, meaning the unique number of items added to the filter, understanding that the error percent is also reflected in the population counts.

For the first iteration, we've added derived classes for this capability with minor changes to the existing classes. The changes are mostly around figuring out which add calls actually changed bits in the filter.

The most invasive part was adding another method for a bulk add, since addAll today does not return a status to tell us which items were added.

However looking at the implementation we think enhancing the BloomFilters classes would be a better/cleaner way of supporting a population count vs. a derived class. In particular the Redis classes had to change some visibility on methods to allow access to the internal Jedis which made me think this should be in the existing classes.

Can you give us direction on whether or not to implement this directly in the existing classes or keep it external? (For Redis, I figured I could have a flag to NOT do the population unless explicitly created to do it, so as to not have the extra round trip)

Thanks,

Chris

DivineTraube commented 10 years ago

Hi,

it's great to see you develop such a feature. A few questions: -What is the population count? The number of add calls? The number of add calls that set a previously zero bit (i.e. unique elements +/- error rate)? -Do you want this for normal Bloom fiters, Counting or both? -Are you aware of the existing streaming algorithms for counting distinct elements? http://en.wikipedia.org/wiki/Streaming_Algorithm -Do you know you can also estimate the number of distinct elements added to the BF by counting the number of adds and counting the number of bits that are one? (Efficient for inserts, inefficient for population count queries)

I am currently working on a somewhat orthogonal but related extension: spectral Bloom filter algorithms. These allow to query an estimation for the number of inserts for a particular element in Counting Bloom filters.

If your extensions are straightforward and are an opt-in feature, I do not see why we should not included them in the actual classes.

Thanks, Felix

ChrisCurtin commented 10 years ago

Hi Felix,

Yes the population count is the # of add calls that set at least 1 bit (understanding the false positive possibilities). Our use case is to both use the filter for counting and for membership checks later. For example 'how many unique visitors to the page today' and 'did user XYZ visit the page today?'. A population count is a nice compromise with the other cardinality estimating algorithms we've looked at.

We're only interested in regular bloom filters since we wouldn't be removing items in our use case.

We've implemented a prototype for regular filters that checks the return of the setbit pipeline and increments the counter accordingly. (We also made a ton of other changes to support multiple Redis filters, passing the Jedis connections etc.) I will be working this week on documenting all the changes then I'll push it to my repository.

Thanks,

Chris

ChrisCurtin commented 10 years ago

I've submitted a Pull Request with this feature.

DivineTraube commented 10 years ago

Hi, thanks a lot for your pull request. I've integrated it and will use many parts of it for our next version, we'll release next week. It will fix a lot of things, including better concurrency & error handling, named filters and a cool builder pattern for creating new Bloom filters. I guess the best way to include counting nowadays would be to use the new HyperLogLog of Redis to have rigorous bounds on false positive probabilities that are likely much better than the current heuristic of checking for null bits: https://github.com/echen/streaming-simulations/wiki/Cardinality-Estimation:-Bloom-Filter-vs.-HyperLogLog

Btw, did you use the Bloom filter just for counting or also for membership queries? If so what was your workload in x% membership queries / y% adding / z% get population? I will also include the currently best estimator for the population of a Bloom filter which counts bits to estimate the population and hence is not so fast but quite accurate: http://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter

Best, Felix

ChrisCurtin commented 10 years ago

Hi Felix. We plan on using it for both counting and membership. We've deployed the counting part and are in development of some logic that uses the membership. I don't have a good feel for percentages in production, but we expect to do many more populations checks than adds in the application we're developing. We expect to use it to 'fail fast' by checking for keys in a Redis-backed BF before having to hit an Oracle database.

(FYI I'm doing a presentation to the Atlanta Java Users Group in September about Redis, BloomFilters and failing fast. Your library will get mentioned a few times. Once the video is up I'll let you know.)

Thanks, Chris

DivineTraube commented 10 years ago

Hi, that's great news, let me know if I can provide you with any material for your talk.

We did an almost complete rewrite of the library. All filters now support an getEstimatedPopulation method. Its runtime is linear in the number of bits. In case of Redis it uses the fast BITCOUNT command. We are currently conducting broad performance tests that will include this method. If it turns out to be slow, we will add a configuration option to incrementally update the estimation (there is an algorithm for that, described in this book).

I look forward to the video of your talk, Felix

DivineTraube commented 9 years ago

Hi, did you have a look at the new counting features? They seem to exhibit really good performance on Redis, so I guess we will stick with the BITCOUNT approach.

Best, Felix