FastFilter / fastfilter_java

Fast Approximate Membership Filters (Java)
Apache License 2.0
238 stars 27 forks source link

Select vs. ranked for succinct counting blocked bloom filter #28

Open fredpflintstone opened 2 years ago

fredpflintstone commented 2 years ago

What is the difference between the select version and the ranked version? Thanks.

thomasmueller commented 2 years ago

Yes, documentation is needed... I'm currently away, and won't have time this week sorry. But next week should be ok.

thomasmueller commented 2 years ago

So, the "succinct counting blocked Bloom filter", as the name implies, is a blocked Bloom filter, and at the same time a counting Bloom filter. The data structure contains 3 areas:

There are two ways to succinctly count how often the bits in the first area were set: using "rank", and using "select".

Using "select" may be a bit easier to explain. Say, in the Bloom filter area, we have a block of 64 bits as follows:

00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000001

So only bit 0 and bit 8 are set. The question is, how often (the count). This is stored in a block of 64 bits in the second area. Only bits that are set in the first area need to be counted, so we only need 2 counters. Those are encoded in the number of zeros between two bit that are set. E.g. for counts 1 (for Bloom filter bit 0) and 3 (for Bloom filter bit 8), it is:

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00001001

If we increment the count for bit 0 from 1 to 2, we shift all the bits to the left, and get:

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00010010

If we decrement the count for bit 0 from 1 to 0, we reset the low bit and shift, so we get:

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000100

This was for "select". (I called it "select" because the count is read using the "select" operation).

"Rank" is similar to a https://en.wikipedia.org/wiki/Hash_array_mapped_trie. For "rank", we count the number of bits that are set in the Bloom filter area (in this case it's 2). So the lowest 2 bits are used to say whether the count is higher than 1. For those that are set, we need one more bit in this area, to the left, until there are none left. For counts 1 (for Bloom filter bit 0) and 3 (for Bloom filter bit 8), it is:

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000110

If we increment the count for bit 0 from 1 to 2, we set the last bit and shift, to get:

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00001011

Both "rank" and "select" use roughly the same space (I think rank uses a bit less on average).

The overflow area is needed if a block in the counter area grows too large, in which case a pointer to the overflow area is stored, and the highest bit is set.

Well I see it is very, very hard to explain, and understand. I'm afraid I need to give up here... In order to explain this, more time, and more examples are needed...