alexandrnikitin / bloom-filter-scala

Bloom filter for Scala, the fastest for JVM
https://alexandrnikitin.github.io/blog/bloom-filter-for-scala/
MIT License
376 stars 57 forks source link

Stable Bloom filter #3

Open alexandrnikitin opened 8 years ago

alexandrnikitin commented 8 years ago

Stable Bloom filter

kutchar commented 8 years ago

Here are some alternative JVM based implementations as inspirations ;)

A. https://github.com/jjedele/stable-bloom-filter/blob/master/src/main/java/de/jjedele/sbf/StableBloomFilter.java

B. https://github.com/alexander-chervony/stable-bloom-filter/blob/master/src/main/scala/StableBloomFilter.scala

C. https://github.com/mayconbordin/streaminer/blob/master/src/main/java/org/streaminer/stream/membership/StableBloomFilter.java

Looking at these, seems like we could use the same approach as example A and use UnsafeBitArray instead of byte[]

alexandrnikitin commented 8 years ago

@kutchar Thanks for the info. BTW, @alexander-chervony is a colleague of mine and yes, they are all essentially ports of one lib. The problem with all of them is that there's no math from the paper. They all expect calculated values. One more crucial thing is that they use int to refresh an element which is huge waste of memory. While the recommendation is to pick the Max value as small as possible. "In practice, we limit our choice of Max to 1, 3 and 7 (if higher time cost can be tolerated, larger values of Max can be tried similarly)." Which is just few bits not 4 bytes.

I pushed some code ripped out from our private repo. https://github.com/alexandrnikitin/bloom-filter-scala/commit/867596602870d562061e936340e61da88c8da857 I really feel ashamed for doing that but I don't want to protract anymore. It's Java, not polished and not finished yet. The math is implemented in straightforward way from the original paper and very slooow, it needs some math love. I put an excerpt from the paper in the comments. The only way to use it at the moment is to adjust parameters by hand for your input.