Goals/purpose - Githubissues

richardstartin commented 5 years ago

This looks like a potentially interesting and useful library. What is your goal for it? Is it intended to become a releasable and useable java library, or more like a research artefact?

lemire commented 5 years ago

@richardstartin

The goals are not precisely defined at this time, but it is definitively for research purposes. This is mostly @thomasmueller 's work, and Thomas has some intriguing (and sophisticated) ideas on how to build better filters (better than Bloom). I have been helping out.

I can't speak for Thomas, but I think that there is an open invitation for collaboration.

To make the story short, it is just a fact that you can build filters than are 1) simple 2) fast [faster than Bloom] and 3) concise [smaller than Bloom]. It can be all three or some subset. This being said, the design space is enormous.

My current interest is in better understanding block Bloom filters. They are conceptually very simple. You take a key... you locate one bucket (block of bits) where it goes, and then check the values of some number of bits in this bucket to check for the presence of the key. This is an old idea, but there are many interesting variations on it.

One reason why we should like block Bloom filters is that you can make it so that you have at most one expensive cache miss per query (not counting TLB).

Another intriguing design is the Xor filter... it is faster than Bloom and more concise too... You have just three hits of the RAM, and the operations are super cheap.

thomasmueller commented 5 years ago

It's definitely the goal to make it usable and useful for production. That will need some work, for example code coverage tests, documentation, wrappers, continuous integration, good benchmarks with more real-world data, comparison to other tools (like the Guava Bloom filter).

My current interest is researching new algorithms and improve speed. The xor filter for example, which is already described somewhat in literature but this is the first useful implementation as far as I know. Right now I'm working on the succinct counting Bloom filter, which should need half the space of the regular counting Bloom filter. Then there are some new variants of the cuckoo filter that are faster / need less space than the ones originally described.

Yes, there is an open invitation for collaboration.

thomasmueller commented 5 years ago

I added a first readme file (updates are welcome). Closing this issue.

FastFilter / fastfilter_java

Goals/purpose #3