Callidon / bloom-filters

JS implementation of probabilistic data structures: Bloom Filter (and its derived), HyperLogLog, Count-Min Sketch, Top-K and MinHash
https://callidon.github.io/bloom-filters/
MIT License

Reusing Hashes for faster lookups #60

Open pubkey opened 1 year ago

pubkey commented 1 year ago

Is your feature request related to a problem? Please describe.

My problem: I replicate bloom filters from many clients (up to 1000) to my server. The server then has to look up a given key, e.g. "foobar", and check which of the replicated filters contains it. My bloom filters are pretty big and need about 13 hashes, so to check which replicated filter has the key, the server has to perform 13 * 1000 hash operations.

Describe the solution you'd like

A way to precalculate the hash once and then lookup the existence in multiple bloom filters without having to hash again.

  import { hash } from 'bloom-filters';
  const lookupHash = hash(13, 'foobar');

  const foundInFilters = bloomFilters.filter(bf => bf.hasHash(lookupHash));

Acceptance criteria

For testing, we could add a test that ensures lookups with pre-calculated hashes return exactly the same results as normal key lookups.

Additional context

Also a question: is it possible to use a different hash function? My plan is to run the hashing inside WebAssembly for better performance, so maybe even an async hash function would be useful.

folkvir commented 1 year ago

Hello ✋ Thanks for using our library! First, we won't add any new functions that are not related to the papers behind the filters. But that does not mean you can't do it yourself 👍 You can achieve what you're looking for by following the same rules the filter uses to check the existence of a key, i.e. by computing the indexes yourself:

  public filter_has_in(element: HashableInput, bfs: BloomFilter[]): BloomFilter[] {
    // Compute the indexes once. Beware: all your filters must share the same
    // _size, _nbHashes and seed, otherwise you may get undesirable side-effects.
    const indexes = bfs[0]._hashing.getIndexes(element, bfs[0]._size, bfs[0]._nbHashes, bfs[0].seed);
    const filtered: BloomFilter[] = [];
    for (let i = 0; i < bfs.length; i++) {
      let ok = true;
      // Similar to bfs[i].has(): the key may be present only if every index is set.
      for (let j = 0; j < indexes.length; j++) {
        if (!bfs[i]._filter.has(indexes[j])) {
          ok = false;
          break; // one unset index means the key is definitely absent; skip the rest
        }
      }
      if (ok) {
        filtered.push(bfs[i]);
      }
    }
    // Return the filters which may contain the key.
    return filtered;
  }

I did not test it locally but it should work. Give it a try and let us know if it fits your needs. (CC @Callidon )

pubkey commented 1 year ago

Hi @folkvir, thank you for your detailed answer. Upon further research I found out that bloom filters with known seeds are vulnerable to pollution attacks link

So I will now switch to XOR filters, so that I can rebuild the filter with a different seed each time and the server only has to do 3 hashes per lookup.

Also, I found out there is a new type of XOR filter called the "binary fuse filter", which has a smaller space footprint and is faster to build. link. Is this something that could be added to this library?

folkvir commented 1 year ago

Hi! Yes, the binary fuse filter is definitely something we could add to the library. (CC @Callidon ) For more info: https://github.com/hexops/fastfilter

Yes, it is dangerous to keep the seed in your filters if they are public, so you need to hide it somehow. The XOR filter is very useful if you don't need to add/remove items and only work with relatively few items (< 1M entries); then you can rebuild it completely from scratch every time.

pubkey commented 1 year ago

Does anyone know if XOR filters are also prone to pollution attacks, like bloom filters are?