Callidon / bloom-filters

JS implementation of probabilistic data structures: Bloom Filter (and its derived), HyperLogLog, Count-Min Sketch, Top-K and MinHash
https://callidon.github.io/bloom-filters/
MIT License

Question about selecting the size and hash values for a classic Bloom filter #45

Closed: Eric24 closed this issue 2 years ago

Eric24 commented 2 years ago

Is there any guidance on optimal values for size and hash when creating a classic Bloom filter (or similar for the other types)? Also, what specifically does "size" refer to (number of entries, bytes, etc.)?

Callidon commented 2 years ago

Hi 👋

Is there any guidance on optimal values for size and hash when creating a classic Bloom filter (or similar for the other types)?

According to the scientific literature, there are several ways of computing the optimal number of hash functions, depending on what you want to optimise. For example, if you know the maximum number of items that the filter will hold, you can optimise for a precise error rate. That's exactly what the BloomFilter.create function does. Internally, we use the optimalHashes function to compute this value.
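For reference, here is a rough sketch of the textbook formulas behind that calibration (this is not the library's internal code, and the helper name is mine): for n expected items and a target false-positive rate p, the optimal number of bits is m = -n * ln(p) / (ln 2)^2 and the optimal number of hash functions is k = (m / n) * ln 2.

const optimalFilterParams = (nbItems, errorRate) => {
  // m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits
  const size = Math.ceil((-nbItems * Math.log(errorRate)) / (Math.LN2 ** 2))
  // k = (m / n) * ln 2, rounded to the nearest whole number of hash functions
  const nbHashes = Math.max(1, Math.round((size / nbItems) * Math.LN2))
  return { size, nbHashes }
}

// optimalFilterParams(1000, 0.01) -> { size: 9586, nbHashes: 7 }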

For most usages, we do not recommend creating your Bloom filter directly with the constructor; use the BloomFilter.create and BloomFilter.from functions instead. But if these functions do not fit your usage of the lib, then I suggest that you look at the scientific literature on this topic, to pick the right formula to optimise the number of hash functions for your case.
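For example, typical usage of those two helpers looks something like this (a quick sketch; the values are arbitrary):

const { BloomFilter } = require('bloom-filters')

// calibrate for an expected number of items and a target error rate
const filter = BloomFilter.create(1000, 0.01) // up to ~1000 items, 1% error rate
filter.add('alice')
console.log(filter.has('alice')) // true
console.log(filter.has('bob'))   // false (up to the ~1% false-positive rate)

// or build an optimally-sized filter directly from a known list of items
const other = BloomFilter.from(['alice', 'bob', 'carol'], 0.01)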

Also, what specifically does "size" refer to (number of entries, bytes, etc.)?

For a BloomFilter, size is the number of bits used by the data structure to store data. In the literature, it is also defined as "the number of cells", but it's the same thing. For example, a Bloom filter of size 32 will use 32 bits to store data.
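As a rough illustration (assuming the classic constructor signature new BloomFilter(size, nbHashes)):

const { BloomFilter } = require('bloom-filters')

// a filter of size 32 allocates 32 cells (bits), i.e. about 4 bytes of payload,
// no matter how many items you later insert into it
const tiny = new BloomFilter(32, 3) // 32 bits, 3 hash functions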

Eric24 commented 2 years ago

@Callidon - Thank you for the quick and detailed response!

I think for my particular use-case I can't use the create or from methods, because I won't know the full list of items in advance. My use-case is to check for the possibility that a key already exists in a database as data is being imported. There could be millions of records (my example set has about 4 million). Each should have a unique key, so the keys are my items, added to the filter as they are imported. Before each record is imported, the filter is also checked to see if the key might already exist (meaning there might be a duplicate key); if the filter returns a positive result, that record is handled differently than one known not to exist, which allows for faster and more efficient handling.
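Roughly, the flow I have in mind is something like this (just a sketch; recordsToImport, handlePossibleDuplicate, and importRecord are placeholders for my own code, and the constructor arguments are guesses I'd still need to calibrate):

const { BloomFilter } = require('bloom-filters')

// ~4 million keys expected; size and number of hashes still to be calibrated
const seen = new BloomFilter(40000000, 7)

for (const record of recordsToImport) {
  if (seen.has(record.key)) {
    // the key *might* already exist, so fall back to a real database check
    handlePossibleDuplicate(record)
  } else {
    // the key has definitely not been seen before: take the fast path
    importRecord(record)
  }
  seen.add(record.key)
}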

Given that, it seems like the constructor is the way to go, maybe with something like a 32-bit size and a "good starting guess" of the number of hashes, which I could then "test" after a sample import? Do you recommend a different approach? Do you recommend a different filter type than the basic Bloom?

Callidon commented 2 years ago

From what you are describing, it looks similar to what distributed databases like HBase or Cassandra do: they use a Bloom filter to cache index accesses per key and avoid costly disk accesses. For that, a classic Bloom filter should do the job just right.

To calibrate the filter size and number of hash functions, I think you should benchmark it, using samples of data with different sizes and contents. It will be a bit tedious, but I don't know a better way to do it, since there are too many unknowns in your equation.

folkvir commented 2 years ago

If you can partition your data, then @Callidon is right: multiple classic Bloom filters could do what you want, and a benchmark is needed to calibrate the parameters. Otherwise, one filter built with .create(dbSize, errorRateDesired) should do the trick. Be careful: a low error rate will significantly increase the size of the filter. As an example, consider this the beginning of a benchmark, with a fixed error rate of 1% and a Bloom filter calibrated for a maximum number of elements:

const { BloomFilter } = require('bloom-filters')

// print the number of hash functions, the number of bits in the filter and the heap used (in MB)
const printBl = (bl) => console.log(bl._nbHashes, bl._filter.length, process.memoryUsage().heapUsed / 1024 / 1024)

const sizes = [10, 100, 1000, 10000, 100000, 1000000, 10000000]
for (const size of sizes) {
  const bl = BloomFilter.create(size, 0.01)
  printBl(bl)
}

The first column is the number of hash functions used, the second is the number of bits in the filter with the current implementation, and the third is the heap used in MB.

7 96 14.244026184082031
7 959 14.27655029296875
7 9586 14.409317016601562
7 95851 15.159866333007812
7 958506 21.44140625
7 9585059 94.41828155517578
7 95850584 1582.5356674194336

I'll close the issue since it is not an actual issue but a request for advice. Feel free to send new messages.

EDIT: those results were obtained with Node v16.13.1 on a 2.3 GHz dual-core Intel Core i5 with 8 GB of 2133 MHz LPDDR3.