Advice on choosing bloom filter parameters

luxe commented 3 years ago

I have 863,049,256 keys whose format is 34 character alphanumeric:

Z6DauUBtzw8p77MLxy7VWYCKN92JfKiCK
AJe8BJYnJHp9DDUrvGgFwmn5oBjhUSgowr
114VNKGsr9M4ogvwjz6ESNUqYdroGyht7r
etc..

Do you have any recommendations on bloom filter parameters? In particular these:

/* k-mer size */
const unsigned k = //?;

/* number of Bloom filter hash functions */
const unsigned numHashes = //?;

/* size of Bloom filter (in bits) */
const unsigned size = //?;

It takes a long time for me to initialize the filter, so exerpimenting with different configs takes time.
Curious if you have any insight into a good starting configuration.

luxe commented 3 years ago

Oh wait. Does this data structure only work with DNA? haha

kmnip commented 3 years ago

The input k-mers are hashed with ntHash. So, they should be nucleotide sequences, which are strings of A, C, G, T, and U.

JustinChu commented 3 years ago

The Bloom filter can technically be used by itself as BloomFilter.hpp doesn't actually include anything from ntHash. The only weird thing is that we store the k-mer size when the datastructure is serialized but it can technically just be set to anything and not used. The BloomFilter.hpp by itself only uses hash values directly when inserting/querying (doesn't come with its own hash function).

To parameterize the bloom filter I would use this function after you pick a false positive rate (FPR) to get the optimal number of hashes (optimal relative to memory usage). https://github.com/bcgsc/btl_bloomfilter/blob/0f19567306806bc3890e31f3f3d01285b9b4d158/BloomFilter.hpp#L419 And this constructor once you get the recommend number of hash functions. https://github.com/bcgsc/btl_bloomfilter/blob/0f19567306806bc3890e31f3f3d01285b9b4d158/BloomFilter.hpp#L83

bcgsc / btl_bloomfilter

Advice on choosing bloom filter parameters #38