Open luxe opened 3 years ago
Oh wait. Does this data structure only work with DNA? haha
The input k-mers are hashed with ntHash. So, they should be nucleotide sequences, which are strings of A, C, G, T, and U.
The Bloom filter can technically be used by itself as BloomFilter.hpp doesn't actually include anything from ntHash. The only weird thing is that we store the k-mer size when the datastructure is serialized but it can technically just be set to anything and not used. The BloomFilter.hpp by itself only uses hash values directly when inserting/querying (doesn't come with its own hash function).
To parameterize the bloom filter I would use this function after you pick a false positive rate (FPR) to get the optimal number of hashes (optimal relative to memory usage). https://github.com/bcgsc/btl_bloomfilter/blob/0f19567306806bc3890e31f3f3d01285b9b4d158/BloomFilter.hpp#L419 And this constructor once you get the recommend number of hash functions. https://github.com/bcgsc/btl_bloomfilter/blob/0f19567306806bc3890e31f3f3d01285b9b4d158/BloomFilter.hpp#L83
I have 863,049,256 keys whose format is 34 character alphanumeric:
Do you have any recommendations on bloom filter parameters? In particular these:
It takes a long time for me to initialize the filter, so exerpimenting with different configs takes time.
Curious if you have any insight into a good starting configuration.