Open DanielHeath opened 5 years ago
Cuckoofilters are not bloom filters, so the parameters are different and the formulas for bloom filters won't work.
For cuckoofilters, those parameters are the capacity
of the filter, BUCKET_SIZE
and FINGERPRINT_SIZE
, which all affect the utilization and the false positive rate. For rust-cuckoofilter, only the first is a run-time parameter, the other two are compile-time parameters. I guess you used a 1Gi capacity filter? Probably the 1 byte default fingerprint size is just too low for your target false positive rate. Due to how this rust version is implemented though, you directly need to go to 2 byte, while other implementations support intermediate steps like 12 bits.
“I've been taking the first 64 bits of the SHA1 hash as the key.” - did you check how many false positives that introduces by itself, i.e. how many of thos hashes share their first 64bits? Might be beneficial to put the whole hashes into the filter.
Thanks for the pointer to increasing fingerprint size - will give it a shot.
I'm detecting false positives by generating an equal number of passwords not found in the dataset.
The filter (currently) only takes a u64; I should be able to expand that to take the whole hash, though.
I've tried setting FINGERPRINT_SIZE
to 2 (bytes).
It grew the exported filter size (gzipped) by a factor of 4, and increased the false positive rate to 99.99%.
I suspect it's not as simple as changing that one constant. Any tips for where else I should be looking?
It should be that easy. That sounds like you found a undiscovered bugs in the code. Due to having to compile a custom version, I suspect that code was seldomly tested with other values than a single byte.
Eventually, we will be able to use const generics for the fingerprint size so that you don't have to edit the code anymore to change it, and therefore to get it some test coverage etc.
I ran a check for collisions on the first 16 chars of the sha1 hash - none in the dataset I have.
I've tried setting
FINGERPRINT_SIZE
to 2 (bytes).It grew the exported filter size (gzipped) by a factor of 4, and increased the false positive rate to 99.99%.
@DanielHeath I was just checking through all the old issues and this reminded me of one of the changes I made. This commit (deadf778ecbbc5075e20195ce9561eaf3980125d) may be why changing the fingerprint size didn't have the desired impact.
I'm using rust-cuckoofilter to check entries in the 'Have I Been Pwned' master list (551509767 SHA1 hashes).
I've been taking the first 64 bits of the SHA1 hash as the key.
According to this bloom filter calculator, a 1gb bloom filter should be able to give me a 0.1% false positive rate on the full dataset.
However, when I load this data into rust-cuckoofilter, I get >2% false positives with a 1gb cuckoo filter.
I'm happy to provide any assistance with diagnosing further.