Closed solardiz closed 4 years ago
I verified the problem.
However, the problem is upstream, in fastfilter_cpp, not here. I am going to report it there.
Upstream issue at https://github.com/FastFilter/fastfilter_cpp/issues/8
Please take further discussion there.
using non-overlapping bitmasks instead of the reduce trick would have avoided this issue
At 500M, we need 6 bits of freedom for the in-word pointer and more than 26 bits of freedom to access the array of words (since the array length reaches 100M). So we need more than 26+6=32 bits. Unfortunately, we use a 32-bit input word (a) so we are bound to have problems.
This passes my testing now:
$ ./build_filter -V -o bloom12.bin -f bloom12 pwned-passwords-sha1-ordered-by-hash-v5.txt
I read 555278657 hashes in total (46.960 seconds).
Bytes read = 24470276290.
Constructing the filter...
Done in 56.360 seconds.
Checking for false negatives
Verified with success: no false negatives
estimated false positive rate: 0.320 percent
filter data saved to bloom12.bin. Total bytes = 832918016.
$ gzip -1c bloom12.bin | wc -c
833059411
Inputting the exact numbers 555278657 and 832918016*8 = 6663344128 into https://hur.st/bloomfilter/?n=555278657&p=&m=6663344128&k= I am getting:
n = 555278657
p = 0.00314235 (1 in 318)
m = 6663344128 (794.33MiB)
k = 8
The actually achieved 0.320% is only slightly worse than the expected 0.314%. Perhaps even this slight discrepancy could be avoided by using more elaborate hash functions, but this is an acceptably good result nevertheless.
Thanks for the check.
As discussed in #2:
Even though the array indexing is OK for up to 32 GiB, the information being used as array index and as bit index is redundant and the set of possible values of
a
is insufficient to use a Bloom filter larger than 4 gigabits. In fact, I expect that things start being sub-optimal already at smaller sizes (below 4 gigabits) because the information thatreduce
extracts overlaps with whatgetBit
extracts (using non-overlapping bitmasks instead of the reduce trick would have avoided this issue, but would have required the filter size to be a power of 2). The result is probably a higher than expected false positive rate, and the Bloom filter being compressible: