Bloom filter is suspiciously compressible [upstream bug: fastfilter_cpp]

solardiz commented 4 years ago

As discussed in #2:

Even though the array indexing is OK for up to 32 GiB, the information being used as array index and as bit index is redundant and the set of possible values of a is insufficient to use a Bloom filter larger than 4 gigabits. In fact, I expect that things start being sub-optimal already at smaller sizes (below 4 gigabits) because the information that reduce extracts overlaps with what getBit extracts (using non-overlapping bitmasks instead of the reduce trick would have avoided this issue, but would have required the filter size to be a power of 2). The result is probably a higher than expected false positive rate, and the Bloom filter being compressible:

$ ./build_filter -V -o bloom12.bin -f bloom12 pwned-passwords-sha1-ordered-by-hash-v5.txt
I read 555278657 hashes in total (52.480 seconds).
Bytes read = 24470276290.
Constructing the filter...
Done in 54.090 seconds.
Checking for false negatives
Verified with success: no false negatives
filter data saved to bloom12.bin. Total bytes = 832918016. 
$ gzip -1c bloom12.bin | wc -c
697779389

lemire commented 4 years ago

I verified the problem.

However, the problem is upstream, in fastfilter_cpp, not here. I am going to report it there.

lemire commented 4 years ago

Upstream issue at https://github.com/FastFilter/fastfilter_cpp/issues/8

Please take further discussion there.

using non-overlapping bitmasks instead of the reduce trick would have avoided this issue

At 500M, we need 6 bits of freedom for the in-word pointer and more than 26 bits of freedom to access the array of words (since the array length reaches 100M). So we need more than 26+6=32 bits. Unfortunately, we use a 32-bit input word (a) so we are bound to have problems.

lemire commented 4 years ago

Fixed with commit https://github.com/FastFilter/FilterPassword/commit/1ba514e2407e08ab6e0b2261b526c60fbec4629a

solardiz commented 4 years ago

This passes my testing now:

$ ./build_filter -V -o bloom12.bin -f bloom12 pwned-passwords-sha1-ordered-by-hash-v5.txt
I read 555278657 hashes in total (46.960 seconds).
Bytes read = 24470276290.
Constructing the filter...
Done in 56.360 seconds.
Checking for false negatives
Verified with success: no false negatives
estimated false positive rate: 0.320 percent
filter data saved to bloom12.bin. Total bytes = 832918016.
$ gzip -1c bloom12.bin | wc -c
833059411

Inputting the exact numbers 555278657 and 832918016*8 = 6663344128 into https://hur.st/bloomfilter/?n=555278657&p=&m=6663344128&k= I am getting:

n = 555278657
p = 0.00314235 (1 in 318)
m = 6663344128 (794.33MiB)
k = 8

The actually achieved 0.320% is only slightly worse than the expected 0.314%. Perhaps even this slight discrepancy could be avoided by using more elaborate hash functions, but this is an acceptably good result nevertheless.

lemire commented 4 years ago

Thanks for the check.

FastFilter / FilterPassword

Bloom filter is suspiciously compressible [upstream bug: fastfilter_cpp] #4