FastFilter / fastfilter_cpp

Fast Approximate Membership Filters (C++)
Apache License 2.0
261 stars 24 forks source link

The false positive rate of Bloom Filter is much larger than theoretical value #32

Closed Ln7-best closed 1 year ago

Ln7-best commented 1 year ago

I randomly generated 65536 64-bits keys and 1000000 negtive queries. And I allocated 18*65536 bits for 65536 keys stored in Bloom FIlter. The number of hash functions is 12. While the output shows that the bloom filter produces 25350 false positives (fpr=0.02535), the number of which should be about 175 (fpr=0.000175) in theory. Is this normal? Here is my test code:

#include<iostream>
#include"bloom.h"
#include"utils.h"
int main() {
    bloomfilter::BloomFilter<uint64_t, 18, false> bf(65536);
    std::vector<util::Key> insert_keys;
    std::vector<util::Key> lookup_keys;
    util::read_workload("insert_operation.txt", insert_keys);
    util::read_workload("lookup_operation.txt", lookup_keys, 1);
    for (uint64_t i = 0; i < insert_keys.size(); i++)
    {
        bf.Add(insert_keys[i].val);
    }
    uint64_t cnt = 0;
    for (uint64_t i = 0; i < lookup_keys.size(); i++)
    {
        int ans = bf.Contain(lookup_keys[i].val);
        if (ans != 1)
            cnt++;
    }
    std::cout << cnt << std::endl;
}
lemire commented 1 year ago

I cannot verify the issue. Please run the following program:

#include "bloom.h"
#include <random>
#include <iostream>

bool check_bloom() {
    std::cout << "testing Bloom" << std::endl;
    bloomfilter::BloomFilter<uint64_t, 18, false> bf(65536);
    for (uint64_t i = 0; i < 65536; i++) {
        bf.Add(i);
    }
    std::mt19937 gen;
    std::discrete_distribution<> bytes_count;
    std::uniform_int_distribution<uint64_t> val_64bit{0x0, 0xffffffffffffffff};
    uint64_t cnt = 0;
    uint64_t count = 50000000;
    for (uint64_t i = 0; i < count; i++)
    {
        int ans = bf.Contain(val_64bit(gen));
        if (ans != 1) { cnt++; }
    }
    double rate = double(cnt)/count;
    std::cout << double(cnt)/count << std::endl;
    return rate < 0.00018;
}

int main() {
    return check_bloom() ? EXIT_SUCCESS : EXIT_FAILURE;
}
Ln7-best commented 1 year ago

Thanks for the reply! I just run the test program you mentioned. The output still shows that the false positive rate is very high. The return value of the main function is 1. 屏幕截图 2023-03-07 230703

lemire commented 1 year ago

@Ln7-best Thanks for your feedback. If you believe that there is a bug, please consider providing a pull request, or at least code to reproduce the tests in our continuous-integration harness.

lemire commented 1 year ago

Note that it is not possible for us to fix an issue if we cannot reproduce it. We ran tests under Linux with GCC and macOS with LLVM. Ideally, you should investigate the nature of the issue affecting you.

Ln7-best commented 1 year ago

It is weird that I run the program under Linux with gcc and the result is very close to the theoretical value. It seems the bug only exists under windows with MSVC. I need to check this later. Thank you so much.