FastFilter / fastfilter_cpp

Fast Approximate Membership Filters (C++)
Apache License 2.0
261 stars 24 forks source link

Bloom filters broken for large inputs (~500M entries) #8

Closed lemire closed 4 years ago

lemire commented 4 years ago

For large inputs, the false positive rate increases... At 500M we can consider it broken. As far as I can tell, it is not an overflow issue but rather a hashing issue.

$ ./bloomunit 1000
array size = 1000
estimated false positive rate: 0.296 percent
$ ./bloomunit 10000
array size = 10000
estimated false positive rate: 0.296 percent
$ ./bloomunit 10000000
array size = 10000000
estimated false positive rate: 0.348 percent
$ ./bloomunit 500000000
array size = 500000000
estimated false positive rate: 1.794 percent
#include <sstream>

#include "bloom.h"

int main(int argc, char **argv) {
  if(argc < 2) {
    printf("Please pass a number as a paramater.\n");
    return EXIT_FAILURE;
  }
  size_t array_size = atoll(argv[1]);//500 * 1000 * 1000;
  printf("array size = %zu\n", array_size);
  using Table = bloomfilter::BloomFilter<uint64_t, 12, false, SimpleMixSplit>;
  Table table(array_size);
  for (size_t i = 0; i < array_size; i++) {
    table.Add(i);
  }
  size_t matches = 0;
  size_t volume = 100000;
  for(size_t t = 0; t < volume; t++) {
    if(table.Contain(t + array_size) == bloomfilter::Ok) {
      matches++;
    }
  }
  printf("estimated false positive rate: %.3f percent\n", matches * 100.0 / volume);
}

I expect that the problem is explained by this comment...

   // use the fact that reduce is not very sensitive to lower bits of a
    data[reduce(a, this->arrayLength)] |= getBit(a);

We have that a is a 32-bit value. If we have 500M inputs, then arrayLength should be about 100M or 2**27. So, roughly speaking reduce(a, this->arrayLength) uses the top 27 bits... while getBit(a) needs 6 bits. 27+6 = 33... We are exceeding the 32 bit input.

lemire commented 4 years ago

cc @thomasmueller @solardiz

solardiz commented 4 years ago

Your analysis looks correct to me.

My suggestion that "using non-overlapping bitmasks instead of the reduce trick would have avoided this issue" was only for the case when 32 bits are still enough (as I had mentioned, for filters of up to 4 gigabits in size).

lemire commented 4 years ago

I have an easy fix coming up.

lemire commented 4 years ago

Fixed by moving to 64-bit:

https://github.com/FastFilter/fastfilter_cpp/commit/f9c32cfa7726d5206ead33be5de1df43ef254fc5