FastFilter / fastfilter_java

Fast Approximate Membership Filters (Java)
Apache License 2.0

Bloom filters untested for large (500M) inputs #22

Open lemire opened 4 years ago

lemire commented 4 years ago

It is likely that the Java Bloom filter implementation suffers from the same issue as the C++ implementation for large inputs https://github.com/FastFilter/fastfilter_cpp/issues/8

thomasmueller commented 4 years ago

I found that the fpp (false-positive probability) degrades for 100 million keys and larger.

As far as I can see, the solution for the C++ version was to use 128-bit multiplication, which isn't available in Java (at least, not fast). I found the following works for Java:

public void add(long key) {
    long hash = Hash.hash64(key, seed);
    // rotate the 64-bit hash by 32 so that a and b start out different
    long a = (hash >>> 32) | (hash << 32);
    long b = hash;
    for (int i = 0; i < k; i++) {
        // high 32 bits of a select the word, low 6 bits select the bit
        data[Hash.reduce((int) (a >>> 32), arraySize)] |= 1L << a;
        // double hashing: advance to the next probe position
        a += b;
    }
}

public boolean mayContain(long key) {
    long hash = Hash.hash64(key, seed);
    long a = (hash >>> 32) | (hash << 32);
    long b = hash;
    for (int i = 0; i < k; i++) {
        if ((data[Hash.reduce((int) (a >>> 32), arraySize)] & (1L << a)) == 0) {
            return false;
        }
        a += b;
    }
    return true;
}
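For context, here is a minimal self-contained sketch of the scheme above. `Hash.hash64` and `Hash.reduce` are the library's helpers; in this sketch they are stood in for by a Murmur3-style 64-bit finalizer and a fastrange-style reduction (`(x * n) >>> 32` on the low 32 bits), which is what the library's `reduce` is assumed to do. Names and parameters here are illustrative, not the actual implementation.

```java
// Minimal double-hashing Bloom filter sketch (not the library's real code).
class BloomSketch {
    private final long[] data;   // bit array, stored as 64-bit words
    private final int arraySize; // number of 64-bit words
    private final int k;         // number of probes per key
    private final long seed;

    BloomSketch(long expectedKeys, int bitsPerKey, int k, long seed) {
        int words = (int) Math.max(1, (expectedKeys * bitsPerKey + 63) / 64);
        this.data = new long[words];
        this.arraySize = words;
        this.k = k;
        this.seed = seed;
    }

    // Murmur3-style 64-bit finalizer (stand-in for Hash.hash64)
    private static long hash64(long x, long seed) {
        x += seed;
        x = (x ^ (x >>> 33)) * 0xff51afd7ed558ccdL;
        x = (x ^ (x >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return x ^ (x >>> 33);
    }

    // fastrange: map a 32-bit value to [0, n) without a modulo
    private static int reduce(int hash, int n) {
        return (int) (((hash & 0xffffffffL) * n) >>> 32);
    }

    public void add(long key) {
        long hash = hash64(key, seed);
        long a = (hash >>> 32) | (hash << 32); // rotate by 32 bits
        long b = hash;
        for (int i = 0; i < k; i++) {
            // high 32 bits pick the word, low 6 bits pick the bit within it
            data[reduce((int) (a >>> 32), arraySize)] |= 1L << a;
            a += b; // double hashing: next probe
        }
    }

    public boolean mayContain(long key) {
        long hash = hash64(key, seed);
        long a = (hash >>> 32) | (hash << 32);
        long b = hash;
        for (int i = 0; i < k; i++) {
            if ((data[reduce((int) (a >>> 32), arraySize)] & (1L << a)) == 0) {
                return false;
            }
            a += b;
        }
        return true;
    }
}
```

Because the probe sequence is deterministic, every added key is always found; only negative lookups can return false positives.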

Is there anything wrong with the above? Before the change:

size 2000000 BLOOM fpp: 0.008355 size: 2000000 bits/key: 10.0 add ns/key: 60.0 lookup 0% ns/key: 15.0 lookup 100% ns/key: 14.0
size 20000000 BLOOM fpp: 0.00825125 size: 20000000 bits/key: 10.0 add ns/key: 150.0 lookup 0% ns/key: 33.0 lookup 100% ns/key: 58.0
size 200000000 BLOOM fpp: 0.008869465 size: 200000000 bits/key: 10.0 add ns/key: 202.0 lookup 0% ns/key: 45.0 lookup 100% ns/key: 84.0

After the change:

size 2000000 BLOOM fpp: 0.0082 size: 2000000 bits/key: 10.0 add ns/key: 45.0 lookup 0% ns/key: 16.0 lookup 100% ns/key: 14.0
size 20000000 BLOOM fpp: 0.00816025 size: 20000000 bits/key: 10.0 add ns/key: 142.0 lookup 0% ns/key: 33.0 lookup 100% ns/key: 61.0
size 200000000 BLOOM fpp: 0.008209775 size: 200000000 bits/key: 10.0 add ns/key: 177.0 lookup 0% ns/key: 46.0 lookup 100% ns/key: 95.0

Notice fpp for 200'000'000 is much better (0.008209775 instead of 0.008869465).

Speed is not affected, it seems.

I can't test with >= 300 million keys.

lemire commented 4 years ago

It looks good to me.

thomasmueller commented 4 years ago

So the change is committed, but there is no test case yet. It's a bit hard to write one, as running it would require a lot of memory, and it would be slow.

It should be possible to write a test case with much less memory, by only remembering the data for a small window of the array and forgetting the rest. The test wouldn't use the real implementation. I'm not sure how such a test could be made fast.
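One possible shape for that windowed test, sketched here with illustrative names and a stand-in hash (a Murmur3-style finalizer plus a fastrange-style reduction) rather than the library's real code: only the words inside a small window of the logical bit array are materialized, and after "adding" n keys the observed bit density in the window is compared with the theoretical density 1 - (1 - 1/m)^(n*k) for m total bits. A skewed probe-index distribution, the symptom of the large-input bug, would show up as a density mismatch, without ever allocating the full array.

```java
// Windowed density check sketch (illustrative, not the repository's test code).
class WindowedBloomDensityTest {
    // Murmur3-style 64-bit finalizer (stand-in for the library's Hash.hash64)
    static long hash64(long x, long seed) {
        x += seed;
        x = (x ^ (x >>> 33)) * 0xff51afd7ed558ccdL;
        x = (x ^ (x >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return x ^ (x >>> 33);
    }

    // fastrange-style reduction to [0, n)
    static int reduce(int hash, int n) {
        return (int) (((hash & 0xffffffffL) * n) >>> 32);
    }

    // Simulates adding n keys to a filter of arraySize 64-bit words, but only
    // materializes the words in [windowStart, windowStart + windowLen).
    // Returns the fraction of set bits inside that window.
    static double windowDensity(int arraySize, int k, long n,
                                int windowStart, int windowLen, long seed) {
        long[] window = new long[windowLen];
        for (long key = 0; key < n; key++) {
            long hash = hash64(key, seed);
            long a = (hash >>> 32) | (hash << 32);
            long b = hash;
            for (int i = 0; i < k; i++) {
                int idx = reduce((int) (a >>> 32), arraySize);
                // only set bits that fall inside the window; forget the rest
                if (idx >= windowStart && idx < windowStart + windowLen) {
                    window[idx - windowStart] |= 1L << a;
                }
                a += b;
            }
        }
        long set = 0;
        for (long w : window) {
            set += Long.bitCount(w);
        }
        return set / (64.0 * windowLen);
    }
}
```

This still hashes all n keys, so it addresses the memory problem but not the speed problem the comment above raises; for 500M keys it would remain slow, just no longer memory-bound.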