Callidon / bloom-filters

JS implementation of probabilistic data structures: Bloom Filter (and its derived), HyperLogLog, Count-Min Sketch, Top-K and MinHash
https://callidon.github.io/bloom-filters/
MIT License
369 stars 42 forks source link

CuckooFilter returns false negative #68

Open tri2820 opened 6 months ago

tri2820 commented 6 months ago

I'm using CuckooFilter to test the membership of the following items. The filter returns false - definitely not in the list - for 2 items. These cases are false negatives which should not happen.

Is this a bug or am I missing something? Version: "bloom-filters": "^3.0.1"

import { CuckooFilter } from "bloom-filters";
const items = [
  "https://www.youtube.com/watch?v=HJjxN05ewEc",
  "https://www.youtube.com/watch?v=BZNUo7orS3k",
  "https://www.youtube.com/watch?v=SD-McWZz_pk",
  "https://www.youtube.com/watch?v=De4QjH9fpgo",
  "https://www.youtube.com/watch?v=Hzko-cjHhTg",
  "https://www.youtube.com/watch?v=vqR-8lgOmBE",
  "https://www.youtube.com/watch?v=j6u0LH67YLk",
  "https://www.youtube.com/watch?v=B2z8ikGLRh8",
  "https://www.youtube.com/watch?v=N3ftBeP16TA",
  "https://www.youtube.com/watch?v=38RBRPwODUk",
  "https://www.youtube.com/watch?v=Ry8nSUfX6fY",
  "https://www.youtube.com/watch?v=-KrYohUJvYw",
  "https://www.youtube.com/watch?v=zRpl7Pr0fs4",
  "https://www.youtube.com/watch?v=uYYiypp6WaY",
  "https://www.youtube.com/watch?v=EPap21FBGbE",
  "https://www.youtube.com/watch?v=zN2_0WC7UfU",
  "https://www.youtube.com/watch?v=DNVwnkgTzbY",
];

const errorRate = 0.04; // 4 % error rate
const filter = CuckooFilter.from(items, errorRate);
console.log(filter.has("hello")); // return false (ok)
console.log(filter.has("https://www.youtube.com/watch?v=HJjxN05ewEc")); // return true (ok)
console.log(filter.has("https://www.youtube.com/watch?v=38RBRPwODUk")); // return false (false negative)
console.log(filter.has("https://www.youtube.com/watch?v=-KrYohUJvYw")); // return false (false negative)
folkvir commented 6 months ago

Hi! Thank your for sharing this behavior. That is actually weird, it seems that with a lot of rounds (100 000) we get a 66% of false negative and without the prefix https://www.youtube.com/watch?v= we get 33%. When dropping the error rate to 3% there is now no error at all what should be expected!. 🙄 Increasing the error rate up to 10% does not increase the number of false negative it stays at 33% without the prefix and 66% with the prefix. So for now no explanation on this, we need to investigate more but @Callidon and me are very busy so it may take longer than expected.

For now; try to use 0.03 and to remove the prefix https://www.youtube.com/watch?v.

Reproductible code on runkit (https://npm.runkit.com/bloom-filters)

const { CuckooFilter } = require("bloom-filters");
let items = [
  "https://www.youtube.com/watch?v=HJjxN05ewEc",
  "https://www.youtube.com/watch?v=BZNUo7orS3k",
  "https://www.youtube.com/watch?v=SD-McWZz_pk",
  "https://www.youtube.com/watch?v=De4QjH9fpgo",
  "https://www.youtube.com/watch?v=Hzko-cjHhTg",
  "https://www.youtube.com/watch?v=vqR-8lgOmBE",
  "https://www.youtube.com/watch?v=j6u0LH67YLk",
  "https://www.youtube.com/watch?v=B2z8ikGLRh8",
  "https://www.youtube.com/watch?v=N3ftBeP16TA",
  "https://www.youtube.com/watch?v=38RBRPwODUk",
  "https://www.youtube.com/watch?v=Ry8nSUfX6fY",
  "https://www.youtube.com/watch?v=-KrYohUJvYw",
  "https://www.youtube.com/watch?v=zRpl7Pr0fs4",
  "https://www.youtube.com/watch?v=uYYiypp6WaY",
  "https://www.youtube.com/watch?v=EPap21FBGbE",
  "https://www.youtube.com/watch?v=zN2_0WC7UfU",
  "https://www.youtube.com/watch?v=DNVwnkgTzbY",
];

const without_prefix = false;
const prefix = "https://www.youtube.com/watch?v="
const errorRate = 0.04; // 4 % error rate

if (without_prefix) {
    items = items.map(e => e.replace(prefix, ""));
}

const round = 100000
let c_false = 0

const filter = CuckooFilter.from(items, errorRate);
for (let i=0; i<round; i++) {
    let val = filter.has(`${without_prefix ? '' : prefix}HJjxN05ewEc`);
    if (!val) {
        c_false++;
    }

    val = filter.has(`${without_prefix ? '' : prefix}38RBRPwODUk`);
    if (!val) {
        c_false++;
    }
    val = filter.has(`${without_prefix ? '' : prefix}-KrYohUJvYw`);
    if (!val) {
        c_false++;
    }
}

console.log('Number of false positive:', c_false)
console.log('Rate:', c_false / (round * 3))
tri2820 commented 6 months ago

Hi @folkvir, from the code it seems for each round we are just checking membership using '.has'.

We already know 2 out of 3 hard-coded items there would return false, so it would always return 66% no matter how many rounds we run.

folkvir commented 6 months ago

All the 3 items in my loop are part of the filter. The .has should return true and should return a 4% of false negatives. I didn't check it yet but it could be due to the capacity of the filter. At 95% full the cuckoo filter can give 10% of errors. [1]