ben-manes / caffeine

A high performance caching library for Java
Apache License 2.0

Very high false positive rate observed for BloomFilter implementation. #85

Closed · ashish0x90 closed this issue 8 years ago

ashish0x90 commented 8 years ago

I observed a very high false positive rate with the current implementation of BloomFilter, at times as high as 100%. My test code and results are given below. I also found a change that fixes it, although I'm not completely sure why it works. I thought maybe the bitmask method's output has some bias, but upon further inspection it seems alright. I hope I am using the APIs correctly and don't have some other bug in my test code. It would be great if someone could review and let me know. Thanks! P.S.: I understand that the BloomFilter code might be internal to caffeine, but I just want to highlight my observation.

Line: https://github.com/ben-manes/caffeine/blob/master/simulator/src/main/java/com/github/benmanes/caffeine/cache/simulator/admission/bloom/BloomFilter.java#L166

Current code:

static long bitmask(int hash) {
  return 1L << ((hash >>> 8) & INDEX_MASK);
}
Number of Insertions    False positives (%)      True positives
1024                    27 (2.636719%)           1024
4096                    640 (15.625000%)         4096
16384                   15213 (92.852783%)       16384
65536                   65536 (100.000000%)      65536
262144                  262144 (100.000000%)     262144
1048576                 1048576 (100.000000%)    1048576

New implementation:

static long bitmask(int hash) {
  return 1L << ((hash >>> 24) & INDEX_MASK);
}
Number of Insertions    False positives (%)      True positives
1024                    15 (1.464844%)           1024
4096                    96 (2.343750%)           4096
16384                   391 (2.386475%)          16384
65536                   1598 (2.438354%)         65536
262144                  6326 (2.413177%)         262144
1048576                 25600 (2.441406%)        1048576
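
For what it's worth, a guess at why the shift matters: if the word index within the bit table is derived from the low bits of the same hash, then bits 8-13 (what the old bitmask reads) partially overlap the index bits, so within any given word only a fraction of the 64 bit positions are reachable and those bits saturate quickly. Bits 24-29 don't overlap, so the bit position stays independent of the word. Below is a small self-contained demo of that effect; the table size and the way the word index is derived are my assumptions, not the actual caffeine code.

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical demo, not the caffeine implementation: counts how many
// distinct (word, bit) pairs each bitmask variant can reach when the word
// index is assumed to come from the low bits of the same 32-bit hash.
public class BitmaskBiasDemo {
  static final int INDEX_MASK = 63;   // selects a bit within a 64-bit word
  static final int TABLE_MASK = 1023; // assumed table of 1024 words

  public static void main(String[] args) {
    Random random = new Random(42);
    Set<Integer> shift8 = new HashSet<>();
    Set<Integer> shift24 = new HashSet<>();
    for (int i = 0; i < 1_000_000; i++) {
      int hash = random.nextInt();
      int word = hash & TABLE_MASK; // low bits pick the word
      shift8.add((word << 6) | ((hash >>> 8) & INDEX_MASK));
      shift24.add((word << 6) | ((hash >>> 24) & INDEX_MASK));
    }
    // Bits 8-9 are shared with the assumed word index, so the first variant
    // can only reach 16 of the 64 bit positions per word (~16K pairs), while
    // the second approaches the full 64K.
    System.out.println("distinct pairs with >>> 8:  " + shift8.size());
    System.out.println("distinct pairs with >>> 24: " + shift24.size());
  }
}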

Test method:

import java.util.Random;

import com.github.benmanes.caffeine.cache.simulator.admission.bloom.BloomFilter;

public void bloomFilterTest() {
    System.out.println("Number of Insertions\tFalse positives(%)\tTrue positives");
    for (int capacity = 2 << 10; capacity < 2 << 22; capacity = capacity << 2) {
        // Distinct random values, so the two halves never overlap
        long[] input = new Random().longs(capacity).distinct().toArray();
        BloomFilter bf = new BloomFilter(input.length / 2, new Random().nextInt());
        int truePositives = 0;
        int falsePositives = 0;
        int i = 0;
        // Add only the first half of the input array to the bloom filter
        for (; i < (input.length / 2); i++) {
            bf.put(input[i]);
        }
        // The first half must always be reported as present
        for (int k = 0; k < i; k++) {
            truePositives += bf.mightContain(input[k]) ? 1 : 0;
        }
        // The second half was never added, so any hit is a false positive
        for (; i < input.length; i++) {
            falsePositives += bf.mightContain(input[i]) ? 1 : 0;
        }
        System.out.format("%d\t\t%d(%f%%)\t\t%d\n",
            input.length / 2, falsePositives,
            ((float) falsePositives / (input.length / 2)) * 100, truePositives);
    }
}
ben-manes commented 7 years ago

That was my assumption, but I also didn't want to send an introduction email that bounced.

Likewise, I'm hoping that we don't need to run in split mode to take advantage of their ideas. Instead I think one side is the previous hit rate and the other side is the new adjustment. If the confidence function is positive then we grow, else we shrink, perhaps with a little jitter if it fails to adapt as the workload changes.

But I don't fully grok the Chernoff Bound either. Since the paper states it is very simple to code and is low cost, it sounds intriguing.

When I had initially proposed hill climbing, Gil recommended two ghost caches to determine which direction was favorable. That would have avoided the noise, and he hoped it would prove whether the idea worked. If so, then we'd have an optimal version to judge against as we tried to find more compact alternatives to the ghost caches. It's a good idea, but I didn't get around to trying it.
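
For concreteness, a minimal sketch of the climbing loop I have in mind; the names and constants are hypothetical (this is not code in caffeine), and the idea is simply to compare the hit rate of the latest sample against the previous one and keep stepping in the same direction while it improves.

final class HillClimber {
  private static final int SAMPLE_SIZE = 10_000; // made-up tuning constant

  private double previousHitRate;
  private int direction = 1; // +1 grows the window, -1 shrinks it
  private int hits;
  private int requests;

  /** Records an access; returns +1/-1 to resize the window, or 0 mid-sample. */
  int record(boolean hit) {
    if (hit) {
      hits++;
    }
    if (++requests < SAMPLE_SIZE) {
      return 0;
    }
    double hitRate = (double) hits / requests;
    if (hitRate < previousHitRate) {
      direction = -direction; // the last adjustment hurt, so reverse course
    }
    previousHitRate = hitRate;
    hits = 0;
    requests = 0;
    return direction;
  }
}

The jitter could be layered on top by occasionally randomizing the direction when successive samples show no improvement.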

Maaartinus commented 7 years ago

Thank you for the email (and I surely agree that asking is better than bouncing).

Likewise, I'm hoping that we don't need to run in split mode to take advantage of their ideas.

What you did is like splitting in time rather than splitting in space. One disadvantage is that the number of samples in the past part is fixed. Another disadvantage is that we can never say "smaller is better", as any gain may be due to a past change in the opposing direction. There are advantages too, and I hope the ideas from the paper may work somehow.

Chernoff Bound

The whole implementation is in Fig. 2. I rewrote it for myself as

scoreOfGreater += (isHit == isGreaterPart) ? +1 : -1; // reward or penalize the greater partition
++total;
// act only once the imbalance is too large to be random noise
if (scoreOfGreater * scoreOfGreater > someConstant * total) {
  param += (scoreOfGreater > 0) ? +stepSize : -stepSize;
  scoreOfGreater = total = 0;
}

It's trivial and nicely eliminates the sample size threshold (at the expense of someConstant). The test fires once |scoreOfGreater| exceeds sqrt(someConstant · total), i.e. once the score drifts further than random noise, which grows like sqrt(total), would explain.

two ghost caches

The split cache does about the same for free. It's not exactly the same, and it may introduce some bias due to how it gets split. As you wrote in the email, there may be a problem when one partition is much bigger than the other. There may be an even worse problem when one partition is much luckier than the other (a few frequently hit entries could possibly cause this). The ghost caches don't suffer from such problems.

A funny idea: inside the real cache, we can approximately simulate a slightly smaller cache by simply marking the entries which would get lost if the cache were smaller (I hope we can). Imagine one cache having the same LRU size and a slightly smaller LFU size, and a second cache which is smaller in the other way. We could simulate both caches and, by comparing their performances, determine which part should grow.

senderista commented 7 years ago

Hi, I'm barely familiar with the Caffeine code, but I noticed a couple of things in this fascinating discussion:

  1. Clearing the doorkeeper bloom filter all at once creates spurious "one-hit wonders", as you noticed. You could address this problem by making the aging process more gradual. There are a bunch of time-decaying bloom filter variants around, but perhaps the simplest aging bloom filter is this implementation: http://ieeexplore.ieee.org/abstract/document/5066970/ (I know, I know, I have a copy if you want it.) The algorithm in this paper is very simple (a Java sketch follows this list):

    if x in cache1:
        result = true
    else:
        if x in cache2:
            result = true
        else:
            result = false
        cache1.add(x)
        if cache1.full():
            cache2.flush()
            swap(cache1, cache2)
            cache1.add(x)
    return result
  2. Since you don't really care about exact frequency counts, I wonder if you need frequencies for all cache entries at all. What if you just maintained a top-k heavy-hitter structure (Space-Saving is probably the simplest) and only made frequency-based eviction decisions based on membership in this structure? (You have to choose k, of course, but that can be changed pretty easily on the fly with the Stream-Summary data structure, so you might be able to incorporate that parameter into an online optimization algorithm like a hill climber.) Anyway, as I said, I haven't studied the Caffeine code in much detail, so I apologize if these suggestions don't make sense in this context.
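
Here is the sketch promised in item 1: a toy version of the two-buffer aging scheme. It is hypothetical code that uses plain hash sets in place of real Bloom filters, so it demonstrates the aging mechanics rather than the space savings.

import java.util.HashSet;
import java.util.Set;

final class AgingMembershipFilter {
  private final int capacity; // insertions into the active buffer before aging
  private Set<Long> active = new HashSet<>();
  private Set<Long> aged = new HashSet<>();

  AgingMembershipFilter(int capacity) {
    this.capacity = capacity;
  }

  /** Returns whether the item was seen recently and records this sighting. */
  boolean contains(long item) {
    if (active.contains(item)) {
      return true;
    }
    boolean result = aged.contains(item);
    active.add(item);
    if (active.size() >= capacity) {
      // Age out: drop the previous generation, demote the current one to
      // history, and start a fresh active buffer containing this item.
      aged.clear();
      Set<Long> empty = aged;
      aged = active;
      active = empty;
      active.add(item);
    }
    return result;
  }
}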

ben-manes commented 7 years ago

Thanks for jumping in, Tobin!

  1. That's a good observation. It's not clear whether using two filters would keep the memory overhead the same as the single filter I experimented with, or double it. There would also be a little more overhead to insert into and query both. I also don't have a good equation for the potential space savings from the doorkeeper. It appeared that the larger the cache, the larger the savings factor could be made: in general it was a 2x reduction to the CMS, but 4x was okay for large traces, and perhaps more for even bigger caches. Ideally we'd have an equation to adjust by, so we'd know how much memory was saved.

  2. That may work, so it's worth exploring. The idea of TinyLFU is to estimate the frequency of all items in the data set, with aging, and thereby compare candidates so as to retain the top-k heavy hitters. I don't know whether some of the alternative stream-summary structures would have higher overhead; that Space-Saving implementation doesn't appear very compact.


So far the overhead of the CM Sketch hasn't been a complaint, so I'm wary of spending too much effort optimizing it. I'd be happy to take contributions, as it's a worthy endeavor, but energy might be better spent elsewhere. That's why I think an adaptive window and timer wheels are more impactful for a contributor to spearhead.

fyi, I put some of this discussion in a slide deck with other design overview details. Hopefully that complements the article and design doc.

senderista commented 7 years ago

The slides are a nice intro, thanks!

Something I forgot to mention is that you can incorporate exponential decay into the Space-Saving algorithm in a very natural way: https://pdfs.semanticscholar.org/8e44/278c1da454600e88be3065130fbac4360806.pdf

Our algorithm, a modified version of the "Space-Saving" algorithm, tracks a set of O(1) pairs of item names and counters, with the counters initialized to zero. For each item x_i in the stream, we see if there is currently an (item, counter) pair for that item. If so, we compute the quantity w_i exp(λt_i) and add it to the counter associated with x_i. Otherwise, we add the same quantity, w_i exp(λt_i), to the smallest counter (breaking ties arbitrarily) and set the item associated with that counter to x_i. Pseudo-code is in Figure 1. To find the heavy hitters, visit each item stored in the data structure, item[i], estimate its decayed weight at time t as exp(−λt) count[i], and output item[i] if this is above φD.

Input: item x_i, timestamp t_i, weight w_i, decay factor λ
Output: current estimate of the item's decayed weight

if ∃j : item[j] = x_i then
    j ← item⁻¹(x_i)
else
    j ← arg min_k count[k]
    item[j] ← x_i
count[j] ← count[j] + w_i · exp(λ · t_i)
return count[j] · exp(−λ · t_i)

The decay constant λ (half-life: ln(2)/λ) is another parameter to guess, of course, but maybe it could be learned online as well.
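
To make the update rule concrete, here is a tiny Java sketch of it. This is hypothetical code (not from the paper or from caffeine), and a real implementation would use the Stream-Summary structure rather than a linear scan to find the minimum counter.

import java.util.HashMap;
import java.util.Map;

final class DecayedSpaceSaving {
  private final Map<Long, Double> counters = new HashMap<>();
  private final int capacity;  // the k in top-k
  private final double lambda; // decay factor

  DecayedSpaceSaving(int capacity, double lambda) {
    this.capacity = capacity;
    this.lambda = lambda;
  }

  /** Records an occurrence and returns the decayed weight estimate. */
  double record(long item, double weight, double time) {
    // Counters store forward-decayed weights w * exp(lambda * t), so they
    // only ever grow; readers rescale by exp(-lambda * now) to compare.
    double increment = weight * Math.exp(lambda * time);
    Double count = counters.get(item);
    if (count == null) {
      count = 0.0;
      if (counters.size() >= capacity) {
        // Evict the smallest counter; the new item inherits its count.
        long victim = 0;
        double min = Double.MAX_VALUE;
        for (Map.Entry<Long, Double> entry : counters.entrySet()) {
          if (entry.getValue() < min) {
            min = entry.getValue();
            victim = entry.getKey();
          }
        }
        counters.remove(victim);
        count = min;
      }
    }
    counters.put(item, count + increment);
    return (count + increment) * Math.exp(-lambda * time);
  }
}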

ben-manes commented 7 years ago

FYI,

Gil committed an adaptive sketch, AdaptiveResetCountMin4. I haven't experimented with it much, but Gil's early evaluation shows good results.

ben-manes commented 7 years ago

I prototyped a simple hierarchical timer wheel. This needs a lot of work regarding testing, optimizations, and integration into the cache. But I thought it was a good starting point for us to iterate on.
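
To give a flavor of the structure for anyone following along, here is a toy two-level wheel. It is a hypothetical sketch, far simpler than the prototype: a real implementation would use more levels and spans, and intrusive linked lists for the buckets.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ToyTimerWheel {
  private static final int WHEEL_SIZE = 64;

  private static final class Timer {
    final Runnable task;
    final long deadline;
    Timer(Runnable task, long deadline) {
      this.task = task;
      this.deadline = deadline;
    }
  }

  private final List<Deque<Timer>> fine = newWheel();   // one tick per slot
  private final List<Deque<Timer>> coarse = newWheel(); // WHEEL_SIZE ticks per slot
  private long currentTick;

  private static List<Deque<Timer>> newWheel() {
    List<Deque<Timer>> wheel = new ArrayList<>(WHEEL_SIZE);
    for (int i = 0; i < WHEEL_SIZE; i++) {
      wheel.add(new ArrayDeque<>());
    }
    return wheel;
  }

  /** Schedules a task to run after the given delay (assumes delayTicks >= 1). */
  void schedule(Runnable task, long delayTicks) {
    add(new Timer(task, currentTick + delayTicks));
  }

  private void add(Timer timer) {
    if (timer.deadline - currentTick < WHEEL_SIZE) {
      fine.get((int) (timer.deadline % WHEEL_SIZE)).add(timer);
    } else {
      coarse.get((int) ((timer.deadline / WHEEL_SIZE) % WHEEL_SIZE)).add(timer);
    }
  }

  /** Advances the clock by one tick, cascading and firing due timers. */
  void advance() {
    currentTick++;
    if (currentTick % WHEEL_SIZE == 0) {
      // The fine wheel wrapped around: pull the next coarse bucket down.
      Deque<Timer> bucket = coarse.get((int) ((currentTick / WHEEL_SIZE) % WHEEL_SIZE));
      List<Timer> pending = new ArrayList<>(bucket);
      bucket.clear();
      for (Timer timer : pending) {
        add(timer);
      }
    }
    Deque<Timer> due = fine.get((int) (currentTick % WHEEL_SIZE));
    while (!due.isEmpty()) {
      due.poll().task.run();
    }
  }
}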

There is also a paper by the Technion on replacing the sketch with a table-based approach (e.g. a cuckoo filter). That replaces the reset interval with a random replacement when the table exceeds a threshold. See TinyCacheSketch for the implementation they added and evaluated against (by setting `sketch = tiny-table` in the configuration file). Since it's not published yet I shouldn't link it, but send me an email if you would like to read their analysis.

They are also working on the adaptive sketch and hope to have a paper ready in May. I haven't spent much time on it, but they're getting promising results so far.