ben-manes / caffeine

A high performance caching library for Java

Very high false positive rate observed for BloomFilter implementation. #85

Closed: ashish0x90 closed this issue 8 years ago

ashish0x90 commented 8 years ago

I observed a very high false positive rate with the current implementation of BloomFilter, at times as high as 100%. My test code and results are given below. I also found what can be changed to fix it, although I'm not completely sure why my fix works. I thought maybe the bitmask method's output had some bias, but upon further inspection it seems alright. I hope I am using the APIs correctly and don't have some other bug in my test code. It would be great if someone could review and let me know, thanks! P.S.: I understand that the BloomFilter code might be internal to Caffeine, but I just want to highlight my observation.

Line: https://github.com/ben-manes/caffeine/blob/master/simulator/src/main/java/com/github/benmanes/caffeine/cache/simulator/admission/bloom/BloomFilter.java#L166

Current code:

static long bitmask(int hash) {
  return 1L << ((hash >>> 8) & INDEX_MASK);
}

Number of Insertions    False positives (%)      True positives
1024                    27 (2.636719%)           1024
4096                    640 (15.625000%)         4096
16384                   15213 (92.852783%)       16384
65536                   65536 (100.000000%)      65536
262144                  262144 (100.000000%)     262144
1048576                 1048576 (100.000000%)    1048576

New implementation:

static long bitmask(int hash) {
  return 1L << ((hash >>> 24) & INDEX_MASK);
}

Number of Insertions    False positives (%)      True positives
1024                    15 (1.464844%)           1024
4096                    96 (2.343750%)           4096
16384                   391 (2.386475%)          16384
65536                   1598 (2.438354%)         65536
262144                  6326 (2.413177%)         262144
1048576                 25600 (2.441406%)        1048576

Test method:

    // Requires java.util.Random and the simulator's
    // com.github.benmanes.caffeine.cache.simulator.admission.bloom.BloomFilter.
    public void bloomFilterTest() {
        System.out.println("Number of Insertions\tFalse positives(%)\tTrue positives");
        for (int capacity = 2 << 10; capacity < 2 << 22; capacity = capacity << 2) {
            long[] input = new Random().longs(capacity).distinct().toArray();
            BloomFilter bf = new BloomFilter(input.length / 2, new Random().nextInt());
            int truePositives = 0;
            int falsePositives = 0;
            int i = 0;
            // Add only first half of input array to bloom filter
            for (; i < (input.length / 2); i++) {
                bf.put(input[i]);
            }
            // First half should be part of the bloom filter
            for (int k = 0; k < i; k++) {
                truePositives += bf.mightContain(input[k]) ? 1 : 0;
            }
            // Second half shouldn't be part of the bloom filter
            for (; i < input.length; i++) {
                falsePositives += bf.mightContain(input[i]) ? 1 : 0;
            }
            System.out.format("%d\t\t%d(%f%%)\t\t%d\n",
                input.length / 2, falsePositives,
                ((float) falsePositives / (input.length / 2)) * 100, truePositives);
        }
    }
ben-manes commented 8 years ago

Thanks! I'm pleasantly surprised that you played with the simulator and its BloomFilter. Since that was for policy experiments, I only eyeballed it to see if the traces were similar when moving from Guava's. It's not a perfect filter because it's hard-coded at only 4 hashes, to avoid another computation for the CountMin sketch. I didn't see this because the adaptive policy's usage is a small BF that is rarely populated and cleared regularly.

Thankfully I haven't brought this into Caffeine's caches yet, where unit tests would be required and this is a really good test and fix. I wonder if there are test cases we can borrow from other bloom filter implementations as well.

I was trying to use the offset to reuse the hash, which is also masked to find the array index and shifted to find the bit index. I had thought that the overlap wouldn't have a big effect because the array would be large, the bit index is a mod using 7 bits, and a shift would result in different results. It makes sense that the quick hash functions might result in more collisions in the middle bits, though I really hadn't expected it to be as dramatic.

I'm really glad you went ahead, experimented, and fixed it. I think using the higher order bits was a good idea in general, too.

ben-manes commented 8 years ago

I ran your test as a comparison with stream-lib and Guava. It shows a nice trend: after your fix the implementation is comparable. Guava follows the theoretical design without any optimizations, so it stays very close to a 3% FPP by using 5 murmur hashes, and uses minimal space. Stream-lib is a little more lax, using 4 murmur hashes with slightly more space. Caffeine's uses 4 balanced hashes with higher space due to being power-of-two sized.

Across multiple runs the numbers vary, but all stay close to the desired FPP rate. I couldn't find any exhaustive test suites using different distributions, e.g. to try to detect a weakness in the hash functions. After your fix I think this is fairly robust.

A nice aspect of how Caffeine will use the bloom filter is to reduce space. The power-of-two sizing might make the bloom filter larger, but as a "doorkeeper" it will reduce the CountMinSketch by probably 50% (used to ignore one-hit wonders). This also means a higher FPP would be okay because we're merely trying to compactly query if one entry is more frequent than another. Most often that compares a hot vs. cold entry, which have a wide gap in their frequency counts.

(s) = stream-lib, (c) = Caffeine, (g) = Guava

Number of Insertions    False positives(%)  True positives
(s) 1024        22(2.148438%)       1024
(c) 1024        21(2.050781%)       1024
(g) 1024        26(2.539063%)       1024

(s) 4096        102(2.490234%)      4096
(c) 4096        95(2.319336%)       4096
(g) 4096        122(2.978516%)      4096

(s) 16384       403(2.459717%)      16384
(c) 16384       397(2.423096%)      16384
(g) 16384       489(2.984619%)      16384

(s) 65536       1581(2.412415%)     65536
(c) 65536       1612(2.459717%)     65536
(g) 65536       2019(3.080750%)     65536

(s) 262144      6212(2.369690%)     262144
(c) 262144      6294(2.400970%)     262144
(g) 262144      7818(2.982330%)     262144

(s) 1048576     25130(2.396584%)    1048576
(c) 1048576     25503(2.432156%)    1048576
(g) 1048576     31529(3.006840%)    1048576
Maaartinus commented 8 years ago

I had thought that the overlap wouldn't have a big effect because the array would be large, the bit index is a mod using 7 bits, and a shift would result in different results.

IIUIC you were using the same bits for selecting a word and for selecting the bit, i.e., using only one bit per word: https://github.com/ben-manes/caffeine/blob/master/simulator/src/main/java/com/github/benmanes/caffeine/cache/simulator/admission/bloom/BloomFilter.java#L97

Using the highest 8 bits works well as long as table.length <= 1<<24. Bigger tables are no better than a table with 2**32 bits and should never be created. If you really needed them, then you can't use an int item.
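
A minimal sketch of the overlap (the names are illustrative, not the exact fields of the simulator's BloomFilter):

static final int INDEX_MASK = Long.SIZE - 1;     // 63, selects one of the 64 bits in a long

static int tableIndex(int hash, int tableMask) {
  return hash & tableMask;                       // low bits of the hash pick the word
}

static long bitmask(int hash) {
  return 1L << ((hash >>> 8) & INDEX_MASK);      // bits 8..13 of the same hash pick the bit
}

With tableMask == 0xFFFF, tableIndex == 0 forces the low 16 bits of hash to zero, so (hash >>> 8) & 63 is zero as well and only bit 0 of table[0] is reachable. Shifting by 24 instead uses bits above any reasonably sized tableMask.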


In constructor, you do checkArgument(randomSeed != 0), but a value of 1<<31 is a disaster, too. I'd either require or enforce randomSeed to be odd.


In ensureCapacity you replace the table in case it needs to grow. This leads to false negatives, which is pretty bad as the method is public and can be called at any time by anyone. I'd make it private, as resizing Bloom filters is not a good idea (it can be done, but costs more than an initially properly sized filter).


The power-of-two sizing might make the bloom filter larger

Sure, larger, faster and also better.

I think that power-of-two sizing is nearly always a good idea, as all you lose is the ability to fine-tune the FPP. And I can hardly imagine a case where the exact FPP matters. Spending a bit more memory and getting a bit better rate, or the other way round, is nearly always fine.

ben-manes commented 8 years ago

I was using the same bits, but masking based on a different byte. You're right that using the highest byte is better by avoiding any overlap unless the table is unreasonably large.

a value of 1<<31 is a disaster, too

Oh that's a good point!

the method is public and can be called anytime by anyone

Note that this is a utility for the simulator when experimenting with policies. When brought into Caffeine the class would be package-private and resized when the cache is (a rare event). As its usage in TinyLFU would clear the BF regularly and false negatives are less impactful (temporarily mispredicting "hotness"), it should be okay.

Spending a bit more memory and getting a bit better rate or the other way round is nearly always fine.

Yep, 100% agree.

ashish0x90 commented 8 years ago

Hi Ben, thanks for the prompt reply full of interesting details :). Also thanks for running comparison tests with the other alternatives and sharing the results. Glad to see the new numbers. Yup, you're right, it seems the shared hash was indeed the issue.

I also agree with Maaartinus; IMHO it's better to keep the BloomFilter capacity immutable. If we need a bigger BloomFilter later, we can easily create a new one with the new capacity instead of resizing the existing one (we lose the existing entries anyway as part of the resize process), unless there is a good performance reason I might be missing here? Anyhow, if this becomes package-private then it doesn't really matter.

ben-manes commented 8 years ago

I agree with both of you on disallowing a resize for a public API, and I'll make it private in the simulator too. Since I don't view this as a public API, I am a little more forgiving.

Since the simulator is similar to test code I unfortunately do not always put in enough effort to verify the utilities. I would have if I were deciding to promote an idea into the cache, and hopefully would have found this issue. I'm very happy about your input because it avoids mistakes in experiments; I might have let this oversight slip through, and anyone can take the code out of the project. I'm not sure why you decided to play with it, but I'm glad you did.

When brought into Caffeine's cache the reason for a resize vs new instance is subtle. There are a lot of possible cache configurations, so the variations are code generated to minimize space. The expected insertions might change when either the cache is manually resized (roughly never) or when using a weighted maximum size. In that case the number of entries might vary widely, so ensuring the capacity is necessary. Performing it as a re-initialization vs. new instance is about code organization and ease of use. It definitely is not meant to imply that it implements a Scalable BloomFilter.

Maaartinus commented 8 years ago

I was using the same bits, but masking based on a different byte.

With tableMask == 2**16-1, I can't see how you could access, e.g., bit 1 of table[0]. From index == 0 it follows that the two least significant bytes of hash are zero, and therefore so are the low bits of hash >>> 8.

AFAIK it could work well only for table.length <= 256, and it started to lose with every additional bit up to table.length <= 256*64, so that there were only 256*64 bits in use for such table sizes.


You can drop INDEX_MASK as Java guarantees the shift distance is taken mod 64.
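
For example (JLS 15.19: for a long shift only the six low-order bits of the distance are used):

static long masked(int hash)   { return 1L << (hash & 63); }  // explicit mask
static long unmasked(int hash) { return 1L << hash; }         // identical result, the & 63 is implicit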


Note that this is a utility for the simulator when experimenting with policies.

What about calling the method ensureCapacityAndMaybeClear? It should do exactly what you want, but with such a name it's perfectly safe even when taken out of context by someone else 100 years later.

Performing it as a re-initialization vs. new instance is about code organization and ease of use.

I guess you're clearing the filter from time to time anyway, so clearing it when growing does no harm (especially as it never shrinks).


I wonder if you could do better by using no intermediate int. My point is that collisions with 2**32 bits start already at about 2**16 entries. Sure, you need many collisions before you get problems.

ben-manes commented 8 years ago

You're right. For the tests I was performing the maximum entries in the bloom filter was 256 which led to a very small table. This was to explore adapting the W-TinyLFU window size. I very clearly goofed in my train of thought for how to pick the bit index. 👍

You can drop INDEX_MASK as Java guarantees the shift distance is taken mod 64.

I'm not surprised that the JVM optimizes the modulus, but I do like the clarity by not making that assumption in bit manipulation code.

What about calling the method ensureCapacityAndMaybeClear?

I tweaked the JavaDoc and reduced the visibility, so hopefully that's good enough.

I wonder if you could do better when using no intermediate int.

It would be interesting to hear more of your thoughts here.


I integrated all of the suggestions and adapted @ashish0x90's method into a unit test with a pretty-printed table. It's now possible to plug other membership filters into the simulator, e.g. a cuckoo filter.

Don't stop discussing due to the automatic closing of the ticket :-)

ben-manes commented 8 years ago

Trying to think through my original logic, what I should have done is shift when calculating the table index. That is a little more natural if one thinks of the filter as a flat plane of bits: when grouped into an array, the higher bits of a bit index represent the word index and the lower bits the position within that word. The shifts and masks would have been optimized versions of that math. That would have had the same effect as the shift by 24, but it would have been more obvious why the shift is taken.
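
A sketch of that view (illustrative, not the simulator's code):

// Treat the filter as one flat run of bits, then split the bit index into a
// word index (the high bits) and a position inside that word (the low 6 bits).
static void setBit(long[] table, int bitIndex) {
  int wordIndex = bitIndex >>> 6;     // bitIndex / 64
  int bitOffset = bitIndex & 63;      // bitIndex % 64
  table[wordIndex] |= (1L << bitOffset);
}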

Maaartinus commented 8 years ago

I'm not surprised that the JVM optimizes the modulus, but I do like the clarity by not making that assumption in bit manipulation code.

It's not an optimization. The behavior is defined in the JLS and luckily it's defined in exactly the way most CPUs (including i386) do it.

So I'd call the shorter version clearer (as any additional operation can make someone ask WTF?). Unfortunately, unlike Javadoc, the JLS is more than just one click away.


It would be interesting to hear more of your thoughts here.

I've tried it and got no improvement (tried up to 2**24). The measurement may be influenced by the poor quality of Random, so I've tried SecureRandom, too.

I'm still on Java 7, so I can't directly use your code (and what I did is not worth posting).


Trying to think through my original logic, what I should have done is shifted when calculating the table index.

Sure, just as if we could index bits directly. In some over-optimized code I did

return (array[hash >>> distance] << hash) < 0;

which uses the 32-distance MS bits for indexing and the six LS bits for bit position (where bit zero is the sign bit), so that one operation is saved.

ben-manes commented 8 years ago

Over-optimized code for data structures is sometimes a good thing, so feel free to send a PR or a gist of anything you've done.

I don't think I grok your last bit of code (<< hash), but I agree with the statements. I think since the array is a power-of-two the tableMask could be inlined into the shift, which is probably what you're getting at.

Maaartinus commented 8 years ago

It looks like your table gets problems at 2**27 and 2**28 elements with 5% and 19% false positives. The cause seems to be the reduction to int causing collisions. At 2**26 elements there's no problem yet (3.7%).

Over optimized code for data structures is sometimes a good thing

When I get some good result, I will post it all. I'm just fooling around with setting two bits in each long, i.e., doing two steps in one for the price of violating the independent hashes rule. It's like forcing the second index to be equal to the first one in the original algorithm. It looks like this

public void put(long e) {
    e = spread(e);
    setTwo(e);
    e = respread(e);
    setTwo(e);
}

private void setTwo(long e) {
    table[index(e)] |= firstBitmask(e) | secondBitmask(e);
}

private int index(long e) {
    return (int) (e >>> tableShift);
}

private long firstBitmask(long e) {
    return 1L << e;
}

private long secondBitmask(long e) {
    return 1L << (e >> BITS_PER_LONG_SHIFT);
}

I don't think I grok your last bit of code

It's simple: instead of masking the bit in place (and constructing the mask using a shift) with

 x & (1L << hash)

I shift it and use a fixed mask like

(x >>> hash) & 1L

This returns a different value, but it's equivalent as all we care about is equality to zero. As the direction does not matter, I do

(x << hash) & Integer.MIN_VALUE

instead and then I leave out the masking as the bit to be tested is the sign bit. You might want to look at some usage, it's two lines only.
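
Spelled out as standalone checks (x is one table word; the three forms test different bit positions, but any one of them works as long as put uses the matching form):

static boolean maskInPlace(long x, int hash)   { return (x & (1L << hash)) != 0; }  // build the mask in place
static boolean shiftThenMask(long x, int hash) { return ((x >>> hash) & 1L) != 0; } // shift the word, fixed mask
static boolean signBitTrick(long x, int hash)  { return (x << hash) < 0; }          // shift the chosen bit into the sign bit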

ben-manes commented 8 years ago

That does sound neat. I checked in my minor optimizations and I'll try to digest your insights. :-)

Maaartinus commented 8 years ago

Nice that you're interested. I've just added the reverse masking idea. I haven't profiled it, but I guess it's pretty fast due to reduced memory accesses. Instead of talking about it, I posted it. It's copy-pasted ugly temporary crap, but it illustrates the idea.

In BloomTest (my MembershipTest), I avoided storing the whole array as it caused swapping for the big capacities I tested. The other files aren't worth looking at.

ben-manes commented 8 years ago

I already found it and am playing now ;-)

Before I saw it, I tried switching to a LongStream and a sequential input. In that case it degraded at 2^29, while Guava did not. When I found and adapted your code, it degrades at 2^28 like you said. I think at 100s of millions of entries that's pretty reasonable given the potential usage for the cache.

I haven't played with your variants yet, but I'm looking forward to it.

ben-manes commented 8 years ago

One implementation quirk to remember is that I want to reuse the hash computations. The reason is that I expect to have 3 sketches in the cache serving different purposes, all hashed once per access. That's applied in batches asynchronously (draining buffers), but it still seems nicer to reuse the hashes than to have different ones per sketch.

The usages will be,

This is why the four hashes were used (rather than finding the theoretical optimum). It looks like your improvements in variant 3 don't degrade at 2^28, but might not be translatable into reusable hashes. Though I'm not sure whether to favor the reuse between the bloom filter and CountMin or the improved quality of the filter.

ben-manes commented 8 years ago

Oh shoot. Another issue is that you're taking advantage of the simulator providing long keys. In the cache we'll only have access to an integer (Object#hashCode) which degrades the quality of the hashing that we can leverage.

ben-manes commented 8 years ago

So I ran all three variants (Caffeine, BloomFilter3, Guava) and it's consistent with a 32-bit element (Long.hashCode(e)). All degrade at 2^28 with a similar false positive percent. Sorry that I didn't mention that constraint earlier, which is what you were trying to work around with improved hashing. It pretty much ties our hands in how far we can push this.

Maaartinus commented 8 years ago

One implementation quirk to remember is that I want to reuse the hash computations.

I avoided the 32 bit intermediate in order to avoid collisions. All other changes were mainly to show some alternatives.

If you can reuse the computed value, then you can pass the hash or some preprocessed value (like smear(e)) as the argument to put and mightContain.

If you can't reuse the computed value, then I'd probably go for different hashes, so that collisions in one place don't correlate with collisions elsewhere.

It looks like your improvements to in variant 3 don't degrade at 2^28, but might not be translatable into reusable hashes.

The degradation is caused by using an int. With 2**28 inputs, there are tons of them mapped onto the same 32-bit quantity. Whenever an absent value collides with a present one, you get a false positive. I'd bet you can keep your hashes when you extend them to long.

Note that Guava uses 128 bits of the input, I use 64 and you use just 32.


Oh shoot. Another issue is that you're taking advantage of the simulator providing long keys. In the cache we'll only have access to an integer (Object#hashCode) which degrades the quality of the hashing that we can leverage.

Then there's nothing we can do, as the collisions are already there. As a typical cache doesn't have many millions of entries, it's fine. If you wanted to support huge caches, you'd probably need some 64-bit hash (using a Funnel or some user-supplied longHash). And this not only for the Bloom filter but everywhere.

So running all three variants (Caffeine, BloomFilter3, Guava) and its consistent at a 32-bit element (Long.hashCode(e)). All degrade at 2^28 with a similar false positive percent.

GIGO. No surprise here.


To summarize:

ben-manes commented 8 years ago
Maaartinus commented 8 years ago

It would require jigging the frequency sketch to be similar

I don't think so. The implementation of the BloomFilter does not matter as long as it keeps the promised FPP. Concerning CountMin4, a similar optimization can be done in a straightforward way, but it may work less well as there are only 16 counters per word instead of 64 bits. So constraining two operations to a single word may be too much.

I'm just thinking about using 8-bit counters. Using a byte[], they're surely faster and simpler; the problem is how to amortize the doubled memory cost (or how to use half the number of counters).

Can you show me some stats concerning what parts are most time-consuming?

ben-manes commented 8 years ago

The bloom filter as a way to reduce counters ("doorkeeper") would halve the number of counters. Since there would be less noise from one-hit wonders the use of 8-bit counters might be okay because the higher collision rate would be offset by the larger frequencies and fewer counters needed.

I have a very simple JMH benchmark. At the time it appeared that multiplication and memory access were the biggest costs, which was a natural expectation. Since the operations are batched and we can probably assume a Zipf access distribution, the memory accesses might be cached effectively (assuming surrounding code didn't flush it).

Maaartinus commented 8 years ago

I'm going to switch to Java 8 soon, so I'll write more when it's done.

I see that my question was unclear, so let me retry: When using the cache with some real data, what percentage of the runtime do the main parts need? I mean, I could imagine making CountMin4 faster, but it doesn't matter when it's already way faster than the other parts.

ben-manes commented 8 years ago

That's a tough question. I have JMH benchmarks and CacheProfiler as a hook. Unfortunately this can be hard to profile and get meaningful results due to how fast the methods are. The best would be to use JMH with perfasm, but I don't have a Linux machine (cannot be used in a VM).

Due to the eviction policy work being performed in batch asynchronously, it shouldn't penalize the response times. In that case adding an element to a sketch is the more common operation and likely the largest penalty. As with most caches evictions are the slowest operation due to hash table writes, etc.

From a concurrency perspective the buffers are a synchronization point. The read buffer is an mpsc ring buffer and it's hard to tell if the padding is beneficial. When full, additions are dropped, so a slower drain might be incorrectly viewed as a gain. The write buffer is bounded, so a benchmark quickly exhausts it and forces writers to wait until free space is available. That means the drain throttles writes, but most likely this would occur only in a synthetic test.

My view is that once the single/concurrent throughput is high then the cache shouldn't be a hotspot. A further gain to that throughput is great, but likely has an insignificant impact. Then the goal is to reduce latencies, how widely it varies, and hiccups. Here the buffers serve to try to avoid that on user-facing threads and penalize a background thread (assuming the norm of spare cpu cycles). This is harder to measure and falls into the complexity that Gil Tene talks about. And of course higher hit rates are a huge latency win, so that's critical.

From afar I would expect the biggest gains would come from improving the read buffer and the sketch write time. Sometimes I wonder if it would be worthwhile draining into a multiset so that the sketch is given an increment count (less hashing, better cpu efficiency). Probably a run length encoding scheme instead of a true multiset would work well.

Maaartinus commented 8 years ago

Wow! That's damn complicated. Given that, I'd stay away from all tools until I'd find some tiny piece to be optimized. I'd write and measure some real-like code instead: multiple threads, each running some known access trace in a long loop, measure the work done in a few minutes. This has tons of problems, but so do profiling and microbenchmarking.

In order to find out how much some part takes, I often use a trick of doing the part twice. I guess this could work for the sketch write time (write to two sketches, read the first and blackhole the second; or read from one randomly chosen), but not for the read buffers. The results may or may not be realistic, e.g., because of a single buffer just fitting in the CPU cache (and two buffers causing many misses).

Of the problems you mentioned, it could help with

the sketch write time

Here, the doorkeeper obviously helps somehow. I guess that conservativeIncrement could make it faster (not slower) when used with an optimized version of incrementAt: when min < 15, a simple addition works (no overflow possible). When min == 15 and all values were right-shifted, the situation is the same. When min == 15 and nothing was right-shifted, just do nothing (already saturated). When only some values were right-shifted (IncrementalResetCountMin4), you're out of luck.

Now, some dirty tricks gaining speed for the cost of a reduced hash independence come in my mind.

Sometimes I wonder if it would be worthwhile draining into a multiset so that the sketch is given an increment count (less hashing, better cpu efficiency).

Could you elaborate?

ben-manes commented 8 years ago

I often use a scrambled Zipf because it best mimics the real world (a few hot entries). That helps create contention and lets us leverage the CPU's cache, whereas a uniform distribution wouldn't. Using a micro-benchmark tool like JMH is really helpful, but overuse isn't if you measure the wrong things. I have a few that I've tried to be careful about, which I think are the most realistic.

hard to tell if the padding is beneficial (just switch it off and re-run)

The padding is tricky because it depends on the machine. A consumer laptop would have a single socket with a moderate cpu cache. A server might have multiple sockets with a large cache and we hit NUMA effects. The last time I tried turning it off there was a benefit for my laptop and a penalty for a 16 core cloud machine. I think I also tried the 32-core (2 socket) at one point. But my benchmarks could have been flawed or another mistake, so I don't know if that's true. I'm also unsure about if NUMA and false sharing interacts differently than on a single socket.

Here, the doorkeeper obviously helps somehow...

From my early analysis it looks like the doorkeeper might make sense only for moderate to large caches. For small ones it can have a small negative effect on the hit rate and the space savings is tiny (e.g. 128 longs). For a large cache it has no negatives and has a 25-50% savings. The number of counters needed doesn't grow linearly, but I'm not sure what the factor is. We'd probably need to plot a couple of traces to figure out a threshold and slope.

(misc)

Another aspect I forgot to mention was GC friendliness. A read produces no garbage, but a write produces a little that should reside only in young gen. This would be the task (AddTask and RemovalTask), the hash table entry, and any lambdas. I didn't see a gain when I tried to remove the AddTask (inlined onto the Node) and removing the lambdas would require forking the hash table which I am not favorable towards. I think this is okay, but wanted to mention GC as an important factor to watch.

Could you elaborate? (micro-batching)

In an LRU cache a read is performed as the steps of finding the entry in the hash table and then reordering it on the LRU list. The reordering is a mutation, so despite being fast a lock is required and that causes contention. Instead we store the event into a lossy buffer and when full the events are replayed on the LRU. This avoids lock contention as readers/writers don't have to wait, the policy stays non-concurrent, and we might get CPU efficiencies by batching the work. That made it easy to switch from LRU to TinyLFU by enhancing the policy.

Each event is replayed in isolation: poll the event, add to the sketch, reorder. If we assume a Zipf distribution then many events are for the same entry. For those hot entries the sketch's counters would quickly reach their maximum and the LRU reordering wouldn't really do anything. While the reordering is cheap, the sketch update is more costly due to hashing and scattered array reads. It's still fast, but might be in vain.

The ideal micro-batching would be to drain the buffers into a multiset (entry -> count) and then replay each entry once. This would improve the I$ by shortening the loop code. It would also let us hash and update the sketch once per entry, thereby avoiding wasteful work. But something like Guava's HashMultiset isn't a gain because we're hashing again each time.

The idea is a middle ground by using a run-length encoded batch (e.g. A3, B1, A5). When adding we check elements[i] == e and respond accordingly. I pushed a branch of some work in progress on this idea.
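
Roughly, the buffer-side idea looks like this (an illustrative sketch with made-up sizes, not the pushed branch):

// Store (element, count) pairs and extend the last run when the same element repeats.
final class RleBuffer {
  private final long[] elements = new long[128];
  private final int[] counts = new int[128];
  private int size;

  /** Returns false when the buffer is full and should be drained. */
  boolean offer(long e) {
    if (size > 0 && elements[size - 1] == e) {
      counts[size - 1]++;               // extend the current run, e.g. A3 -> A4
      return true;
    }
    if (size == elements.length) {
      return false;
    }
    elements[size] = e;
    counts[size] = 1;
    size++;
    return true;
  }
}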

Maaartinus commented 8 years ago

like the doorkeeper might make sense to have only

I meant w.r.t. the speed. It should be way faster than CountMin4 and save it some work.

For small ones it can have a small negative effect to the hit rate

Any idea why?

forking the hash table which I am not favorable

I see, it's a huge monster. An advantage of forking would be the access to HashEntry.hash, which would save us hashing and spreading.

If we assume a Zipf distribution then many events are for the same entry.

Yes, but not necessarily following each other.

I pushed a branch

Now it's clear. I'm a bit sceptical about the efficiency of RLE as the entries may be interleaved. I was thinking about a middle ground between RLE and a Multiset and pushed something. It got far too complicated compared to what I wanted.

In case interleaving is a problem, there's a simpler way: Use RLE which compares against a few most recent entries.

Anyway, I'm afraid that managing any additional data structure eats most of the savings it provides. I'd integrate the micro-batching into CountMin4 itself. In case of the RLE it's pretty straightforward: Just add two fields: lastEntry and count and... you know.

This data structure mangling is ugly, but it saves a lot of overhead. A much nicer and maybe equally fast solution would be a BatchingFrequency implements Frequency responsible for the batching and forwarding to another Frequency. I was curious and wrote it and it's surprisingly trivial. It's actually RLE, but on the consumer side, so no arrays are needed.
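
A rough sketch of what I mean (the Frequency interface below is an illustrative stand-in, not necessarily the simulator's):

interface Frequency {
  void increment(long key, int count);
  int frequency(long key);
}

// Collapse consecutive identical keys into a single delegated increment.
final class BatchingFrequency implements Frequency {
  private final Frequency delegate;
  private long lastKey;
  private int pending;

  BatchingFrequency(Frequency delegate) { this.delegate = delegate; }

  @Override public void increment(long key, int count) {
    if (pending > 0 && key == lastKey) {
      pending += count;          // same key as the previous event: extend the run
    } else {
      flush();                   // a different key: emit the previous run
      lastKey = key;
      pending = count;
    }
  }

  @Override public int frequency(long key) {
    flush();                     // make pending increments visible before reading
    return delegate.frequency(key);
  }

  private void flush() {
    if (pending > 0) {
      delegate.increment(lastKey, pending);
      pending = 0;
    }
  }
}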

I'm curious about the typical size of the ReadBuffer?


Here you probably want >=.

ben-manes commented 8 years ago

Any idea why?

I'm not sure. It probably has something to do with entries arriving near the end of the sample period. Since the reset clears the filter, those entries don't get a +1 boost for as long as early arrivals do. Since this was only for small caches, it makes sense that the impact would be more pronounced. For a large cache where frequency dominates it wouldn't be an issue.

I see, it's a huge monster. An advantage of forking would be the access to HashEntry.hash, which would save us hashing and spreading.

There are definitely efficiency gains at a high complexity cost. That's a lot to accept when starting from scratch and falls into premature optimization. We could justify it late in development as an experimental spike, once all the other complexities were resolved. Even then it's a big hurdle, and the benefits could be difficult to see in practice.

I think VarHandles and Value Types are blockers. They have a big impact on the hash table and it would be a lot of work to port over changes. I'm really keen on seeing how value types evolve, but JDK 10 is a long way off.

It's actually RLE, but on the consumer side, so no arrays are needed.

Oh that's a much better idea. We should probably integrate that into the CountMin4 so that reset handles it properly.

Here you probably want >=.

Thanks! :-)

I'm curious about the typical size of the ReadBuffer?

Currently it is sized at 16 elements per ring buffer. I use a variant of j.u.c.Striped64 to dynamically grow a table of them when contention is detected. For a cache with little concurrency there is low overhead, and if there is a lot then it scales to meet the demand (up to 4x the number of CPUs). If we had a 16-core machine then that is a maximum of 1,024 elements. Since each element is padded to take 64 bytes (1 cache line), it would cost 64 KB. Another way of thinking about it is 1 KB per ring buffer, with up to 4x NCPUs based on contention.
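
The arithmetic behind those numbers, as a sketch:

static int readBufferBytes(int cpus) {
  int ringBuffers = 4 * cpus;        // up to 4 ring buffers per CPU
  int elements = 16 * ringBuffers;   // 16 padded slots per ring buffer
  return 64 * elements;              // 64 bytes (one cache line) per slot
}
// readBufferBytes(16) == 65_536 bytes, i.e. 64 KB, or 1 KB per ring buffer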

ashish0x90 commented 8 years ago

The padding is tricky because it depends on the machine.

@ben-manes Assuming that the primary purpose of padding fields here is to avoid false sharing, did you also try sun.misc.Contended?

ben-manes commented 8 years ago

Unfortunately @Contended in application code requires the JVM flag -XX:-RestrictContended or else it is ignored.

ben-manes commented 8 years ago

@Maaartinus I implemented your approach for RLE and ran a few cursory simulations to see the effect. The vast majority of run lengths is 1, then 2, and a few times 3-4. In the recency-skewed workloads where consecutive accesses would be most likely it dropped roughly 1 of every 30 increments. In a frequency workload it wouldn't even activate. The longest run I saw was 34 in a short LIRS recency trace.

The other negative I hadn't considered was that it reduces the sample period. If I add the run count (capped at 15) after an addition, it may over-increment the sample counter. That causes a reset (halving the frequencies) to occur more frequently. This would be noise in a large cache, but hurts small ones where resets happen too readily. If we determine the actual increment value then it adds more conditionals to get the minimum. All the extra logic probably negates the benefit of not doing a few extra hashes.

So I think my idea of run-length encoding was a flop. But it was a questionable idea to begin with.

I think the next step is to integrate the adaptive window technique into the cache. That had a nice boost for small caches with recency-skewed workloads.

The other task is to analyze the doorkeeper. My cursory experiments indicated that it could hurt small caches, but could greatly shrink the CountMin4 with large caches without impacting the hit rate. As the size increases I think so does the factor the sketch can be reduced by. This hypothesis needs to be validated and the slopes calculated. If we could determine an equation for calculating the sketch's size that would be pretty nifty. However we're talking about an overhead of a few megabytes for millions of entries (e.g. database), so it may not be worth it.

Maaartinus commented 8 years ago

The vast majority of run lengths is 1

That's too bad. An extension would be to remember the two most recent samples, but given the other problems it's probably not worth trying.

the benefit of not doing a few extra hashes.

Then I'd suggest some rather low-level optimizations to speed up the process. I haven't had time to do anything serious (and still haven't switched to Java 8), but I'm sure there's some gain possible:

See here for the ideas (nothing tested yet).

However we're talking about an overhead of a few megabytes for millions of entries (e.g. database), so it may not be worth it.

Right, but we may also gain some speed, if 1. the doorkeeper is way faster than the frequency and 2. it catches a good deal of accesses. But this is questionable, too.

Maaartinus commented 7 years ago

Hi Ben! I'm writing here, as it's the place where we stopped half a year ago. Finally, I'm on Java 8 and can run your benchmarks and inject my ideas.

It looks like I can get some speed improvement for the FrequencySketch; it just isn't as good as I hoped (maybe only 20%). Anyway, I believe that much more can be done.

Would you kindly answer these questions, so I can get on the right track?

ben-manes commented 7 years ago

Hey! I'm glad you're back (and on JDK8!). :)

  1. The BloomFilter works great thanks to our fixes, but the usages aren't robust enough to promote it yet. I had hoped it would serve two purposes,
    • Doorkeeper: This would reduce the FrequencySketch size (per the paper). But it increases the frequency bias of the cache, causing worse performance for recency-skewed traces.
    • Feedback: Evaluate the window entries rejected by TinyLFU and adjust the admission window size based on a sample period (increase on a 2nd rejection, decrease otherwise).

Unfortunately the feedback idea worked great for small traces, but not for a large trace (scarab). That meant I couldn't integrate the doorkeeper yet, but the sketch overhead is small so it's not pressing. See (4) for a new direction you might be interested in helping on.

  2. Conservative update is an option in the simulator, where I forked the FrequencySketch. It turns out not to have an observable effect (the diff is noise). This is probably because TinyLFU doesn't actually care about the counter value. Rather it cares about the difference between two frequencies in order to reject cold entries in favor of hot ones ("heavy hitters"). If two are so similar that a smaller error rate changes the decision, then they probably have an equal likelihood of reuse. That would mean either is a good choice, so the extra cost of CU isn't worthwhile.

  3. increment is called on a read or write, and frequency is called on an eviction (twice). So frequency is called much less often. Note that since this work is, by default, delegated to a background executor (the commonPool), it's usually not an observable cost.

  4. For adapting the admission window based on the workload, I recently had a new idea. At the end of the paper there is a chart of the hit rate vs. window size, which shows a very smooth curve. A hill climber algorithm could be a perfect fit for finding the optimal configuration dynamically, and adjusting when the workload changes. I wrote a very naive version using simulated annealing which works pretty well on small and large traces.

I think using a hill climber has a lot of potential (a rough sketch of the idea follows after this list). The current version still suffers from jitter due to sampling noise, which simulated annealing helps to alleviate somewhat. The only paper I found similar to this is from Intel, The Gradient-Based Cache Partitioning Algorithm, where they use a Chernoff bound rather than a fixed sample size. I don't have experience in this area, but I do think it's one of the most promising topics to focus on.

  5. Another area of interest is variable expiration. Typically this is done using an O(lg n) priority queue or by lazily leaving dead entries and relying on the maximum size. Both could be done easily in user code, neither is a very satisfying solution, and the current policies are O(1). A Hierarchical Timer Wheel could be the answer, but care is needed to ensure we don't often suffer from its degradation scenarios. This is an idea that I've read up on, but haven't prototyped.

Most likely (4) and (5) are the most valuable areas to focus on and could be fun to explore.
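
A minimal sketch of the climbing loop I have in mind (illustrative and much simpler than the simulator's version):

// After each sample period, compare the hit rate to the previous period and
// keep stepping the admission window in the direction that improved it.
final class SimpleHillClimber {
  private double previousHitRate = Double.NaN;
  private int step = 1;                 // +1 grows the window, -1 shrinks it

  /** Returns the adjustment (in entries) to apply to the admission window. */
  int adjust(double hitRate) {
    if (!Double.isNaN(previousHitRate) && (hitRate < previousHitRate)) {
      step = -step;                     // the last move hurt, so reverse direction
    }
    previousHitRate = hitRate;
    return step;
  }
}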

Maaartinus commented 7 years ago

0. Tons of questions

I'm pretty lost, but it's slowly getting better. Am I right with the following:

  1. The standard policy works like ...simulator.policy.sketch.WindowTinyLfuPolicy.
  2. Despite speaking about an admission policy, on a cache miss the key always gets inserted into data (i.e., into a ConcurrentHashMap when not simulating).
  3. The admission is about deciding if an EDEN or a PROBATION key should get evicted.
  4. The EDEN is typically small. It's helpful when a key gets accessed in short succession.
  5. "main uses Segmented LRU" - do you mean "main = probation + protected"?
  6. There are only the following transitions: new -> EDEN -> PROBATION <-> PROTECTED plus dropping from EDEN and PROBATION.
  7. There's nothing like PROBATION and PROTECTED in the paper. What would happen if there was just a single combined queue?
  8. While all three parts use LRU, LFU comes in when deciding about the promotion from EDEN to PROBATION.
  9. While using three queues for maintenance, there's just one ConcurrentHashMap used for the lookup.
  10. The counters can't be stored in cache entries as they're needed for absent keys as well.
  11. In order to keep the memory footprint small, they get stored based on the hash rather than the key.

1. BloomFilter

"Doorkeeper ... increases the frequency-bias of the cache" - but according to the paper, it only should save memory. It shouln't change the behavior, except due to the increased truncation error, should it?

"sketch overhead is small" - OK, I'll give it just one more shot or two. I'm just trying a byte[] instead of the long[] table, taking the low and high nibbles on the even and add steps, respectively.

This is like fixing one bit when choosing the nibble, so it shouldn't be a problem. I see now, you're constraining the nibble choice even more (start has 2 bits, while 16 would be needed for choosing the nibbles from 4 longs randomly).

Using a byte[] is a bit faster as it saves some shifting (reset is much more expensive, but it's rare enough). It'd allow using tables for the saturated increments (twice 256 bytes), but I'm rather sceptical about trading trivial computation for memory access (even when there should be hardly any L1 misses due to batching and the tiny table size).

I could imagine letting the FrequencySketch play a doorkeeper for itself: Depending on the state of one nibble, exit the increment loop early. For example, you could exit when its old value is even. This surely changes the way how frequency gets obtained and it may badly influence its quality. It may help the speed a lot, assuming that the vast majority of cases is not saturated.

2. Conservative update

I see, I missed the fork (i.e., the class I was playing with last year :D). "the extra cost of CU isn't worthwhile" - I had a benchmark where CU was faster, but that somehow doesn't make sense.

4. HillClimberWindowTinyLfuPolicy

I guess, this is what I'll play with when I understand more of Caffeine.

Is HillClimberWindowTinyLfuPolicy.headWindow the same thing as WindowTinyLfuPolicy.headEden?

5. Variable expiration

Also very interesting. When the request came, I read everything I could find about the Hierarchical Timer Wheel (which isn't much). I haven't looked at any implementation yet.

ben-manes commented 7 years ago

All questions are wonderful. I'm going to edit your message to number the sub-items.

ben-manes commented 7 years ago

You may feel a little overwhelmed, but you're nailing every question. Since Caffeine's code is complicated, it's very useful to disregard it and instead focus on the tooling (e.g. the simulator and sketches). That's been useful for me so that I can explore ideas separately from the cache itself.

0. Background

  1. Yes, with slight pragmatic differences (e.g. HashDoS)
  2. Yes. I use the term "admission window" where new entries always enter for evaluation. TinyLFU is an admission policy and in our setup it guards the main cache (99%). This makes it generational, whereas the original conference paper had no windowing.
  3. Yes.
  4. Yes.
  5. Yes. Segmented LRU is an eviction policy.
  6. Yes.
  7. It is discussed when describing SLRU, but briefly, since there is a paper referenced that describes it. If using a simpler LRU it performed nearly the same, and that was my original version. But I observed a small improvement in hit rates on the WikiBench with SLRU because a better victim was chosen. Since most workloads don't need a window, you can observe the effects with Lru_TinyLfu and the like.
  8. Yes.
  9. Yes.
  10. Yes. Other policies use "ghost entries" to keep absent keys in the hash table.
  11. Yes. Note that this leads to an interesting HashDoS attack we mitigate in the library.

1. BloomFilter

That was Roy and Gil's evaluation with their limited traces, and I got involved later (from reading their conference paper). Their traces were very Zipf-like, so frequency-biased, so there was no negative effect. I hadn't evaluated the doorkeeper until we added the BloomFilter. When I did, I saw it did negatively impact recency-skewed traces (this was most likely on small traces). My rough guess is that a more frequent reset (clear) hurts recency by more often seeing those entries as "one-hit wonders", which diminishes their counts.

It's interesting to hear byte[] is faster. I figured with hardware caching and CPU optimizations, the difference wouldn't be observable. But since reset is O(n) and we lack SIMD support, a long[] would benefit its bit manipulations.

Another idea is to store all counts for a key in a single slot. See One Memory Access Bloom Filters and Their Generalization and A Comment on .... This could increase the error rate but improve memory access patterns.

4. Hill Cimbers

Yes, headWindow was a rename of headEden in case Gil or one of Roy's students wanted to hack. Eden was my original term (as I was borrowing from GC generations) and when describing the idea to Doug Lea, he referred to it as windowing. That was a much more apt description, though I never cleaned up terminology in the code.

I tried to factor the code so that we could easily try different HillClimber types, especially someone less familiar with the code. The simulator code is lower quality (as experiments), so it could have bugs or be unclear. It's also fine to copy/paste there to keep experiments decoupled, since it's not production-oriented code.

FYI, Roy and Gil would want to write a paper on this improvement if we had good results. Gil would like to help next month (code, math, or wherever) when his current project winds down.

5. Variable expiration

@ifesdjeen wrote a nice implementation worth reading and has helpful pointers to references. If we had one it would be tickless ("deferrable") and we might follow Linux's variant.

However, it's an interesting question whether their worst case is our common case. The algorithm is optimized for timers not firing and instead being added/removed frequently. This would hold for time-to-idle expiration (any access) but probably wouldn't for time-to-live (on write). The latter is more useful by adding a "freshness" concept, e.g. we discard an entry if it is too stale and instead miss if read. An implementation would need to ensure it struck the right balance. Also note that there is a C project that claims O(1) in all cases, but I haven't investigated to understand the trade-offs it makes to get there.

Maaartinus commented 7 years ago

1. BloomFilter

My rough guess is that a more frequent reset (clear) hurts recency by more often seeing them as "one-hit wonders" and diminishes their counts.

The BloomFilter allows for a smaller FrequencySketch and this leads to more frequent resets, right? Maybe using longer counters could help? If you can make the FrequencySketch four times smaller and use six bits per entry, you still win a factor of 4/1.5 on the memory footprint and the reset frequency stays the same.

Its interesting to hear byte[] is faster. I figured with hardware caching and cpu optimizations, the difference wouldn't be observable.

I could avoid all shifting in increment, see https://raw.githubusercontent.com/Maaartinus/caffeine/514e7a0876ab6a4345a0d78203711ed350229075/caffeine/src/main/java/com/github/benmanes/caffeine/cache/MgFrequencySketch1.java

Benchmark                                    Mode  Cnt         Score         Error  Units
FrequencySketchBenchmark.frequency          thrpt   30  40489828.815 ±  406619.027  ops/s
FrequencySketchBenchmark.frequency1Regular  thrpt   30  52234258.368 ± 2345943.230  ops/s
FrequencySketchBenchmark.increment          thrpt   30  34641127.583 ± 1129743.478  ops/s
FrequencySketchBenchmark.increment1Regular  thrpt   30  62642190.431 ± 1928634.127  ops/s

A good part of the speedup comes from my sloppy hashing, which might be too sloppy. It all may be wrong, as I haven't tested it yet.

But since reset is O(n) and we lack SIMD support,

Do we? Isn't ByteBuffer.asLongBuffer helpful? Or Unsafe?

using a long[] would benefit its bit manipulations.

Actually, what I did is applicable to long[] as well. A single rotation per incrementAt could suffice, but unfortunately, we need both 0xfL << offset and 1L << offset.

Another idea is to store all counts for a key in a single slot.

I'm going to read the papers. It might be similar to my two-bits-per-long BloomFilter idea.

Another possibility to reduce the overhead might be treating the most frequent entries specially (in a sort of HashMap with no rehashing or chaining).

4. HillClimber

Yes, I think writing my own HillClimber should be easy once I get an idea. I see it only shuffles from maxWindow to maxProtected or back, so their sum is kept constant. There's no maxProbation, but wouldn't changing this sum possibly also be useful?

5. Variable expiration

One nice thing is that missing the timeout is no big deal. I'll read the links now.

ben-manes commented 7 years ago

1. BloomFilter

The BloomFilter allows for smaller FrequencySketch and this leads to more frequent reset, right?

Nope, the period is based on the cache size. I meant that a small trace is for a small cache, so it resets regularly. The doorkeeper reset is a clear of the bloom filter. That could mean often thinking arrivals are one-hit wonders, which is exactly what it's trying to filter out.

Also the reduction equation doesn't seem to be linear. If we reduce the FrequencySketch too much it has a high error rate and impacts the hit rate. For large caches a 4x reduction was fine, but only a 2x for small caches. So we'd want to come up with an equation for that I suppose.

A good part of the speedup comes from my sloppy hashing, which might be too sloppy. It all may be wrong, as I haven't tested it yet.

Nifty. That makes sense.

Do we? Isn't ByteBuffer.asLongBuffer helpful? Or Unsafe?

Good point. If we used those we'd have the best of both worlds.

Another posibility to reduce the overhead might be special treating the most frequent entries (in a sort of HashMap with no rehashing nor chaining).

Gil has a sketch called TinyTable, which is similar to a Cuckoo Filter. These have more expensive insertion costs but are very compact.

4. Hill Cimbers

There's no maxProbation, but wouldn't changing this sum be possibly also useful?

It should be, and we should try it once we have a good climber. That would only impact traces where LRU is optimal, but most would be somewhere in the middle. I think we should revisit this, but it won't be pressing until we have a solid climber.

5. Variable expiration

One nice thing is that missing the timeout is no big deal

Another is that the cost is usually not user-facing, since replaying is async. Most apps have spare CPU, so we can take a small hit without impacting latencies. But some use same-thread executors instead, so it's not a sure thing.

Maaartinus commented 7 years ago

1. BloomFilter

http://www.cise.ufl.edu/~tali/1569342377.pdf That's what I also did: multiple bits per word. The paper is more general and contains math. I'm afraid, it doesn't generalize to the counters well, as we have just 16 of them per word, but it could work.

http://sci-hub.cc/10.1109/TPDS.2014.2378268 Fixed formula, I may care about when in more mathematical mood.

http://sci-hub.cc/10.1109/2.268884 Cool! Isn't using 4-bit frequency counters sort of equivalent to using 16-way-segmented LRU cache? Every time an entry gets accessed, it gets promoted to a more protected area, from which the LRU entry is demoted. Sure, this does nothing for absent keys, but the analogy might lead somewhere.

Nope, the period is based on the cache size.

Do you mean the period in PeriodicResetCountMin4.ensureCapacity? It is, but so is the table size. And I can't see the table size being reduced when a Doorkeeper gets used. +++ I see now, I missed the countersMultiplier.

That could mean often thinking arrivals are one-hit wonders which its trying to filter out.

So the increased round-off error due to the Doorkeeper is more relevant for small sizes and bites us there, right?

Or should the period be bigger when the Doorkeeper gets used?

For large caches a 4x reduction was fine, but only a 2x for small caches.

A reduction factor of 1/0.75 for small caches should be easily doable (the reset seems to be cheap enough). Nothing like this can be done with the Doorkeeper, but we could try to reset it only every n-th period.

5. Variable expiration

I don't think that a Hashed Wheel should be used for expiration. For scheduling it's surely useful, but for expiration I don't think it's worth it. Imagine our time resolution is 1 ms and someone specifies an expiration exactly 1 hour away. Do we care about the exact time, or is it good enough to remove the entry one minute later? I guess the second, as the only advantage of doing it at the exact time is that a cache slot gets occupied for 60 minutes instead of 61.

IMHO using lower precision for far future is fully acceptable and reduces the overhead a lot. We need precise slots for the near future, but buckets like "60 to 61 minutes" or "120 to 122 minutes" are fine. After 60 minutes, we would re-label the buckets to "60 to 61 seconds" and "61 to 62 minutes", respectively (the ranges shrink to the right as slightly missing the timeout is what saves us from having to process entries repeatedly). We can still maintain exact expiration by checking the exact time on each read. The only cost of my proposed simplification is that the entry occupies the cache 2% of time longer than needed (assuming it doesn't get evicted earlier).

If I'm right, then we can use exponentially growing buckets and this way keep both their number and the overhead low. The above re-labeling is like moving a piece of a time wheel down in the hierarchy. The big difference is that we do want to miss the timeout by a few percent.

A bucket can be implemented as a doubly linked list, so we need two pointers per cache entry (plus the expiration time as long or int). Additionally, one pointer and one sentinel per bucket; I guess, depending on the precision, hundreds or thousands of buckets should do.
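
A rough sketch of the layout (illustrative; the exponential bucket widths and the periodic re-labeling are simplified away):

// Each bucket is a circular doubly linked list with a sentinel, so entries
// can be added and unlinked in O(1); farther deadlines land in wider buckets.
final class ExpirationBuckets {
  static final class Node {
    long deadlineNanos;
    Node prev, next;                  // the two pointers per cache entry
  }

  private final Node[] sentinels;

  ExpirationBuckets(int buckets) {
    sentinels = new Node[buckets];
    for (int i = 0; i < buckets; i++) {
      Node s = new Node();
      s.prev = s.next = s;            // empty circular list
      sentinels[i] = s;
    }
  }

  /** Farther deadlines map to exponentially coarser buckets. */
  int bucketFor(long delayNanos) {
    int bits = 64 - Long.numberOfLeadingZeros(Math.max(1L, delayNanos));
    return Math.min(bits, sentinels.length - 1);
  }

  void add(Node node, long now) {
    Node sentinel = sentinels[bucketFor(node.deadlineNanos - now)];
    node.prev = sentinel.prev;
    node.next = sentinel;
    sentinel.prev.next = node;
    sentinel.prev = node;
  }
}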

ben-manes commented 7 years ago

Cool! Isn't using 4-bit frequency counters sort of equivalent to using 16-way-segmented LRU cache?

It is similar. Facebook's evaluation was that 4-levels is the best fit. See the S4LruPolicy where the number of levels is configurable.

I can't see the table size being reduced when a Doorkeeper gets used.

See the reference.conf file, under count-min-4. There are multipliers to tune the sizes.

Variable expiration...

Your description sounds very much like Linux's improved variant. They have a similar scheme where the resolution degrades the further out the schedule is.

I'm also unsure why you think it's not worth it, because your argument and further thoughts match the ideas well. Exact expiration would be modeled by an O(lg n) priority queue, whereas timer wheels use coarse buckets instead of exact time. Then there are different layouts to choose based on the desired optimizations. So they too are very pragmatic, aiming to be good enough and fast. I think you're in agreement with the various authors.

Maaartinus commented 7 years ago

4. Hill Climbers

What traces are most interesting (really good or really bad) for them?

Can I run multiple traces and get separated outputs or some summary helpful for tuning the Hill Climber? I mean something like myHitRate / orgHitRate with best and worst and average values? I guess, not, but I'm asking before diving into the code as there's so much stuff there...

I guess, running traces of different formats at the same time is not supported yet?

5. Variable expiration

because your argument and further thoughts match the ideas well.

I guess I re-re-invented the wheel. There's probably a big mismatch between what they wrote and what I understood (and could recall). I guess my ideas correspond with the Hierarchical Wheel in the improved variant (but not with the Hashed Wheel), where "improved = sloppy", and my claim is that sloppiness is fully acceptable for cache expiration. So I don't think we can ever run into degradation scenarios (unless maybe with carefully prepared timeouts, where the programmer sabotages their own cache).

ben-manes commented 7 years ago

Sorry for another short response. It's late here so I'll be asleep soon, if the insomnia quits.

Right now multiple files per trace format is supported, but not multiple formats. That could be done using a prefix, e.g. "fmt:path", without much effort.
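
Something like this tiny helper would do for the prefix (purely illustrative, nothing like it exists in the simulator yet):

// Splits "fmt:path" into a (format, path) pair; plain paths keep the default format.
static String[] parseTraceSpec(String spec, String defaultFormat) {
  int i = spec.indexOf(':');
  // guard against Windows drive letters such as "C:\traces\foo.trace.gz"
  if (i <= 1) {
    return new String[] { defaultFormat, spec };
  }
  return new String[] { spec.substring(0, i), spec.substring(i + 1) };
}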

The stats might also be made more extendable. Currently I've used debug statements to prototype with. But a nicer scheme shouldn't be hard to add.

I think you're right about the timer wheels and pretty much everything else. :)

ben-manes commented 7 years ago

Here's some answers to your questions that I missed last night.

So the increased round-off error due to the Doorkeeper is more relevant for small sizes and bites us, right?

That's my guess, or it is something related to that.

Or should the period be bigger when the Doorkeeper gets used?

I'm hoping that adapting the window is still our best bet. The Doorkeeper would cause the climber to adjust to a larger window, so we could rely on that to correct for us.

Hill Climbers: What traces are most interesting (really good or really bad) for them?

For recency,

For frequency,

I was changing the default window size, running a trace, and watching it adapt in the desired direction. I would also concatenate two traces, e.g. glimpse + sprite + glimpse, and see that it adapted each time the workload changed.

I guess, not, but I'm asking before diving into the code as there's so much stuff there...

There is a lot to digest. The simulator is simple and easy to extend. I'd ignore Caffeine's caching code, as it's complex due to concurrency, features, etc. It might be fun to read, but it isn't important until we have good schemes nailed down.

You're ramping up really quick with great questions and insights. :)

Maaartinus commented 7 years ago

Hi Ben,

unfortunately, I have no time for this at the moment. It will probably get better in two weeks.

I understand enough of your hill climbing that I could try my ideas. But despite the problem being univariate and having a smooth curve, this optimization is difficult because of two things:

The noise can be countered by using a big sampling period and/or making small steps. An evaluation doing more than just comparing the two most recent results could help, too.

The history dependence is much worse as the current hit rate depends on changes done many steps ago. We know these changes, but we have no idea which of them were relevant for the current hit rate.

I guess, these problems are not as bad in real usage as in simulation. In simulation, you want to try many different settings and algorithms and you need it to run fast. In reality, a server running for many days could profit even from very slow optimization progress. Obviously, reacting to changing input characteristics is a much harder problem.


I tried a simple variable step optimization where the step size slightly increases on an improvement and not so slightly decreases otherwise. This failed badly because of the noise, which made both outcomes about equally frequent and so reduced my step size to a minimum.

I fixed it, but got no really good results.
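
For concreteness, a minimal sketch of the variable-step idea (the asymmetric factors are arbitrary and this is not the simulator's climber):

// Tunes one parameter; the step grows a little on an improvement and shrinks faster otherwise.
final class VariableStepClimber {
  private double value = 0.01;     // the parameter being tuned, e.g. the window fraction
  private double step = 0.01;
  private double lastHitRate = Double.NaN;
  private int direction = 1;

  // Called once per sampling period with the hit rate observed during that period.
  void onSample(double hitRate) {
    if (!Double.isNaN(lastHitRate)) {
      if (hitRate > lastHitRate) {
        step *= 1.1;               // slight increase on an improvement
      } else {
        step *= 0.5;               // not-so-slight decrease, and reverse course
        direction = -direction;
      }
    }
    lastHitRate = hitRate;
    value = Math.min(1.0, Math.max(0.0, value + direction * step));
  }

  double value() {
    return value;
  }
}

With noisy samples the two branches fire about equally often, so the step collapses towards zero, which is exactly the failure mode described above.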


I'm skeptical about simulated annealing and other non-deterministic optimizations, as there's already more than enough randomness there. Randomness gets used for escaping local minima, but that's not the problem here. Even if there were local minima, they'd be no problem because of the noise giving us a chance to jump out.

Nonetheless, there are traces for which SA works better than the other algorithm and better than what I tried. My claim is that this isn't because of its non-determinism, but rather because of other differences between the algorithms. Or worse: lacking a really good algorithm, some algorithm simply wins for a given input.

We'd need some way of evaluating the optimizer globally, i.e., over all the samples we can grab and also over some generated samples. This is surely something I could do. We'll need to select a single scalar value telling us which algorithm is better. It'll probably be some weighted miss rate, where longer traces and worse results count more.
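
As a strawman for such a score (the weighting below is only my guess at "longer traces and worse results count more"; TraceResult is a hypothetical holder of requestCount() and missRate()):

import java.util.List;

// Aggregates per-trace results into a single score; lower is better.
static double weightedScore(List<TraceResult> results) {
  double weightedSum = 0.0;
  double totalWeight = 0.0;
  for (TraceResult r : results) {
    // longer traces and worse results get more weight
    double weight = r.requestCount() * (1.0 + r.missRate());
    weightedSum += weight * r.missRate();
    totalWeight += weight;
  }
  return (totalWeight == 0.0) ? 0.0 : (weightedSum / totalWeight);
}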


Currently, I can't see an algorithm working well given the history dependence. Giving the algorithm more input than just the target function could help, but I don't know how to use the already provided input and I don't know how to provide other input efficiently. The idea is like "if we set the parameter higher, we wouldn't get this miss", but tracking this is complicated and we'd probably get other misses, which are even harder to track.

A funny idea could be to (optionally) collect traces in real usage and run an optimizer in a simulator using idle times of the application (or just in the background, assuming spare CPUs). This would work with a trivial optimizer like "simulate with a bit higher and/or a bit lower parameter, choose the better, rinse, repeat".

ben-manes commented 7 years ago

Being busy is understandable. I seem to be always in a crunch these days, too. I'm glad you're still thinking over the problems, as that's valuable too.

I agree that noise and history are two big problems. I tried using a weighted average of the hit rates, with poor results. Similarly, a dynamic pivot size didn't work well for me either. A large sample resulted in too little movement and an inability to correct well. A small sample resulted in too much noise, so the configuration improved but wasn't optimal.

Simulated annealing isn't quite right, but it did help reduce the impact of the noise. The poor version I wrote doesn't try to escape local minima, as you're right that that would be unnecessary. But the idea of cooling off with a temperature did help dampen the noise.

One idea is to adjust the sample size based on the temperature. When hot the sample is small to make rapid decisions. As it cools off, the sample increases to filter out the noise.
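
A sketch of that coupling, assuming a geometric cooling schedule and placeholder constants:

// Couples the sampling period to the annealing temperature: hot = short samples, cool = long.
final class CoolingSampler {
  private static final int MIN_SAMPLE = 1_000;    // events per decision when hot
  private static final int MAX_SAMPLE = 100_000;  // events per decision when fully cooled
  private double temperature = 1.0;               // 1.0 = hot, approaches 0.0 as it cools

  // Number of events to observe before the next climbing decision.
  int sampleSize() {
    return (int) (MIN_SAMPLE + (1.0 - temperature) * (MAX_SAMPLE - MIN_SAMPLE));
  }

  // Cool off a little after each decision.
  void cool() {
    temperature *= 0.95;
  }
}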

Intel's paper has an elegant solution that sidesteps the sample period problem and deals with the noise. They use a Chernoff Bound as a confidence function. I recommend reading it, as it sounds promising. It's probably worth implementing next to evaluate their approach.
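
The exact formulation in the paper may differ, but the gist is to only act once the observed difference is unlikely to be noise; a Hoeffding-style bound gives the flavor:

// Decides whether an observed hit-rate difference is trustworthy rather than noise.
static boolean confident(double hitRateA, double hitRateB, long samplesPerSide, double delta) {
  // with probability >= 1 - delta, each estimate is within epsilon of its true mean
  double epsilon = Math.sqrt(Math.log(2.0 / delta) / (2.0 * samplesPerSide));
  // act only when the gap cannot be explained by both estimates being off by epsilon
  return Math.abs(hitRateA - hitRateB) > 2.0 * epsilon;
}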

I think the hill climber idea has promise, but finding a robust solution is going to be tricky. :)

ben-manes commented 7 years ago

@Maaartinus Can you send me your email (private is okay) so that I can loop you into conversations with @gilga1983 and Roy?

Maaartinus commented 7 years ago

Sure: It's my name at gmail.com.

ben-manes commented 7 years ago

You mean Maaartinus? Sorry, I don't know your full name :)

Maaartinus commented 7 years ago

Sure. Sorry for the confusion. If you knew my real name, it'd be ambiguous (first, or last, or both, or both reversed, ...). I was assuming you don't know my full name and you'd assume I mean the only thing I know you know. ;)


I guess, for implementing the idea from "The Gradient-Based Cache Partitioning Algorithm" we'd need to be able to run the cache in a split mode, so that we can evaluate two settings in parallel. This should help a lot with the history problem: the first half always uses a smaller parameter than the other. So when the first half performs better, we know that "smaller means better". The only remaining uncertainty is "smaller than what", and it is only relevant when crossing the optimum value.
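
Roughly, I imagine something like this (ShadowCache is hypothetical; nothing like it exists in the simulator yet):

// Hypothetical interface: access(key) returns true on a cache hit.
interface ShadowCache { boolean access(long key); }

// Feeds every access to two half-sized shadow caches that differ only in the tuned parameter.
final class SplitEvaluator {
  private final ShadowCache smaller;  // e.g. window = w - delta
  private final ShadowCache larger;   // e.g. window = w + delta
  private long hitsSmaller, hitsLarger, events;

  SplitEvaluator(ShadowCache smaller, ShadowCache larger) {
    this.smaller = smaller;
    this.larger = larger;
  }

  void record(long key) {
    events++;
    if (smaller.access(key)) { hitsSmaller++; }
    if (larger.access(key)) { hitsLarger++; }
  }

  // Positive means "smaller is better", so the tuned parameter should move down.
  double gradient() {
    return (events == 0) ? 0.0 : (hitsSmaller - hitsLarger) / (double) events;
  }
}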

Using the magnitude of the change instead of its sign only is a pretty natural idea (which I sadly failed to try). I know that many algorithms (e.g., some neural network learning) ignore the magnitude and it helps them with badly conditioned problems - which is irrelevant for our univariate optimization. I guess, the magnitude helps a lot with the noise.

The Chernoff Bound is something I haven't understood yet, but it can't be too hard given their simple implementation. The idea of waiting for enough confidence is surely good.

All in all, I love the paper. I wonder if we really need the split, how complicated it is, and how running two smaller caches instead of a big one changes the hit rate.