dgryski / go-tinylfu

TinyLFU cache admission policy
MIT License

Size of the sketch #11

Closed · dominikh closed this 1 year ago

dominikh commented 1 year ago

I'm wondering about the size of our CM sketch. The paper and the implementation in Caffeine mention using 8 bytes per cache item, i.e. 16 counters, and that does indeed seem to be the case: https://github.com/ben-manes/caffeine/blob/1eb57acfc7ae9bed04d6f43061dd3246956aaea2/caffeine/src/main/java/com/github/benmanes/caffeine/cache/FrequencySketch.java#L92

Meanwhile, we seem to be using 0.5 bytes per item, i.e. 1 counter? https://github.com/dgryski/go-tinylfu/blob/cadd5d68bd66ccd16f66a0955aeadbc585328bcb/cm4.go#L61

Surely that's not right? Am I missing something?

dgryski commented 1 year ago

Yeah, I think the sketch sizing calculations are different. Let me work through this:

For go-tinylfu, given a cache size we round up to the next power of two and allocate depth=4 nybble vectors, each rounded-power-of-two counters wide. That gives 4 * roundUp(size) total 4-bit counters.

Caffeine seems to end up creating a single array of counters sizeof(long) * roundUp(size) bytes wide; at two 4-bit counters per byte, that's a total of 2 * 8 * roundUp(size) = 16 * roundUp(size) counters.
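To make the arithmetic concrete, here's a rough sketch of the two sizings (the helper names are illustrative, not either library's actual API):

```go
package main

import (
	"fmt"
	"math/bits"
)

// nextPow2 rounds n up to the next power of two.
func nextPow2(n uint64) uint64 {
	if n <= 1 {
		return 1
	}
	return 1 << bits.Len64(n-1)
}

func main() {
	size := uint64(100_000) // hypothetical cache size

	// go-tinylfu: depth=4 rows, each roundUp(size) 4-bit counters wide.
	tinylfuCounters := 4 * nextPow2(size)
	// Caffeine: roundUp(size) longs, each holding 16 4-bit counters.
	caffeineCounters := 16 * nextPow2(size)

	fmt.Printf("go-tinylfu: %d counters (%d bytes)\n", tinylfuCounters, tinylfuCounters/2)
	fmt.Printf("caffeine:   %d counters (%d bytes)\n", caffeineCounters, caffeineCounters/2)
	// For this size: 524288 vs 2097152 counters, a 4x difference
	// (~2 bytes/item vs ~8 bytes/item of sketch).
}
```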

Yup, our CM-Sketches should be bigger. I probably misread the paper.

dominikh commented 1 year ago

I'll give fixing it a go if you aren't working on it.

ben-manes commented 1 year ago

I believe you were able to cut the sketch size because you implemented the doorkeeper optimization (a bloom filter). That keeps one-hit wonders from polluting the sketch, so you could shrink it with similar results. I also observed in the simulator that a doorkeeper could reduce the sketch to 1/2 or 1/4 of its size with similar performance. So I think this change (growing the sketch) is probably undesirable and wasteful, since you showed good hit rates in your testing.
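For readers unfamiliar with the trick: the doorkeeper is a small bloom filter consulted before the sketch, so a key has to be seen at least twice before it occupies counters. A minimal Go sketch of the idea, with hypothetical types (this is not the actual go-tinylfu or Caffeine code):

```go
package main

import "hash/fnv"

// bloom is a toy two-hash bloom filter over 2^16 bits.
type bloom struct{ bits [1 << 10]uint64 } // 1024 words * 64 bits = 65536 bits

// hashes derives two bit positions for key from one 64-bit FNV hash.
func (b *bloom) hashes(key string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(key))
	v := h.Sum64()
	const mask = 1<<16 - 1
	return v & mask, (v >> 32) & mask
}

func (b *bloom) get(i uint64) bool { return b.bits[i/64]&(1<<(i%64)) != 0 }
func (b *bloom) set(i uint64)      { b.bits[i/64] |= 1 << (i % 64) }

// cmSketch stands in for the real 4-bit count-min sketch.
type cmSketch map[string]int

type filter struct {
	door   bloom
	sketch cmSketch
}

func newFilter() *filter { return &filter{sketch: make(cmSketch)} }

// record counts one access. A first-time key only sets doorkeeper bits;
// only a repeat occupies sketch counters, which is what keeps one-hit
// wonders out of the sketch.
func (f *filter) record(key string) {
	h1, h2 := f.door.hashes(key)
	if f.door.get(h1) && f.door.get(h2) {
		f.sketch[key]++
		return
	}
	f.door.set(h1)
	f.door.set(h2)
}

// frequency folds the doorkeeper membership back into the estimate.
func (f *filter) frequency(key string) int {
	est := f.sketch[key]
	if h1, h2 := f.door.hashes(key); f.door.get(h1) && f.door.get(h2) {
		est++
	}
	return est
}
```

In the real implementations the sketch uses packed 4-bit counters rather than a map, and the doorkeeper is cleared whenever the sketch ages (halves its counters).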

Caffeine doesn't use the doorkeeper optimization because it only increased the frequency bias, and I was concerned about the impact on recency workloads. The windowing was a partial solution, but still underperformed in recency-skewed workloads. I also wasn't sure what a good sizing was, because it varied with the workload and the cache size, so a formula that balanced frequency, recency, and memory seemed hard. Later on I realized that I could use hill climbing to adapt the window size, which solves the recency/frequency bias. That means we could adopt the doorkeeper optimization and cut Caffeine's sketch by 50-75% without degrading the hit rate, since the climber would push toward a stronger recency adaptation when needed.
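For context, the hill climbing described here is a simple feedback loop: sample the hit rate over a period, keep stepping the window size in the same direction while the hit rate improves, and reverse direction when it degrades. A hedged sketch of that loop in Go (the step size and clamps are illustrative, not Caffeine's actual values):

```go
// climber adapts the window fraction by hill climbing on the hit rate.
type climber struct {
	prevHitRate float64
	step        float64 // signed step, e.g. +0.0625 of the cache
}

// adjust takes the current window fraction and the hit rate sampled
// over the last period, and returns the new window fraction.
func (c *climber) adjust(window, hitRate float64) float64 {
	// If the last move made things worse, walk the other way.
	if hitRate < c.prevHitRate {
		c.step = -c.step
	}
	c.prevHitRate = hitRate

	window += c.step
	// Clamp to an illustrative 1%..80% range.
	if window < 0.01 {
		window = 0.01
	}
	if window > 0.80 {
		window = 0.80
	}
	return window
}
```

As I understand it, Caffeine's version also decays the step size over time so the climber settles, but the direction-flipping core is the same idea.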

So far no one has ever complained about the memory overhead of the cache's data structures, so the doorkeeper optimization has not seemed necessary. That's probably because Java, like Go, is less likely to be used for a massive cache store, e.g. a large memcached node, where every ounce of metadata overhead should be inspected. For an API / middle-tier server, more akin to 4-8 GB, the caches are moderate and may serve as a local cache in front of a remote cache tier. That makes it hard to justify spending the time on a deep analysis, and maybe degrading the hit rate in an untested workload, when the memory overhead is negligible. For another ecosystem, say replacing Postgres' buffer cache, a doorkeeper might make a lot of sense and be worthwhile, especially with the other problems solved. As an engineer collaborating with researchers, it just did not seem to be where I should spend my time, and I never got back to it.

dominikh commented 1 year ago

Increasing the sketch size had no effect for a number of traces, but improved hit rates by 5-10% on some, especially at small cache sizes. I don't have the exact numbers anymore, but I could reproduce them if you want.

There is probably room for optimizing the exact number of counters, but 8 bytes per item doesn't seem outlandish to me for a baseline.