ben-manes / caffeine

A high performance caching library for Java
Apache License 2.0

LHD #232

Closed · adamretter closed this issue 6 years ago

adamretter commented 6 years ago

Might be of interest to you http://www.cs.cmu.edu/~beckmann/publications/papers/2018.nsdi.lhd.pdf

ben-manes commented 6 years ago

Thanks, I read it a few weeks ago. Unfortunately the C code is not released in a runnable form, which makes it more painful to implement for comparison. I am skeptical of their decision to compare only against policies that lack historical insight, since theirs lacks it too; that choice likely hides where it underperforms. The workloads chosen are not a representative set. Researchers rarely make their implementations easy to run, so unless the results are repeatable it's worth being a cautious believer of their assertions.

adamretter commented 6 years ago

@ben-manes Okay cool. Shame about the C code. Nice to hear your take on it :-)

ben-manes commented 6 years ago

I would like to see this implemented in the simulator, as there are probably new insights to learn from. I don't know when I'll have the time. Since they didn't open-source their simulator, it will be a bit more work to verify a correct version.

http://www.cs.cmu.edu/~beckmann/publications/code/2018.nsdi.lhd.tgz

Maaartinus commented 6 years ago

I really can't rate their algorithm, but I find the idea of including the size valuable. When you just count the misses, evicting bigger entries earlier definitely makes sense (unless the bigger entry is much more likely to cause a hit). As the ratio between entry sizes grows, this becomes more and more important.

However, just counting the number of misses needn't be enough. It reminds me of the 0-1 knapsack problem, as not only the size (weight) of an entry can vary, but also its value. Actually, all caches I know about solve the problem assuming v_i = 1 while maintaining sum w_i * x_i <= W, but without using the weights for anything else. Or maybe they optimize assuming v_i = w_i?
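
(Spelled out in standard knapsack notation, with x_i marking whether entry i is kept resident:)

```latex
\max \sum_i v_i\, x_i
\quad \text{s.t.} \quad
\sum_i w_i\, x_i \le W, \qquad x_i \in \{0, 1\}
```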

In some cases, the user could provide the value of an entry, e.g., as the time it took to generate it.

As a real example, I'm caching the JSON responses of my server. Their sizes vary a lot, and therefore I choose not to cache responses above a threshold. The time it takes to generate them is strongly correlated with their size, but some queries are clearly disproportionately expensive.
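
Roughly what that looks like with Caffeine's size-based eviction (the 512 KB threshold, the key type, and the `byte[]` value type are just placeholders for my setup):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public final class JsonResponseCache {
  private static final int MAX_CACHEABLE_BYTES = 512 * 1024; // hypothetical threshold

  private final Cache<String, byte[]> cache = Caffeine.newBuilder()
      .maximumWeight(256L * 1024 * 1024)                  // total budget in bytes
      .weigher((String key, byte[] json) -> json.length)  // w_i = response size
      .build();

  /** Stores the response only if it is below the size threshold. */
  public void put(String key, byte[] json) {
    if (json.length <= MAX_CACHEABLE_BYTES) {
      cache.put(key, json);
    }
  }

  public byte[] getIfPresent(String key) {
    return cache.getIfPresent(key);
  }
}
```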

ben-manes commented 6 years ago

Sorry if I came off negative. I do like their ideas and think it could work well, but it needs a more robust evaluation. Most papers I've read make bold claims and, when I test them, there are significant flaws the authors failed to mention. Sometimes this is new information, but it can feel intentional when the paper has an arrogant tone.

I would like to see this algorithm in our simulator. I think it would offer new insights on adaptivity and, like you said, could let us use weight as part of the eviction choice. Since they don't retain history, and given their published data, I think it can only match a classic policy. TinyLFU's idea of using a sketch and aging it could be adapted as storage (vs filter), instead of retaining the per-entry counters. That might work, but we'd need to experiment across lots of workloads to know.
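
Roughly what I mean, as a toy sketch (not Caffeine's actual FrequencySketch, just the concept of a counting sketch whose counters are periodically halved for aging):

```java
import java.util.concurrent.ThreadLocalRandom;

/** Toy count-min style sketch whose counters are periodically halved (aged). */
final class AgingFrequencySketch {
  private static final int DEPTH = 4;
  private final int[][] counters;
  private final int[] seeds;
  private final int width;
  private final int sampleSize; // age once this many increments have occurred
  private int additions;

  AgingFrequencySketch(int width, int sampleSize) {
    this.width = width;
    this.sampleSize = sampleSize;
    this.counters = new int[DEPTH][width];
    this.seeds = new int[DEPTH];
    for (int i = 0; i < DEPTH; i++) {
      seeds[i] = ThreadLocalRandom.current().nextInt() | 1; // odd multipliers for hashing
    }
  }

  void increment(long key) {
    for (int i = 0; i < DEPTH; i++) {
      counters[i][index(key, i)]++;
    }
    if (++additions >= sampleSize) {
      age();
    }
  }

  /** Estimated frequency; the minimum across rows bounds over-counting. */
  int frequency(long key) {
    int min = Integer.MAX_VALUE;
    for (int i = 0; i < DEPTH; i++) {
      min = Math.min(min, counters[i][index(key, i)]);
    }
    return min;
  }

  /** Halve every counter so stale popularity decays over time. */
  private void age() {
    additions = 0;
    for (int[] row : counters) {
      for (int j = 0; j < row.length; j++) {
        row[j] >>>= 1;
      }
    }
  }

  private int index(long key, int row) {
    long hash = (key ^ (key >>> 32)) * seeds[row];
    return (int) ((hash & Long.MAX_VALUE) % width);
  }
}
```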

The papers on policies that take a cost function (weight, load time) only compare against similar policies. It's not clear, though, whether a policy with a higher byte hit rate is better than one with a higher hit rate. I'd guess that some users would prefer favoring large weights and others small ones, while others just normalize to equally sized chunks instead (e.g. video). I've also found that user-supplied cost functions are usually pretty bad (e.g. favor paying customers).
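
For clarity, the two metrics being weighed against each other (a toy tally over a replayed trace):

```java
/** Tallies object hit rate and byte hit rate over a replayed trace. */
final class HitRateStats {
  private long hits, requests;
  private long hitBytes, requestBytes;

  void record(boolean hit, long sizeInBytes) {
    requests++;
    requestBytes += sizeInBytes;
    if (hit) {
      hits++;
      hitBytes += sizeInBytes;
    }
  }

  double hitRate()     { return (double) hits / requests; }         // favors many small objects
  double byteHitRate() { return (double) hitBytes / requestBytes; } // favors fewer large objects
}
```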

Maaartinus commented 6 years ago

I surely agree that the cost function is non-trivial. The load time can be measured exactly, but it depends on the current load, so the cost may get overrated a lot. Such entries could then be kept in the cache far too long. No idea if this really happens, but some sanity check can be devised (in my use case, maybe classifying the requests according to the pathInfo, computing a linear regression per category, and limiting the cost for outliers).
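
Something like this is what I have in mind, as a rough sketch (using a running mean and standard deviation per category instead of a full regression; the categorization by first path segment and the 2-sigma cap are arbitrary choices):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Caps measured load times per request category so outliers don't dominate the cost. */
final class LoadTimeCost {
  private static final class Stats {
    long count;
    double mean;
    double m2; // sum of squared deviations, for Welford's online variance

    synchronized double observeAndCap(double loadMillis) {
      count++;
      double delta = loadMillis - mean;
      mean += delta / count;
      m2 += delta * (loadMillis - mean);
      double stddev = (count > 1) ? Math.sqrt(m2 / (count - 1)) : 0.0;
      double cap = mean + 2 * stddev; // limit the cost for outliers
      return Math.min(loadMillis, cap);
    }
  }

  private final Map<String, Stats> byCategory = new ConcurrentHashMap<>();

  /** Returns the cost to use for this entry, capped relative to its category. */
  double costOf(String pathInfo, double loadMillis) {
    String category = categorize(pathInfo);
    return byCategory.computeIfAbsent(category, c -> new Stats()).observeAndCap(loadMillis);
  }

  private String categorize(String pathInfo) {
    int slash = pathInfo.indexOf('/', 1);
    return (slash > 0) ? pathInfo.substring(0, slash) : pathInfo; // first path segment
  }
}
```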

I could also run experiments with different cost functions, but currently the load is too low for this. Anyway, I guess I can arrive at a useful cost function in the end.

Since they don't retain history, and given their published data, I think it can only match a classic policy.

IIUIC, for distributions where it doesn't badly misbehave (the way LRU fails on a sequential scan) and using v_i = 1, it can easily win against any algorithm ignoring the weights when the weights vary a lot. The other thing is that the assumption v_i = 1 is unrealistic.

I guess with v_i = w_i, ignoring the weights in the policy is just right.

TinyLFU's idea of using a sketch and aging it could be adapted as storage (vs filter), instead of retaining the per-entry counters.

Sorry, I don't understand.

ben-manes commented 6 years ago

Sorry, I'd have to review knapsack again on v/w and how it relates to LHD. So I'm only lightly following that at the moment and you might need to dumb down what you're thinking a bit for me. :)

LHD does not retain history after items are evicted, so it only knows what happened to the working set while the item was present. For most policies that hurts how well they perform, because once an entry is evicted any insight about it is lost. Typically that insight is frequency, so it takes longer to recover if a poor victim was chosen. Since the ideal policy should adapt to mimic the optimal (MRU, LFU, etc.), it will make mistakes and need to correct for them. My guess is that without maintaining any history outside of the working set (as ghosts or counters) it cannot recover as well. Sampling also has the unfortunate chance of being forced to choose a victim from among only valuable items. In a loopy trace, sampled MRU has half the hit rate of linked MRU, probably due to poor sampling choices. My thinking is that if LHD was augmented with history, it might do even better. But the first step is that someone needs to implement it :)
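
The sampling style of eviction looks roughly like this (a generic sketch, not LHD's actual implementation; `rankOf` stands in for whatever hit-density estimate a policy uses, and a real implementation would sample from an array of entries rather than copy the key set on every eviction):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.ToDoubleFunction;

/** Evicts by sampling a handful of resident entries and removing the lowest-ranked one. */
final class SampledEviction<K, V> {
  private final int sampleSize;
  private final ToDoubleFunction<V> rankOf; // e.g. an estimated hit density

  SampledEviction(int sampleSize, ToDoubleFunction<V> rankOf) {
    this.sampleSize = sampleSize;
    this.rankOf = rankOf;
  }

  /** Picks a victim among randomly sampled entries; returns null if the map is empty. */
  K selectVictim(Map<K, V> resident) {
    List<K> keys = new ArrayList<>(resident.keySet());
    if (keys.isEmpty()) {
      return null;
    }
    K victim = null;
    double worst = Double.POSITIVE_INFINITY;
    for (int i = 0; i < sampleSize; i++) {
      K candidate = keys.get(ThreadLocalRandom.current().nextInt(keys.size()));
      double rank = rankOf.applyAsDouble(resident.get(candidate));
      if (rank < worst) { // lower rank = expected to contribute fewer hits
        worst = rank;
        victim = candidate;
      }
    }
    return victim;
  }
}
```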

Regarding weights...

Maaartinus commented 6 years ago

Your confusion is surely my fault. I'll try to imprecisely rephrase both problems so that my thinking gets clarified.

So I see the caching problem as a "dynamic knapsack" where items get inserted and excluded on every step. It's a far-fetched analogy, but the meaning of v_i and w_i matches well.

Caffeine has w_i but no v_i as its input. For the caching problem as I described it, it works well when v_i = w_i.

When the weights, the values, and the ratio v_i / w_i all vary a lot, an algorithm taking them into account has an unfair advantage.

Thank you for your answer and the links, I'll answer later (this will take quite some time).

ben-manes commented 6 years ago

Thanks for the refresher! That all sounds familiar now. 😄 It hadn't clicked that v_i was from the cost function, but that's a great observation.

Where it differs is that recency and frequency are additional variables, and their impact varies across workloads. Would you consider that to be part of an equation to compute v_i? Regardless, v_i would require some form of aging to smooth out the function (e.g. if the load time was slow due to a temporary network timeout).
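
For example, a simple way to age the measured cost so a single slow load doesn't stick forever, assuming an exponential moving average (the 0.2 smoothing factor is arbitrary):

```java
/** Exponentially smoothed load-time cost, so one slow measurement decays away. */
final class SmoothedCost {
  private static final double ALPHA = 0.2; // higher = reacts faster, forgets sooner
  private double ema = Double.NaN;

  /** Folds a new measurement into the running estimate and returns it. */
  synchronized double record(double loadMillis) {
    ema = Double.isNaN(ema) ? loadMillis : (ALPHA * loadMillis + (1 - ALPHA) * ema);
    return ema;
  }
}
```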

I do think that I need to review optimization algorithms again, especially the more advanced ones. At work we're starting to run into the nurse scheduling problem for dispatching and routing, so I should probably get a few good books to prepare for that eventuality.

Maaartinus commented 6 years ago

Where it differs is that recency and frequency are additional variables, and their impact varies across workloads. Would you consider that to be part of an equation to compute v_i?

No, for me v_i is an input to be set by the user. It's the degree of happiness :D that the user says they gain every time the item i is wanted and present. Recency and frequency are heuristics you use to maximize the gain, and you may want to use an internal v_i' which takes them into account and is subject to aging and other adjustments. For the success evaluation, the user-supplied v_i should be used as is.

aging to smooth out the function (e.g. if load time was slow due to a temporary network timeout).

In theory, this should be the user's responsibility, as it's the input, a parameter of the target function rather than a heuristic. OTOH, preventing users from shooting themselves in the foot can be important, and aging can be a simple solution. I guess I'd use some faster aging initially and would like to switch to slower aging once I can handle outliers well enough.

Actually, when the user wants aging, then it's a different target function.

Currently, I don't have time to read the papers or to think seriously about it all. :cry:

ben-manes commented 5 years ago

I spent some time hacking their C++ simulator to get a comparison. They use a custom binary format and don't provide the translation from SNIA's traces, so I didn't try a trace where the entry size varies. Instead, I wrote a translator from my trace reader to their format, printed the keys they read, and verified that they were identical. I also checked that their LRU hit count matched mine, so the comparison is accurate. I did not adjust any of the LHD parameters from their defaults (admissionSamples, assoc), though on small traces I didn't see any difference when they were changed.

In summary, the policy is competitive with ARC, but can fail badly where LIRS and TinyLFU do well. In particular, a loopy trace like glimpse shows poor scan resistance. That makes it less interesting for SQL and analytical workloads. I couldn't find a case where LHD was a top performer. In the adaptive stress test (loop + corda), their "dynamic ranking" did not impress, whereas Caffeine is near optimal.

I suspect that the varying entry size traces won't prove to be much better. The same group authored AdaptSize at roughly the same time, which is their policy optimized for those workloads. A while back I got curious and compared TinyLFU to AdaptSize using their simulator and CDN trace. Caffeine trounced them, painfully so.

I think LHD has some interesting characteristics, but their implementation is annoyingly slow and fails to distinguish itself. I don't think it is worth porting to Java for future comparisons, and the paper was full of exaggerations (e.g. see the concurrency section, which their code does not implement). It did better than I expected, but I see no reason to use it in a practical setting when better policies are easier to implement.

(Hit rate charts attached for the glimpse, ds1, s3, and oltp traces.)

ben-manes commented 4 years ago

I took another look at this when running against the MSR traces, to compare against the paper's charts (page 7). Their simulator doesn't have a parser for this format, so I cannot be certain that mine is equivalent for the same trace file.

(Chart attached: LHD results on the MSR traces.)

In the case of usr_1, the claimed hit rate at 256GB is between 80-90%. That exceeds an unbounded cache's 68% (a 32% miss rate), where the only misses are compulsory ones and nothing is ever evicted. Similarly, proj_1 has an unbounded hit rate of 33.7%, yet they claim to achieve a hit rate of over 80%. They could only achieve that hit rate if they prefetched the contents, e.g. the cache was warmed. That matches neither the behavior of their simulator nor their description.
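
(For reference, the bound being applied here: a demand-filled cache that starts cold misses at least once per distinct key, so no policy can beat the unbounded cache's hit rate.)

```latex
\text{hit rate} \;\le\; 1 - \frac{\text{distinct keys}}{\text{total requests}}
  \;=\; \text{unbounded cache hit rate}
```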

The numbers simply do not make any sense, and I am unable to debug further because the authors are unresponsive. I cannot see how their hit rate claims are factual and, sadly, they may not be.

adamretter commented 4 years ago

@ben-manes thanks for the thorough investigation :-)