ben-manes / caffeine

A high performance caching library for Java
Apache License 2.0

Does a caching algorithm benefit from a cost function, if there is a significant difference in retrieval cost of the cached items? #345

Closed layoutanalysis closed 5 years ago

layoutanalysis commented 5 years ago

We plan to implement a continuous reload feature for our cache in an OLAP application (see https://github.com/ben-manes/caffeine/issues/338). Some of the cached database queries have response times in the seconds-to-minutes range, while others (< 5%) might take hours. By default, Caffeine's TinyLFU algorithm only considers the frequency and recency of the cached items. Should we also take the query cost into consideration, or even switch to a cost-based algorithm?

I've checked related discussions in the other issues here and came up with a list of pros and cons:

PROs:

CONs:

What is the current research status on this topic?

ben-manes commented 5 years ago

This will be a bit of a long answer, sorry for that.

What is the current research status on this topic?

There is very little actionable research on factoring the "cost" of an entry into the eviction decision. Not only is it a hard problem in itself, but there is also very little public data to work from. The papers on the subject often use private traces, and the few public traces are very old (e.g. SNIA strongly warns that its traces are no longer valid and should not be used, yet researchers still do). This means it is difficult to test theories, papers cannot be easily validated, and good results are not representative due to unintentional cherry-picking.

Direct algorithms

The most common "cost" factor is to use the entry's size and then optimize for the byte hit rate. This is attractive for file serving, e.g. CDNs, which is what those papers tend to focus on. The GDS family of policies (GDSF, GD-Wheel, GDS-LC, CAMP, etc.) are the leaders in this area and worth reading into.
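As a rough illustration of the idea behind that family: GreedyDual-Size gives each entry a priority H(p) = L + cost(p) / size(p), where L is an inflation value raised to the priority of whatever was last evicted. The sketch below is just that rule in isolation (it skips re-prioritizing entries on hits and any frequency term); the class and names are mine, not code from any of those papers.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** A simplified sketch of a GreedyDual-Size style eviction queue; illustrative only. */
final class GreedyDualSize<K> {
  private record Entry<K>(K key, double priority) {}

  private final PriorityQueue<Entry<K>> queue =
      new PriorityQueue<>(Comparator.comparingDouble((Entry<K> e) -> e.priority()));
  private double inflation; // L, raised on each eviction

  /** On insert (and on a hit, in the full algorithm): H(p) = L + cost(p) / size(p). */
  void onInsert(K key, double cost, double size) {
    queue.add(new Entry<>(key, inflation + (cost / size)));
  }

  /** Evict the entry with the smallest H(p) and raise L to its priority. */
  K evict() {
    Entry<K> victim = queue.poll();
    inflation = victim.priority();
    return victim.key();
  }
}
```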

There are some newer policies, like LHD and AdaptSize, but their performance varies significantly on the public traces. On the public CDN trace used by AdaptSize, it is slightly worse than Caffeine, which does not use size. LHD matches GDS, but since the trace is small this is not a valid comparison, as it does not trigger LHD's adaptivity (whereas DS1 is large and there LHD underperforms). LHD is also very slow; most academic work focuses on an idea and may be impractical for real usage (though later insights may lead to practical versions).

@ohadeytan has been exploring how to enrich TinyLFU with size/cost awareness. I believe the lack of traces has been a big frustration, as he has some interesting ideas but it is hard to validate them. For my part, improving TinyLFU was an iterative process of playing with ideas until stumbling on the right one, dissecting it to understand why it works, and refining it. Any traces you can provide would be beneficial.

ML-based

The recent trend is to try to use ML, so far with modest success. I think this is promising and will be fruitful in a few iterations. There are a few papers like RL-Cache, LeCaR, DeepCache, and FFN.

Most policies work on a single dimension and may try to infer a second one. For example, ARC and LIRS work by recency and try to infer frequency from their recency lists. TinyLFU is frequency-based, with LRU lists to infer recency. We took that a step further by using hill climbing to tune how long to delay applying TinyLFU's filtering based on the hit rate, which adapts to a workload-specific recency vs. frequency balance.
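To make that adaptivity concrete, a minimal sketch of hit-rate hill climbing looks something like the following; the class, the knob, and the step size are illustrative, not Caffeine's internals.

```java
/**
 * A minimal sketch of hit-rate hill climbing, assuming a tunable admission window
 * and periodic sampling; names and step size are illustrative, not Caffeine's code.
 */
final class HillClimber {
  private double previousHitRate;
  private double step = 0.05; // fraction of the cache to shift toward or away from the window

  /** Returns how much to grow (positive) or shrink (negative) the window this period. */
  double adjust(long hits, long misses) {
    double hitRate = (double) hits / Math.max(1, hits + misses);
    if (hitRate < previousHitRate) {
      step = -step; // the last move hurt the hit rate, so reverse direction
    }
    previousHitRate = hitRate;
    return step;
  }
}
```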

When you want to optimize multiple dimensions (frequency, recency, entry size, latency, etc.), that is where ML shines. In particular, reinforcement learning is the topic to explore. I have recently started Sutton's book, which you might enjoy.

I believe the structure used by Caffeine is the right one (window / admittor / main; adaptivity). Enriching the admittor to take the different signals into account should then result in a very good predictor. This is the direction RL-Cache is heading, which makes it look promising to me. Their code is available, but I have not tried it yet.

A flaw with ML algorithms is that they can be expensive (slow training, large footprint). This makes them a nicer fit for CDNs, given the large traces and offline training. Hopefully that can gradually become applicable to the general case by using online training for small to large caches.

Workarounds

There is nothing Caffeine will do on this front, so unfortunately it's an interesting topic but won't be solved here. Since this is an in-memory local cache, you might consider a remote cache like memcached or redis. That won't solve the algorithmic problem, but it gives you more space to cache and lets multiple servers reuse the results. For optimal speed, you can then use Caffeine as a local layer that looks up and populates the remote.
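As a hedged sketch of that layering with a LoadingCache (the Result type, the remoteCache client, and runQuery are placeholders for whatever redis/memcached client and database access you already have):

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

// A local layer that consults the remote cache first and falls back to the database;
// remoteCache, runQuery, and Result are placeholders for your own client and query code.
LoadingCache<String, Result> local = Caffeine.newBuilder()
    .maximumSize(10_000)
    .expireAfterWrite(Duration.ofMinutes(10))
    .build(key -> {
      Result cached = remoteCache.get(key);   // e.g. a memcached/redis lookup
      if (cached != null) {
        return cached;
      }
      Result fresh = runQuery(key);           // the expensive database query
      remoteCache.put(key, fresh);            // share the result with other servers
      return fresh;
    });
```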

layoutanalysis commented 5 years ago

Thank you for the detailed response, it helped a lot!

ben-manes commented 4 years ago

@layoutanalysis I'm curious about your CON,

hard to specify a universally valid cost function (database response times can vary depending on various factors)

What factors are at play and is there any logical way to partition the keys? For example if you had multiple external data stores and some were very slow, you could partition the keys by that backend. If you gave each backend its own cache, then a meta-layer like Robinhood could dynamically re-allocate between them.
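As a rough sketch of that partitioning (the backend names, sizes, and String key/value types are placeholders for illustration):

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.util.Map;
import java.util.function.Function;

/** Each backend gets its own cache, sized for its latency; illustrative names only. */
final class PartitionedCaches {
  private final Map<String, LoadingCache<String, String>> byBackend;

  PartitionedCaches(Function<String, String> fastLoader, Function<String, String> slowLoader) {
    LoadingCache<String, String> fast =
        Caffeine.newBuilder().maximumSize(50_000).build(fastLoader::apply);
    LoadingCache<String, String> slow =
        Caffeine.newBuilder().maximumSize(5_000).build(slowLoader::apply);
    byBackend = Map.of("fast-backend", fast, "slow-backend", slow);
  }

  /** Route each lookup to the cache for the backend that serves it. */
  String get(String backend, String query) {
    return byBackend.get(backend).get(query);
  }
}
```

A meta-layer in the spirit of RobinHood would then periodically shift capacity between these partitions, favoring whichever backend is contributing the most to request latency.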

If it's opaque by the time it reaches the caching layer, then the limitations discussed previously apply. If not, then you might be able to dynamically rebalance multiple caches based on those additional insights.

ben-manes commented 3 years ago

@layoutanalysis do you think that you could help me acquire access traces for this type of workload? I’d like to investigate cost aware policies but there is not much available to analyze.

I suspect that a reasonably good approach is to use TinyLFU and admit by comparing cost * freq of the candidate and the victim. While perhaps not as precise as a GDS-based policy, I think it would come close with much less overhead. Some workloads to judge with would be invaluable.
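For what it's worth, a sketch of that admission rule might look like the following; the frequency and cost functions are assumed inputs (e.g. a TinyLFU-style popularity estimate and an observed load latency), not anything Caffeine currently exposes.

```java
import java.util.function.ToDoubleFunction;
import java.util.function.ToIntFunction;

/** A sketch of cost-weighted admission: admit only if cost * frequency beats the victim. */
final class CostAwareAdmittor<K> {
  private final ToIntFunction<K> frequency;   // assumed TinyLFU-style popularity estimate
  private final ToDoubleFunction<K> cost;     // assumed per-entry cost, e.g. observed load latency

  CostAwareAdmittor(ToIntFunction<K> frequency, ToDoubleFunction<K> cost) {
    this.frequency = frequency;
    this.cost = cost;
  }

  /** On eviction, compare the cost-weighted frequency of the candidate and the victim. */
  boolean admit(K candidate, K victim) {
    double candidateScore = cost.applyAsDouble(candidate) * frequency.applyAsInt(candidate);
    double victimScore = cost.applyAsDouble(victim) * frequency.applyAsInt(victim);
    return candidateScore > victimScore;
  }
}
```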