ben-manes / caffeine

A high performance caching library for Java
Apache License 2.0

Expose method that invokes getIfPresentQuietly #229

Closed wburns closed 6 years ago

wburns commented 6 years ago

A while back Infinispan talked about asking for a peek method. At the time you gave some other ways [1] that didn't quite do what we needed. However looking at Caffeine recently I found that you have a method that does exactly what we need defined on the LocalCache interface [2]. The problem is that this interface is not exposed publicly.

Is it possible you could expose a way to invoke this method, or something like it? Thanks.

[1] https://github.com/infinispan/infinispan/pull/4886#discussion_r102742968 [2] https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/LocalCache.java#L84

ben-manes commented 6 years ago

It does become a rat's nest with all the variations of "quiet" methods one might want. Does "quiet" mean not recording stats, or not touching the eviction and expiration policies? Does it notify the writer / removalListener / etc.? How many variations are supported (get, put, compute)? Does one use one-off methods, flags, or a flag builder?

I have tended to hide valuable but non-core methods in Cache.policy() and, to a lesser extent, Cache.asMap(). So if we could find a way of fitting it into the Policy then I'd probably be more open, since the ad hoc messiness is hidden from most users. There is a lot to be said for keeping the conceptual weight of an API small.

I am also unsure exactly what you need. Is it just a getIfPresent(key) that does not record a cache hit? If so, the penalty is perhaps polluting the cache, but we don't know the real impact on the hit rate without testing. The policy is fairly resistant to malicious requests, so if it's just that then I wouldn't be too concerned about impacting the hit rate.

wburns commented 6 years ago

What we have internally is basically just a getIfPresent(key) that doesn't record a cache hit, as you mentioned.

This comes in a couple of flavors: some uses are unrelated to user interaction, and others may require additional accesses per operation.

An example unrelated to user interaction is our expiration reaper, which runs periodically to remove expired entries. Due to the nature of clusters, max idle is a very precarious setting: we have to query all nodes to make absolutely sure that an entry is expired before removing it. In this case we just want to look at the entry without updating the access policies, since this isn't a real user interaction. In many cases the entry may be removed soon after, and an access followed so closely by a removal can mess with recency-based eviction algorithms.

Also, all of our write operations first perform a get on the underlying container, as we have various interceptors that may change the behavior of those operations. We try to limit this so that write operations don't cause multiple access occurrences in the policy. In some scenarios we may have up to 3 cache hits (2 reads and 1 write). I don't know whether 3 hits in quick succession like that might promote entries prematurely.

Unfortunately I can't read the document you linked to, but from what I can see the eviction algorithm should do well against malicious requests. We don't really have those, just repeated accesses against the same key. I am just a bit worried about promoting keys more often than they should be.

But I defer to you if you think these extra accesses shouldn't cause too much trouble.

ben-manes commented 6 years ago

I can send you the paper if you email me, since I probably shouldn't attach it. They ran malicious workloads that tried to pollute a cache and saw how well the policies coped. You could view non-user operations as similar noise, especially full scans. As you said, these can have a greater impact on recency-biased policies than on frequency-based ones. W-TinyLFU did quite well, so I suspect the impact wouldn't be too bad if you continue to perform the operations naively.

If helpful, also remember that asMap() has computation methods, so you can do a read and conditional remove as an atomic operation.
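For instance, a read plus a conditional remove can be expressed with ConcurrentMap.computeIfPresent. This is a minimal sketch using a plain ConcurrentHashMap; since Cache.asMap() returns a ConcurrentMap view, the same idiom applies to a Caffeine cache. The expiration-timestamp bookkeeping here is a hypothetical stand-in for Infinispan's own metadata:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class AtomicPeekAndRemove {
    public static void main(String[] args) {
        // Stand-in for cache.asMap(); Caffeine's map view has the same atomicity guarantees.
        ConcurrentMap<String, Long> map = new ConcurrentHashMap<>();
        map.put("session", 100L); // value is a hypothetical expiration timestamp

        long now = 150L;
        // Atomically inspect the entry and remove it only if it has expired;
        // returning null from the remapping function removes the mapping.
        map.computeIfPresent("session", (key, expiresAt) ->
            expiresAt <= now ? null : expiresAt);

        System.out.println(map.containsKey("session")); // prints "false"
    }
}
```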

It might also be worth mentioning Caffeine's variable expiration. I'm not sure how much that helps beyond not having to scan the cache, since you still have to do the distributed validation. But it does provide a hook for more efficient processing.
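Variable expiration is configured through Caffeine's Expiry interface, which computes a per-entry duration in nanoseconds. A minimal sketch, where the fixed five-minute lifetime is an arbitrary choice for illustration:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Expiry;
import java.util.concurrent.TimeUnit;

public class VariableExpiration {
    public static void main(String[] args) {
        Cache<String, String> cache = Caffeine.newBuilder()
            .expireAfter(new Expiry<String, String>() {
                @Override
                public long expireAfterCreate(String key, String value, long currentTime) {
                    // Per-entry lifetime; a real policy could derive this from the entry.
                    return TimeUnit.MINUTES.toNanos(5);
                }
                @Override
                public long expireAfterUpdate(String key, String value,
                        long currentTime, long currentDuration) {
                    return currentDuration; // keep the remaining time on update
                }
                @Override
                public long expireAfterRead(String key, String value,
                        long currentTime, long currentDuration) {
                    return currentDuration; // reads do not extend the lifetime
                }
            })
            .build();

        cache.put("k", "v");
        System.out.println(cache.getIfPresent("k")); // prints "v"
    }
}
```

Because entries carry their own deadlines, the cache can expire them in amortized O(1) time rather than requiring a periodic scan.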

I would ask that as a first step you might capture a problematic trace so that we can evaluate the impact. It just seems worth making sure we spend time on the right things rather than rely on our gut feelings, since both of our arguments seem reasonable at the moment.

I also think a fuzzy, best-effort approach is probably okay. Eviction and Expiration are fairly non-deterministic policies that clients shouldn't make hard assumptions about. That gives us a lot of leeway that usually masks many of these types of problems, which is why I am not overly concerned.

ben-manes commented 6 years ago

@wburns any thoughts on the above? I'm fine either exploring this with you or closing until later. If there aren't negative experiences and it is just a concern, then I'd prefer closing until we can confirm a problem. I am fine adding something to Policy someday if we conclude that is the right solution.

wburns commented 6 years ago

I wrote up something, but then abandoned it and got distracted :)

My only issue with a trace is that the hit rate will look quite different from Caffeine's perspective versus Infinispan's. Since we will be polluting the Caffeine side with extra gets, it may show an inflated hit rate. I could generate something from Infinispan's perspective in that case, though. And to be honest, one of the problems is that I am a bit disconnected from users' workloads at this point, which makes generating a trace hard. Although I wonder if the person from https://issues.jboss.org/browse/ISPN-9023 could eventually share theirs :) I am still not sure why memory-based (i.e. weighted) eviction is giving him trouble.

At this point my worry is more about correctness and, as you said, is more of a gut feeling. As such, you can probably just close this.

The one benefit, as I have said to others and you even mentioned, is that eviction is not deterministic with regard to which entries are kept; we just need to verify that we have the correct number in the cache.

ben-manes commented 6 years ago

Oh, I remember that issue! It's unfortunate that he didn't respond to requests for a reproducible test for us to investigate with.

Technically we would want a trace from Infinispan's (user facing) and Caffeine's (internal) perspective. Then when we replay and the hit rates differ significantly, we know that the extra accesses are confusing the policy. Then it is a lot easier to decide on a fix. If, on the other hand, the two hit rates are similar then we know there isn't a problem.
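A cheap way to observe the inflation on the Caffeine side is to build the cache with recordStats() and compare its statistics against the user-facing count. A sketch of the idea, where the extra internal reads are simulated with plain getIfPresent calls:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class HitRateComparison {
    public static void main(String[] args) {
        Cache<String, String> cache = Caffeine.newBuilder()
            .recordStats() // enable hit/miss accounting
            .build();
        cache.put("k", "v");

        // One user-facing read plus two internal reads: Caffeine records 3 hits,
        // while the user-facing trace would record only 1.
        cache.getIfPresent("k"); // user read
        cache.getIfPresent("k"); // simulated internal pre-write read
        cache.getIfPresent("k"); // simulated internal interceptor read

        System.out.println(cache.stats().hitCount()); // prints "3"
    }
}
```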

In Caffeine we have 3 regions: the admission window, probation, and protected. The admission window is LRU, so an entry's recency has to age out. If the inflated hits cause the entry to pass TinyLFU's filter, it moves into probation; if it isn't used again there, it will age out. The worst case is that the hits are inflated after TinyLFU's filtering, which puts the entry in the protected region, where it takes the longest to be removed. But that means it was pretty useful in the past, so the harm shouldn't be too high.

The problems we have seen are with recency-biased traces where frequency is a negative signal. We have experimental ideas for an adaptive policy that resizes the regions. I can send you a pre-publication paper on those ideas so far. I'd love some help here if you get bored 😉, since it's still an open problem.

Maaartinus commented 6 years ago

I may be talking nonsense, but no risk no fun, so bear with me.

In some case scenarios we may have up to 3 cache hits (2 reads and 1 write). I don't know if 3 hits like this right away might be promoting entries preemptively.

The original bug report describes the pattern MISS-HIT-HIT-HIT when it works correctly. This means there are three hits on the wanted entry. When Infinispan also accesses the entry three times, possibly much later, then it theoretically makes sense to keep the old entry instead of replacing it, as it also leads to three hits.

This doesn't explain the difference between count-based and size-based eviction. However, it might explain how the late accesses by Infinispan can keep an entry needlessly. I guess that for these accesses it doesn't matter whether they hit or miss, so getIfPresentQuietly may solve the problem.

On second thought, it's probably nonsense. The cache is big enough for 100 entries, so even when the admission window is just 1%, the most recently read entry must be kept for the next three reads.

ben-manes commented 6 years ago

Yep. The admission window has a minimum size of 1 to avoid the surprise of being immediately evicted. If those are consecutive accesses to the same entry, then it will have to be present.

The author was trying to cache images, saying they are ~100 KB each. I'm not sure what the unit is for <binary eviction="MEMORY" size="1000000" />, but if it is bytes (1 MB) that would hold only 10 entries, making the cache 10x smaller than the other configuration, <binary eviction="COUNT" size="100" />. That misconfiguration would explain the different hit rates.

Another possibility is that some images are much larger than 100 KB. If the cache's capacity is smaller than an item's size, then that item is evicted immediately; there is no point in flushing the entire cache in that case. So he might have a few large images that hurt the hit rate but don't demolish it.

Unfortunately without his test case we don't know.

ben-manes commented 4 years ago

Released 2.8.3, which includes cache.policy().getIfPresentQuietly(key). This reads the value without updating any cache metadata.
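A small sketch of the released API, showing that the quiet read returns the value without registering a hit in the statistics:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class QuietRead {
    public static void main(String[] args) {
        Cache<String, String> cache = Caffeine.newBuilder()
            .recordStats()
            .build();
        cache.put("k", "v");

        // Returns the value without updating statistics, the eviction policy,
        // or the entry's expiration time; null if absent.
        String value = cache.policy().getIfPresentQuietly("k");

        System.out.println(value);                    // prints "v"
        System.out.println(cache.stats().hitCount()); // prints "0"
    }
}
```

A plain cache.getIfPresent("k") at the same point would have recorded a hit and touched the access metadata, which is exactly what the reaper-style use case above wants to avoid.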