ZiggyCreatures / FusionCache

FusionCache is an easy to use, fast and robust hybrid cache with advanced resiliency features.
MIT License
1.9k stars 97 forks source link

[FEATURE] πŸ”€ Add `SkipDistributedCache` option #94

Closed jodydonetti closed 1 year ago

jodydonetti commented 1 year ago

Scenario

With FusionCache it's very easy to handle 2 cache levels at the same time. The 1st level is a memory cache while the second is a distributed cache: we just specify a distributed cache during the initial setup and FusionCache will automatically coordinate the dance between the 2 layers, in an optimized way, without us having to do any extra activity.

Problem

Sometimes though, when using a 2nd layer, we may need to not use it, only for a couple of specific method calls. Today this is not possible: if you have setup a 2nd layer, it will be used in all method calls.

Some members of the community also asked for this.

Solution

The solution would be to add a new option in the FusionCacheEntryOptions class, namely SkipDistributedCache: when set to true the 2nd layer (distributed cache) will be totally ignored (it defaults to false).

On top of the option itself, an additional SetSkipDistributedCache(...) method on the FusionCacheEntryOptions class should be added to set the option value in a fluent-interface way, like the other already existing methods.

Backplane care

A particular point of interest is the use of a backplane without a distributed cache, because if we're not extra careful that combo may lead to unexpected results (see here).

When using a distributed cache + a backplane everything is fine, but by introducing the ability to skip the distributed cache for specific calls we are also introducing the possibility of unintendedly falling into that trap for those calls.

Because of this, the idea is for the SetSkipDistributedCache(...) method to have 2 bool params: one to drive the new option, and one to drive the option to also skip backplane notifications.

By using the common method to change the option, the user will also have to think about the consequences of that, and ideally avoid mistakes.

aKzenT commented 1 year ago

Í would add to this, that it would actually be helpful to be able to also skip the memory cache and go directly to the source for a few selected scenarios, basically having something like a forced refresh if I know from external information (e.g. webhook) that the caches are stale.

I could just get the information myself and call "Set" of course, but that would mean that I loose all other benefits that FusionCache gives me like avoiding cache stampedes.

jodydonetti commented 1 year ago

The new v0.18.0 has been released πŸŽ‰

It includes this new feature, too.

jodydonetti commented 1 year ago

I could just get the information myself and call "Set" of course, but that would mean that I loose all other benefits that FusionCache gives me like avoiding cache stampedes

Hi @aKzenT , I'm not following you here.

Cache stampede is only needed when doing a get+set sequence, since it involves the execution of the factory when a lot of callers are requesting a piece of data. If instead you execute the factory and call a Set normally, it would not be needed.

The other features of FusionCache (soft/hard timeouts, backplane notifications, background execution on distributed cache operations, events, etc...) are all normally available during any other available operation, so you'll not loose anything.

Am I missing something here? Do you have maybe a practical example so that I can understand better your scenario?

aKzenT commented 1 year ago

What I'm trying to say is that in some situations I might know for some reason that the caches are no longer valid. For example a webhook was called to inform me that the source has been updated or I have a user interface where the user can trigger a "Refresh". It is true that I could just call the factory myself and "Set" the data. If I'm not careful this could result in a stampede if the trigger is called a lot of times in a short time and I call the factory directly every time.

Thinking about it however, maybe a better solution to the problem would be a way to programmatically mark an entry as stale (in Level 1 or Level 2 or both) so that the next time the item is requested, it will go directly to the next higher level and also prevent cache stampedes etc. Or alternatively a "Refresh"-Method that basically does a get from factory + Set while ensuring that only one refresh for the same key is running at the same time.

Personally I came upon such a scenario while trying to implement a functionality for flushing the second layer cache but preferably not having to actually delete the entries from the cache completely before ensuring that a fresh version from the source is available (i.e. keeping the cache as a fallback value).

jodydonetti commented 1 year ago

Ok, now I get it, thanks.

FusionCache already has a way to mark an entry as expired (without actually deleting it) so that it can be used as a fallback for future failsafe-enabled operations, it's just that it's private as of today. In fact such method is what it's currently being used by the backplane when it receives a notification from another node.

I always thought about the possibility of making it public, but I wanted to wait until a concrete use case arrived, to avoid making it public when it was not actually needed (and maybe choosing the wrong name or signature).

I think the time has finally come 😬

So yeah, I'll open an issue to track the design and considerations of it, stay tuned!

jodydonetti commented 1 year ago

Hi @aKzenT , I was thinking again about this, and have some considerations to share with you.

What I'm trying to say is that in some situations I might know for some reason that the caches are no longer valid. For example a webhook was called to inform me that the source has been updated or I have a user interface where the user can trigger a "Refresh".

Ok, makes sense.

It is true that I could just call the factory myself and "Set" the data. If I'm not careful this could result in a stampede if the trigger is called a lot of times in a short time and I call the factory directly every time.

Well if I got this right, the answer is "yes" and "no": let me explain.

Let's say that a webhook is called, and that in turn calls a factory and set the data: the node upon which the factory has been called will have its local memory cache updated, and the same is true for the distributed cache. But all the other nodes will just be notified (of course assuming you are also using a backplane) of the fact that a change has happened for that cache entry, but they would NOT go to the db or whatever to get the data: that would happen only IF and WHEN that cache entry will be requested in those nodes, and in that case they would find the fresh entry in the distributed cache.

Of course if there are something like 10 webhook calls in rapid succession all of them would go to the database, that is true. But, even in the case of the hypothetical new "Evict" method, the problem would kinda remain because all of those calls would need to talk to the distributed cache nonetheless (to tell it "hey, this entry is now stale!"), so the only "skipped" calls would be the ones towards the db.

Now - and this is key - it's also true that a stampede typically happens because a ton of requests to GET the same piece of data arrive at the same time (realistic access pattern), whereas here in your scenario we are talking about a ton of requests to UPDATE the same piece of data at the same time, which I don't really know if is something realistic.

πŸ™‹β€β™‚οΈ Do you have maybe any real-world experience about this scenario?

Thinking about it however, maybe a better solution to the problem would be a way to programmatically mark an entry as stale (in Level 1 or Level 2 or both) so that the next time the item is requested, it will go directly to the next higher level and also prevent cache stampedes etc.

As I said above, FusionCache already have such a method, even though only internally (it's not public): but because of the needs it covers, it only works on the memory cache, and not on the distributed cache. Extending the behavior to also cover the distributed cache would mean for FusionCache to get the cache entry from the distributed cache + changing the expiration + setting it again so, in the end, this would need 2 distinct calls to the distributed cache (because we can rely only on the available methods on the core abstractions IMemoryCache and IDistributedCache).

πŸ™‹β€β™‚οΈ Having said all of this, do you think this would be a good design decision?

Or alternatively a "Refresh"-Method that basically does a get from factory + Set while ensuring that only one refresh for the same key is running at the same time.

In theory I can see this as a new overload of the existing Set method, where instead of passing the value to set directly it would pass a factory (lambda) like in the GetOrSet methods (where you can already pass either a direct value or a factory): in this case the same cache stampede prevention mechanism may be applied.

Something like this (simplified code):

// EXISTING
public void Set<TValue>(string key, TValue value, ...);

// NEW
public void Set<TValue>(string key, Func<TValue> factory, ...);

But I can already see one potential problem with this approach though, in the form of some (nuanced) unexpected behaviors. This is because the call would basically mean "IF nobody else is already setting this cache entry right now, call the factory and set the result, otherwise ignore it".

Let's make an example.

After 10 Set calls with a direct value (the existing one), in the end the value in the cache would be the one of the LAST call, which is the expected behavior (the last "write" call will overwrite all the previous values).

Instead after 10 Set calls with a factory (the supposedly new one), in the end the value in the cache would not necessarily be the one of the last call, but the one of the FIRST call in the LAST locking window (which, depending on the timing of the calls, may even be the very first call at all) which as said can be VERY unexpected. I mean, I can already see reactions like "hey, I made a Set call and FusionCache totally ignored it!".

Instead, in a GetOrSet call, it is implicit that the "set" part may NOT be called, whereas here it would not be that clear.

Even using a different name (like TrySet(...) or something like that) may not be a solution.

So my fear is that this may generate more problems that what it would help solving.

πŸ™‹β€β™‚οΈ Opinions?

Personally I came upon such a scenario while trying to implement a functionality for flushing the second layer cache but preferably not having to actually delete the entries from the cache completely before ensuring that a fresh version from the source is available (i.e. keeping the cache as a fallback value).

Makes sense, but as I highlighted above to do that you would need to GET + change + SET from the distributed cache, so 2 distributed operations (again because the core abstractions do not have such methods available) an so I'm wondering if that would actually be that good.

πŸ™‹β€β™‚οΈ What do you think?

Thanks!

aKzenT commented 1 year ago

Now - and this is key - it's also true that a stampede typically happens because a ton of requests to GET the same piece of data arrive at the same time (realistic access pattern), whereas here in your scenario we are talking about a ton of requests to UPDATE the same piece of data at the same time, which I don't really know if is something realistic.

πŸ™‹β€β™‚οΈ Do you have maybe any real-world experience about this scenario?

We use Contentful as a CMS for example which for preview/staging triggers a webhook basically with every keystroke. Taking into account multiple editors working at the same time and considering that one webhook call can result in multiple entries being invalidated and that requesting new content from the factory can take several seconds, this can result in some congestion. However, you are right that this causing a stampede resulting in an outage would be rare. It would however still be nice if this scenario is something that FusionCache could optimize instead of having to built it ourself. Or at least for FusionCache to provide this kind of abilities for custom plugins (see #98). Depending on how this would be implemented and what method would be used, I would also expect subsequent Get calls to wait for the updated result just as they normally would. Calling the factory myself it would not be possible to configure how Gets for the same key should behave.

As I said above, FusionCache already have such a method, even though only internally (it's not public): but because of the needs it covers, it only works on the memory cache, and not on the distributed cache. Extending the behavior to also cover the distributed cache would mean for FusionCache to get the cache entry from the distributed cache + changing the expiration + setting it again so, in the end, this would need 2 distinct calls to the distributed cache (because we can rely only on the available methods on the core abstractions IMemoryCache and IDistributedCache).

πŸ™‹β€β™‚οΈ Having said all of this, do you think this would be a good design decision?

I see your point here. With a thin abstraction over IMemoryCache and IDistributedCache it is dificult to implement this without having to do expensive calls. Using more complex aproaches you could make this cheaper, e.g. by storing an expiration timestamp in a separate key which would of course also need to be read on every get, but which could probably be done relatively cheap at least in redis because of automatic batching. For our scenario it would even be enough to have a single stale marker for the entire cache group as we tend to flush the complete group instead of individual entries. So I could have a key storing the last flush date for the group and I would consider all entries stale on Get when they are older than this flush date.

But I understand that these are things that might not be straightforward enough to implement in the core library. Unfortunately as explained before the current plugin library also makes it impossible to build it as a custom plugin. What I would need would be access to the internal cache instances, hooks to intercept Get/Set calls in which I can make additional calls and influence whether an entry is considered stale or not, access to the backplane to send custom messages to other nodes like "flush timestamp was updated", etc. In this sense, plugins could also work as a kind of incubator for things to try out before they enter into the core library.

Besides that I think the Refresh-Method would still be worthwhile as an alternative or interim solution as in my opinion it should be easier to implement. It would also be something that would be very useful to have for plugins like the one just described.

Instead, in a GetOrSet call, it is implicit that the "set" part may NOT be called, whereas here it would not be that clear.

Even using a different name (like TrySet(...) or something like that) may not be a solution.

So my fear is that this may generate more problems that what it would help solving.

πŸ™‹β€β™‚οΈ Opinions?

That is why I proposed the "Refresh" call to make it clear that it is distinct from Set. However, after some thinking, maybe having this as an option on GetOrSet would make the most sense after all for the following reasons:

  1. The semantics would be the same as GetOrSet. You do not have a guarantee that the factory is called.
  2. It would be clear that subsequent gets would queue and wait for the factory to finish. No need to explain the interactions between different methods.
  3. Since we will call the factory and will have the result of the factory, we can just return it to the caller without any additional cost. So instead of just discarding the factory result after storing it in the cache, we can support more scenarios by returning it.

That being said, there is one scenario that might need consideration. If we really have 10 calls in a short amount that specify a "SkipCache" flag, one after the other and we are assuming that this is the result of some outside activity to invalidate the cache because of changes, it might not be desired result to just accept the result of the first call and return that for subsequent calls that happened during factory retrieval as this result may already be stale again. So for correctness, when specifying "Skip=true" you could only wait for calls that have not yet been made, not for calls that are in progress. So if you have 10 calls all happening in a very short amount, you would need at least two calls, one for the very first request and another one that is queued and executed after the first call finished and which can satisfy all request that have been made until then. While that may sound complicated, I think it could be implemented relatively straightforward. Let me know if you want to discuss details on how it could work internally.

jodydonetti commented 1 year ago

Hi @aKzenT , I just dropped a new issue for a new Expire() method.