[BUG] Failsafe seems to always trigger and doesn't seem to expire

ZiggyCreatures / FusionCache

FusionCache is an easy to use, fast and robust hybrid cache with advanced resiliency features.

MIT License

1.9k stars 97 forks source link

[BUG] Failsafe seems to always trigger and doesn't seem to expire #108

Closed fkuhnert closed 1 year ago

fkuhnert commented 1 year ago

Describe the bug On my current setup, when using a FusionCache with IsFailSafeEnabled = true, the fail-safe is always active and ignores the FailSafeMaxDuration value.

A value is being set, and even after it expires both on the regular cache and on the fail-safe (per the FailSafeMaxDuration value), every GetOrSet call is defaulting to the fail-safe.

I am currently using the FusionCache to cache some configs that are obtained remotely through an API. I am not using any distributed caches, only memory. The program tries to reload the IConfiguration every 5 seconds, and each reload call uses a GetOrSet on an Object that has an IConfiguration inside it.

The code is as follows:

return await _configCache.GetOrSetAsync($"tenant/{tenantId}",
                async _ => new Tenant { Id = tenantId, Config = await GetTenantConfigs(tenantId) });

Also, my _configCache object is configured as follows:

_configCache = new FusionCache(new FusionCacheOptions
{
    DefaultEntryOptions =
    {
        Duration = TimeSpan.FromSeconds(10),
        IsFailSafeEnabled = true,
        FailSafeMaxDuration = TimeSpan.FromSeconds(20),
        FailSafeThrottleDuration = TimeSpan.FromSeconds(1),
        FactorySoftTimeout = TimeSpan.FromMilliseconds(100), // I tried with both 100 and 1000 and both produced the same result
    }
});

What I expected from the fail-safe is that it would trigger only if something inside the Factory (async _ => new Tenant { Id = tenantId, Config = await GetTenantConfigs(tenantId) }) threw an exception, timed out or something similar. However, the GetTenantConfigs function is called once only, and each subsequent GetOrSet being called gets the expired value, even after the 20 second failsafe duration has passed.

I'm pretty sure this is an issue with the fail-safe because once it is turned off, the cache works perfectly as expected (GetTenantConfigs gets called every 10-15 seconds).

The factory isn't throwing any exceptions or timeouts.

To Reproduce

Use the above settings on the FusionCache
Call a function inside the Factory

Expected behavior Since the cache duration is set to 10 seconds, it expires after 10 seconds, so if you call GetOrSet every 5 seconds, it should call the function inside the factory every 10-15 seconds. The failsafe is never triggered since no exceptions are thrown.

Current behavior The function is called once. The cache is never renewed, even after the FailSafeMaxDuration has passed.

Versions I've encountered this issue on:

FusionCache version 0.18.0
.NET version 6.0.111
OS: Ubuntu 22.04

jodydonetti commented 1 year ago

Hi @fkuhnert and thanks for reporting this.

I'll be able to take a look at this during the weekend, not sooner, but meanwhile I'd like to suggest you to create a minimal console app or similar with just the above call, enable logging with a min level of Trace (the lowest level) and see what FusionCache tells you during the GetOrSet call.

With logging enabled, FusionCache can log a lot of internal details, like "I'm trying to acquire a lock" or "I've decided to enable fail-safe because whatever" and similar, so it gets easier to do some detective work to understand what is going on.

Regarding this code:

return await _configCache.GetOrSetAsync($"tenant/{tenantId}",
                async _ => new Tenant { Id = tenantId, Config = await GetTenantConfigs(tenantId) });

what code can I use as the implementation of the GetTenantConfig() method? Do you have a stub I can use that works similarly? I'd like to have something realistic on my side, to reproduce this locally.

Let me know so I can investigate this, and will let you know!

Thanks!

fkuhnert commented 1 year ago

Okay, apparently this was an error on my behalf, I apologize :P

In the original program, I wrote some separate logic that only called GetOrSet on the cache if it was both expired and had a valid tenant. I wrote it as follows:

TryGetAsync on the object, if it's there just return straight away
If TryGetAsync fails, check if the identifier is on the list → never fails because it gets the expired value
If it's on the list, call GetOrSet, if not throw an exception

Apparently, TryGetAsync gets expired cache values just fine. Is there a way to prevent this? As in, TryGet only returning true if the value is unexpired, while also using the failsafe? Please let me know if this is just plain incorrect usage of the FusionCache :P

One thing that is confusing me is the MaxDuration on the failsafe isn't exactly being respected. With the previous (wrong) code, it logs this every time after the regular cache expires, even after the failsafe duration (20 seconds) expires:

dbug: ZiggyCreatures.Caching.Fusion.FusionCache[0]
      FUSION (O=fc8f91ffbde74096b73e5cc7463d9c9f K=tenant/prix): calling TryGetAsync<T> (null)
dbug: ZiggyCreatures.Caching.Fusion.FusionCache[0]
      FUSION (O=fc8f91ffbde74096b73e5cc7463d9c9f K=tenant/prix): memory entry found (expired) FE[FFS=Y LEXP=-2s]
dbug: ZiggyCreatures.Caching.Fusion.FusionCache[0]
      FUSION (O=fc8f91ffbde74096b73e5cc7463d9c9f K=tenant/prix): saving entry in memory MEO[CEXP=20s PR=N S=1] FE[FFS=Y LEXP=3s]
dbug: ZiggyCreatures.Caching.Fusion.FusionCache[0]
      FUSION (O=fc8f91ffbde74096b73e5cc7463d9c9f K=tenant/prix): return SUCCESS

Per the line saving entry in memory MEO[CEXP=20s, maybe it is saving the expired value back on the memory and treating it as valid?

jodydonetti commented 1 year ago

Okay, apparently this was an error on my behalf, I apologize :P

Ahaha, no worries 😀 !

Apparently, TryGetAsync gets expired cache values just fine. Is there a way to prevent this?

Try to disable fail-safe in the TryGetAsync call: fail-safe is (also) used as a way to enable fallback on stale data even without a factory (so like a TryGetAsync method). Honestly? I'm thinking about removing that, because it doesn't feel 100% right to do that if there's not a factory that "fails". At the same time though it can make sense to be able to have the fallback on stale data even in a TryGetAsync call, if one like. The idea I'm currently playing with is to keep using fail-safe as a way to fallback on stale data, even in a TryGetAsync call, BUT with the important distinction that, since it is not a "set" call, it would only RETURN stale data (again, only if fail-safe is enabled) and not SAVE it, even just temporarily.

Give me some more time and I will update you when I'll have my mind cleared about this.

One thing that is confusing me is the MaxDuration on the failsafe isn't exactly being respected. With the previous (wrong) code, it logs this every time after the regular cache expires, even after the failsafe duration (20 seconds) expires:
FUSION (O=fc8f91ffbde74096b73e5cc7463d9c9f K=tenant/prix): saving entry in memory MEO[CEXP=20s PR=N S=1] FE[FFS=Y LEXP=3s]
Per the line saving entry in memory MEO[CEXP=20s, maybe it is saving the expired value back on the memory and treating it as valid?

I will double check this and will let you know, but I can already see one thing: you are looking at CEXP=20s, but when fail-safe is enabled what matters (in a way) is what follows, FE[FFS=Y LEXP=3s]. This is saying that the cache entry being re-saved in the cache comes from a fail-safe activation, and that the Logical EXPiration (LEXP) is only 3s, and not 20s. This is because the entry will stay in case for 20s (FailSafeMaxDuration) to not disappear from memory, BUT it will be considered "logically expired" after jsut 3s, so that a new factory execution will be triggered.

Hope this helps, I will double check this asap and will let you know: meanwhile let me know what you think with these updated informations.

Thanks!

jodydonetti commented 1 year ago

Hi @fkuhnert , have you had some time to think about this?

In the next release I'm about to push out tomorrow I think I'll do what I highlighted above, this:

The idea I'm currently playing with is to keep using fail-safe as a way to fallback on stale data, even in a TryGetAsync call, BUT with the important distinction that, since it is not a "set" call, it would only RETURN stale data (again, only if fail-safe is enabled) and not SAVE it, even just temporarily.

So basically I will keep returning the stale value when calling TryGet[Async] with fail-safe enabled, BUT I will stop saving it, since it's not a "set" call.

Can you please let me know what do you think about this approach?

Thanks!

jodydonetti commented 1 year ago

The new v0.19.0 has been released, which includes this 🎉

fkuhnert commented 1 year ago

Hey! Sorry for taking so long to write an answer.

I think this approach on the stale data TryGet looks great. I'll try it out and see if it works on my use case!

Thanks a lot!

jodydonetti commented 1 year ago

Great, thanks!

tmenier commented 1 year ago

I understand this is a closed issue but I just ran into it myself while transitioning from MS's default caching. (Loving the library btw, thank you for your efforts!) FailSafe seems like a cool feature so I enabled it, but the fact that TryGet[Async] returned stale data by default was a surprise.

Have you considered an overload (or optional param) such as TryGet[Async](string key, bool allowStale)? Although I can see value in allowing stale data to be served from this method, I would be inclined to default it false. IMO, it's the least surprising behavior. But of course that makes it a breaking change, so perhaps a consideration for a future major version.

jodydonetti commented 1 year ago

Hi @tmenier , thanks for your words, I appreciate them.

To not use stale data, even in a TryGet call, simply disable fail-safe for that specific call.

Let me know if it worked.

tmenier commented 1 year ago

Yes, I should have mentioned that's what I ultimately did. I was mostly just highlighting the point that serving stale data from that method by default seems unexpected, IMO. Sort of ignores the "Fail" in "FailSafe" 😃. Addressing it more explicitly via a hypothetical allowStale, as opposed to disabling a feature, feels a bit more natural here.

But to your point, I am not blocked - the workaround works perfectly. Thanks!

jodydonetti commented 1 year ago

But to your point, I am not blocked - the workaround works perfectly. Thanks!

That is good to know, thanks to you!

I was mostly just highlighting the point that serving stale data from that method by default seems unexpected, IMO. Sort of ignores the "Fail" in "FailSafe" 😃. Addressing it more explicitly via a hypothetical allowStale, as opposed to disabling a feature, feels a bit more natural here.

Eh, I know... to be honest I'm thinking about this from some time.

Because as you said the "fail" in "fail-safe" implies some sort of failing, which is not the case here. Also yes, sure, as we said there's a way to control it but at the same time it would be nice to enable fail-safe in the default entry options once and just forget about it (meaning: no need to specify it in every call). Problem is, if we then need to do a TryGet/GetOrDefault and don't want a stale value we would need to specify IsFailSafeEnabled = false every time, which is a little bit less nice.

Let me try to explain my rationale about it, so you may let me know what you think ok?

I thought about adding something like a new entry option bool AllowStaleData on the FusionCacheEntryOptions object, but if on one hand it would be then clear that IsFailSafeEnabled would NOT apply to TryGet/GetOrDefault (because there's no "fail" there), on the other hand it would not be so clear if AllowStaleData would apply or not to GetOrSet. Like, someone using FusionCache would wonder how AllowStaleData impacts a GetOrSet call, something along the lines of "if I set it to false would stale data be ignored when calling GetOrSet?".

I don't know if I've been able to explain myself clearly.

But your suggestion here is not to add a new option (which is the thing I was thinking about), but to add an extra param to the TryGet/GetOrDefault methods, so that would only be local to those methods, which can make sense. It would also require either a breaking change in the existing methods (meh) or some new overloads with the added param.

At the same time though I would like to avoid adding an extra single purpose param, because: 1) tomorrow a new need may come up and we'll be all over again, adding a new overload with a new param, etc 2) the single purpose param would not allow us to benefit from the concept of default entry options

So I'm here, stuck while trying to find the best solution that would balance all of these things together 😅

🙋‍♂️ If you can, let me know what you think, that would be helpful.

Thanks!