Closed — martindisch closed this issue 3 months ago
Hi @martindisch , first of all wow, thanks for the detailed explanation and repro, really appreciated!
I'll look into this during the next few days since, as you pointed out, this can be quite complex and we should be careful not to introduce unwanted side effects.
I'm already planning for a new minor version soon, either v1.1 or v1.0.1, so I'd like to get to the bottom of this and include it.
Thanks again, will update soon.
ps: happy you are liking FusionCache!
Thanks so much, sounds good! We were also wondering if maybe it would be better not to pass the user-supplied external CancellationToken into the factory for eager refreshes in general. For "blocking" cache accesses where the factory has to run in order to return the value (e.g. there was no cached value at all) it makes sense: these I would probably want to cancel if the HTTP request is cancelled. An eager refresh on the other hand I might want to continue regardless of whether the request that triggered it completes, because it's decoupled from it.
But there are probably many more considerations about what behavior is expected here, so it's likely not that simple.
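The distinction could be sketched like this (all the names here are purely illustrative, not from our actual code):

```csharp
// Blocking access: there is no cached value yet, so the factory must run
// before anything can be returned. Cancelling it along with the HTTP request
// makes sense here.
var product = await cache.GetOrSetAsync<Product>(
    "product:42",
    ct => LoadProductAsync(42, ct),       // hypothetical factory
    token: httpContext.RequestAborted);   // tying it to the request is fine

// Eager refresh: a still-valid value is returned immediately and the factory
// runs in the background. Arguably that background run should not be tied to
// the lifetime of whichever request happened to trigger it.
```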
Hi @martindisch , today I've finally been able to get somewhere with this.
This is the test I created:
```csharp
[Fact]
public async Task CanHandleEagerRefreshCancellationAsync()
{
    var duration = TimeSpan.FromSeconds(2);
    var lockTimeout = TimeSpan.FromSeconds(10);
    var eagerRefreshThreshold = 0.1f;
    var eagerRefreshDelay = TimeSpan.FromSeconds(5);

    using var cache = new FusionCache(new FusionCacheOptions(), logger: CreateXUnitLogger<FusionCache>());

    cache.DefaultEntryOptions.Duration = duration;
    cache.DefaultEntryOptions.EagerRefreshThreshold = eagerRefreshThreshold;

    // EXECUTE FACTORY
    var v1 = await cache.GetOrSetAsync<long>("foo", async _ => DateTimeOffset.UtcNow.Ticks);

    // USE CACHED VALUE
    var v2 = await cache.GetOrSetAsync<long>("foo", async _ => DateTimeOffset.UtcNow.Ticks);

    // WAIT FOR EAGER REFRESH THRESHOLD TO BE HIT
    var eagerDuration = TimeSpan.FromMilliseconds(duration.TotalMilliseconds * eagerRefreshThreshold).Add(TimeSpan.FromMilliseconds(10));
    await Task.Delay(eagerDuration);

    // EAGER REFRESH KICKS IN
    var eagerRefreshIsStarted = false;
    var eagerRefreshIsEnded = false;
    using var cts = new CancellationTokenSource();
    var v3 = await cache.GetOrSetAsync<long>(
        "foo",
        async ct =>
        {
            eagerRefreshIsStarted = true;
            await Task.Delay(eagerRefreshDelay);
            ct.ThrowIfCancellationRequested();
            eagerRefreshIsEnded = true;
            return DateTimeOffset.UtcNow.Ticks;
        },
        token: cts.Token
    );

    // ALLOW EAGER REFRESH TO START
    await Task.Delay(TimeSpan.FromMilliseconds(50));

    // CANCEL
    cts.Cancel();

    // WAIT FOR THE BACKGROUND FACTORY (EAGER REFRESH) TO COMPLETE
    await Task.Delay(eagerRefreshDelay.PlusALittleBit());

    // GET THE REFRESHED VALUE
    var sw = Stopwatch.StartNew();
    var v4 = await cache.GetOrSetAsync<long>("foo", async _ => DateTimeOffset.UtcNow.Ticks, options =>
    {
        options.LockTimeout = lockTimeout;
    });
    sw.Stop();

    Assert.Equal(v1, v2);
    Assert.Equal(v2, v3);
    Assert.True(eagerRefreshIsStarted);
    Assert.False(eagerRefreshIsEnded);
    Assert.True(sw.Elapsed < lockTimeout);
    Assert.True(v4 > v3);
}
```
and it correctly caught the deadlock issue, which I can confirm is there.
By removing the token from the method as in your PR, it did in fact go away.
But, as you also mentioned in the follow up comment:
> We were also wondering if maybe it would be better not to pass the user-supplied external CancellationToken into the factory for eager refreshes in general. For "blocking" cache accesses where the factory has to run in order to return the value (e.g. there was no cached value at all) it makes sense, these I would probably want to cancel if the HTTP request is cancelled. An eager refresh on the other hand I might want to continue regardless of whether the request that triggered it completes, because it's decoupled from it.
this is exactly what I was thinking, and not just for the case of eager refresh, but also for background factory completion with soft/hard timeouts.
> But there are probably many more considerations about what behavior is expected here, so it's likely not that simple.
Exactly.
Currently I'm playing with a version that completely ignores the cancellation token for background factories (meaning both eager refresh + soft timeouts): this though, of course, would mean that there would be no way at all to cancel such operations. Is this correct? It seems so, but... I'd like a little bit more time to make up my mind, and also to see if there are other situations where this may have an impact.
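Conceptually (this is just a sketch, not the actual FusionCache internals), that would mean something like:

```csharp
// Sketch: run a background factory (an eager refresh, or a factory that
// outlived its soft timeout) detached from the caller's token, so that a
// client disconnect can never cancel it.
Task RunBackgroundFactoryAsync(Func<CancellationToken, Task> factory)
{
    // CancellationToken.None instead of the caller's token: the background
    // work always runs to completion (or failure), but can't be cancelled.
    return Task.Run(() => factory(CancellationToken.None));
}
```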
I'll update you soon, meanwhile thanks a lot for spotting this and for the PR, great job!
Wonderful, thanks for the update and good idea to reproduce it more neatly as a test.
The issue of not being able to cancel background factories at all is indeed worth considering, even though I currently feel it's the better option compared to the risk of people inadvertently having their background factories frequently cancelled when passing the request's CancellationToken. If it's difficult to anticipate the need for cancelling background factories, one approach would be to not support it for now and later retrofit it into the API surface in a non-breaking way by introducing a second CancellationToken for just this purpose, if the feature is requested.
In any case, take your time to think about what you'd prefer. In our organization we currently recommend omitting the CancellationToken for cache accesses to work around the issue, so we can easily wait for the next release.
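Concretely, the workaround just means leaving out the `token:` argument (a sketch; `LoadValueAsync` is a hypothetical factory):

```csharp
// Workaround: don't flow the request's CancellationToken into GetOrSetAsync,
// so a client disconnect can never cancel the eager refresh and strand the lock.
var value = await cache.GetOrSetAsync<string>(
    "some-key",
    async _ => await LoadValueAsync()   // hypothetical factory
    // no `token:` argument here on purpose
);
```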
> and later retrofit it into the API surface in a non-breaking way by introducing a second CancellationToken for just this purpose, if the feature is requested.
I was thinking about kinda the same path, but with a difference: in case it would be needed again, I would not add a second CancellationToken, but an entry option to allow the cancellation to flow through background factories.
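Just to illustrate the idea (the option name below is made up, it's not part of the current API):

```csharp
var value = await cache.GetOrSetAsync<long>(
    "foo",
    async ct => await ComputeValueAsync(ct),   // hypothetical factory
    options =>
    {
        // Hypothetical entry option: opt in to letting the caller's token
        // also cancel background factories (eager refresh, soft timeouts).
        options.AllowBackgroundFactoryCancellation = true;
    },
    token: requestToken
);
```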
Will update soon, thanks!
Hi @martindisch , the next v1.1.0 is almost ready, and it will contain the fix for this.
Will update asap.
Appreciate it, thanks!
Hi, I just released v1.1.0 🥳
We've been using FusionCache in several production services since last year and have really enjoyed its benefits, such as cache stampede prevention. Great job and thanks for this fantastic library!
Since the 1.0.0 update we've been seeing strange behavior, with timeouts occurring reliably and precisely after the Duration expired on entries of a cache that was also configured to do eager refreshes. It took me a very long time to pin it down, but I've finally found the root cause and a small repro case.

The problem starts with what I think is a slight misuse of the cache by us. We're passing the CancellationToken of the endpoint into GetOrSetAsync, which makes sense in most cases, but is slightly unfortunate when this cache access leads to an eager refresh. As I found out, the request's CancellationToken will be cancelled once the client closes the connection, which may or may not (in case of persistent connections with keep-alive) happen before the eager refresh has completed in the background.

If the token is cancelled, that's a problem. Since 1.0.0 the ContinueWith call on the factory task is also passed this token. This means that if it's cancelled, the continuation won't run and release the memory lock. As a result, subsequent cache accesses think that the eager refresh is still occurring because of the occupied lock, and as soon as the stale entry expires and the factory has to run, it will hang trying to acquire the lock.
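The TPL behavior behind this can be shown in isolation (a standalone illustration, not FusionCache's actual code): when the token passed to ContinueWith is cancelled before the antecedent completes, the continuation transitions to Canceled and its body never runs.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ContinueWithCancellationDemo
{
    static async Task Main()
    {
        using var cts = new CancellationTokenSource();
        var lockReleased = false;

        var factory = Task.Delay(100); // stands in for the background factory

        // The continuation is meant to "release the lock" once the factory finishes...
        var continuation = factory.ContinueWith(_ => lockReleased = true, cts.Token);

        cts.Cancel(); // ...but the token is cancelled first
        await factory;

        try { await continuation; }
        catch (TaskCanceledException) { }

        // The continuation body never ran, so the "lock" was never released.
        Console.WriteLine($"lockReleased = {lockReleased}"); // lockReleased = False
    }
}
```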
I created PR #219 to stop passing the token to the continuation and have been able to verify that this fixes that particular issue of a cancellation in the factory being able to lock up the whole cache entry. However, I'm not sure whether this is the best approach; I just don't know the codebase well enough to make that call. So feel free to update the PR accordingly.