[BUG] Deadlock caused by cancellation of eager refresh

martindisch commented 3 months ago

We've been using FusionCache in several production services since last year and have really enjoyed its benefits, such as cache stampede prevention. Great job and thanks for this fantastic library!

Since the 1.0.0 update we've been seeing strange behavior with timeouts occurring reliably and precisely after the Duration expired on entries of a cache that was also configured to do eager refreshes. It took me a very long time to pin it down, but I've finally found the root cause and a small repro case.

using Microsoft.AspNetCore.Mvc;
using ZiggyCreatures.Caching.Fusion;

var builder = WebApplication.CreateBuilder(args);
builder.Services
    .AddLogging(c => c.AddSimpleConsole(o => o.TimestampFormat = "[HH:mm:ss] "))
    .AddFusionCache().WithDefaultEntryOptions(new FusionCacheEntryOptions
    {
        Duration = TimeSpan.FromSeconds(10),
        EagerRefreshThreshold = 0.1f,
    });
var app = builder.Build();

app.MapGet("/hello/{name}", async (
    string name,
    [FromServices] IFusionCache fusionCache,
    ILogger<Greeting> logger,
    CancellationToken cancellationToken) =>
{
    return await fusionCache.GetOrSetAsync(
        name,
        factoryToken => GetGreetingAsync(name, logger, factoryToken),
        token: cancellationToken);
});

app.Run();

static async Task<Greeting> GetGreetingAsync(string name, ILogger<Greeting> logger, CancellationToken cancellationToken)
{
    logger.LogInformation("Executing factory for {Key}", name);
    await Task.Delay(TimeSpan.FromMilliseconds(100), cancellationToken);
    logger.LogInformation("Finishing factory for {Key}", name);

    return new Greeting($"Hello, {name}!");
}

public record Greeting(string Text);

The problem starts with what I think is a slight misuse of the cache by us. We're passing the CancellationToken of the endpoint into GetOrSetAsync which makes sense in most cases, but is slightly unfortunate when this cache access leads to an eager refresh. As I found out, the request's CancellationToken will be cancelled once the client closes the connection, which may or may not (in case of persistent connections with keep-alive) happen before the eager refresh has completed in the background.

If the token is cancelled, that's a problem. Since 1.0.0 the ContinueWith call on the factory task is also passed this token. This means if that's cancelled, the continuation won't run and release the memory lock.

Screenshot of code

As a result, subsequent cache accesses think that the eager refresh is still occurring because of the occupied lock and as soon as the stale entry expires and the factory has to run, it will hang trying to acquire the lock.

I created PR #219 to stop passing the token to the continuation and have been able to verify that this fixes that particular issue of a cancellation in the factory being able to lock up the whole cache entry. However I'm not sure

if this introduces additional undesired behavior
if there are any other locations in the code susceptible to the same class of problems

I just don't know the codebase well enough to make that call. So feel free to update the PR accordingly.

jodydonetti commented 3 months ago

Hi @martindisch , first of all wow, thanks for the detailed explanation and repro, really appreciated!

Will look into this during the next few days, since as you pointed out the thing can be quite complex and we should be careful about not introducing different unwanted side effects.

I'm already planning for a new minor version soon, either v1.1 or v1.0.1, so I'd like to get to the bottom of this and include it.

Thanks again, will update soon.

ps: happy you are liking FusionCache!

martindisch commented 3 months ago

Thanks so much, sounds good! We were also wondering if maybe it would be better not to pass the user-supplied external CancellationToken into the factory for eager refreshes in general. For "blocking" cache accesses where the factory has to run in order to return the value (e.g. there was no cached value at all) it makes sense, these I would probably want to cancel if the HTTP request is cancelled. An eager refresh on the other hand I might want to continue regardless of whether the request that triggered it completes, because it's decoupled from it.

But there are probably many more considerations about what behavior is expected here, so it's likely not that simple.

jodydonetti commented 3 months ago

Hi @martindisch , today I've been finally able to get to a point with this.

This is the test I created:

[Fact]
public async Task CanHandleEagerRefreshCancellationAsync()
{
  var duration = TimeSpan.FromSeconds(2);
  var lockTimeout = TimeSpan.FromSeconds(10);
  var eagerRefreshThreshold = 0.1f;
  var eagerRefreshDelay = TimeSpan.FromSeconds(5);

  using var cache = new FusionCache(new FusionCacheOptions(), logger: CreateXUnitLogger<FusionCache>());

  cache.DefaultEntryOptions.Duration = duration;
  cache.DefaultEntryOptions.EagerRefreshThreshold = eagerRefreshThreshold;

  // EXECUTE FACTORY
  var v1 = await cache.GetOrSetAsync<long>("foo", async _ => DateTimeOffset.UtcNow.Ticks);

  // USE CACHED VALUE
  var v2 = await cache.GetOrSetAsync<long>("foo", async _ => DateTimeOffset.UtcNow.Ticks);

  // WAIT FOR EAGER REFRESH THRESHOLD TO BE HIT
  var eagerDuration = TimeSpan.FromMilliseconds(duration.TotalMilliseconds * eagerRefreshThreshold).Add(TimeSpan.FromMilliseconds(10));
  await Task.Delay(eagerDuration);

  // EAGER REFRESH KICKS IN
  var eagerRefreshIsStarted = false;
  var eagerRefreshIsEnded = false;
  using var cts = new CancellationTokenSource();
  var v3 = await cache.GetOrSetAsync<long>(
      "foo",
      async ct =>
      {
          eagerRefreshIsStarted = true;
          await Task.Delay(eagerRefreshDelay);
          ct.ThrowIfCancellationRequested();
          eagerRefreshIsEnded = true;
          return DateTimeOffset.UtcNow.Ticks;
      },
      token: cts.Token
  );

  // ALLOW EAGER REFRESH TO START
  await Task.Delay(TimeSpan.FromMilliseconds(50));

  // CANCEL
  cts.Cancel();

  // WAIT FOR THE BACKGROUND FACTORY (EAGER REFRESH) TO COMPLETE
  await Task.Delay(eagerRefreshDelay.PlusALittleBit());

  // GET THE REFRESHED VALUE
  var sw = Stopwatch.StartNew();
  var v4 = await cache.GetOrSetAsync<long>("foo", async _ => DateTimeOffset.UtcNow.Ticks, options =>
  {
      options.LockTimeout = lockTimeout;
  });
  sw.Stop();

  Assert.Equal(v1, v2);
  Assert.Equal(v2, v3);
  Assert.True(eagerRefreshIsStarted);
  Assert.False(eagerRefreshIsEnded);
  Assert.True(sw.Elapsed < lockTimeout);
  Assert.True(v4 > v3);
}

and it correctly catched the deadlock issue, which I can confirm is there.

By removing the token from the method as in your PR, it did in fact went away.

But, as you also mentioned in the follow up comment:

We were also wondering if maybe it would be better not to pass the user-supplied external CancellationToken into the factory for eager refreshes in general. For "blocking" cache accesses where the factory has to run in order to return the value (e.g. there was no cached value at all) it makes sense, these I would probably want to cancel if the HTTP request is cancelled. An eager refresh on the other hand I might want to continue regardless of whether the request that triggered it completes, because it's decoupled from it.

this is exactly was I was thinking, and not just for the case of eager refresh, but also for background factory completion with soft/hard timeouts, too.

But there are probably many more considerations about what behavior is expected here, so it's likely not that simple.

Exactly.

Currently I'm playing with a version that completely ignore the cancellation token for background factories (meaning both eager refresh + soft timeouts): this though, of course, would mean that there would not be any way at all to cancel such operations. Is this correct? It seems so, but... I'd like a little bit more time to make up my mind, and also see if there are other situations where this may have an impact.

I'll update you soon, meanwhile thanks a lot for spotting this and for the PR, great job!

martindisch commented 3 months ago

Wonderful, thanks for the update and good idea to reproduce it more neatly as a test.

The issue of not being able to cancel background factories at all is indeed worth considering, even though I currently feel it's the better option compared to the risk of people inadvertently having their background factories frequently cancelled when passing the request's CancellationToken. If it's difficult to anticipate the need for cancelling background factories, one approach would be to not support it for now and later retrofit it into the API surface in a non-breaking way by introducing a second CancellationToken for just this purpose, if the feature is requested.

In any case, take your time to think about what you'd prefer. In our organization we currently recommend omitting the CancellationToken for cache accesses to work around the issue, so we can easily wait for the next release.

jodydonetti commented 3 months ago

and later retrofit it into the API surface in a non-breaking way by introducing a second CancellationToken for just this purpose, if the feature is requested.

I was thinking about kinda the same path, but with a difference: in case it would be needed again, I would not add a second CancellationToken but an entry option, to allow the cancellation to flow through background factories.

Will update soon, thanks!

jodydonetti commented 3 months ago

Hi @martindisch , the next v1.1.0 is almost ready, and it will contains the fix for this.

Will update asap.

martindisch commented 3 months ago

Appreciate it, thanks!

jodydonetti commented 3 months ago

Hi, I just release v1.1.0 🥳

ZiggyCreatures / FusionCache

[BUG] Deadlock caused by cancellation of eager refresh #218