Inconsistent state problems writing state in a timer callback

I'm having some problems with InconsistentStateException being thrown from MemoryStorage (I'm just testing). I believe I have narrowed the problem down to writing state from a timer callback. According to the documentation a timer callback is invoked in a special way:

Note however, asyncCallback invocations are not delivered as messages and are thus not subject to message interleaving semantics. This means that invocations of asyncCallback should be considered to behave as if running on a reentrant grain with respect to other messages to that grain.

Here is conceptually my code:

public Task GrainMethodAsync()
{
    // ...
    await WriteStateAsync();
    // ...
}

private Task TimerCallbackAsync(object state)
{
    // ...
    await WriteStateAsync();
    // ...
}

What I believe is happening is that GrainMethodAsync executes. While awaiting WriteStateAsync the timer fires and TimerCallbackAsync starts executing like described in the quote above eventually calling WriteStateAsync. Both calls will use the same ETag but the first call to complete will change the stored ETag to a new value. The second call will then fail with an ETag mismatch.

Is this a problem specific to MemoryStorage or should I always avoid writing state in a timer callback? That a timer callback should be considered as if it executes in a reentrant grain sort of hints at there is a problem with state in the callback. However, if state cannot be persisted in a timer callback it makes timers much less useful.

To verify my findings I have created a small test that reproduces the problem. To run this test create a new Unit Test Project in Visual Studio (I did this using .NET Framework 4.5.1). Then add the following NuGet packages to the project:

Microsoft.Orleans.TestingHost
Microsoft.Orleans.OrleansCodeGenerator
Microsoft.Orleans.OrleansTelemetryConsumers.Counters

Finally replace the contents of the UnitTest1.cs file with the following code:

using System;
using System.Threading.Tasks;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Orleans;
using Orleans.Runtime;
using Orleans.Runtime.Configuration;
using Orleans.TestingHost;

namespace UnitTestProject1
{
    [TestClass]
    public class UnitTest1 : TestCluster
    {
        public UnitTest1() : base(CreateTestClusterOptions())
        {
        }

        [TestMethod]
        public async Task TestMethod1()
        {
            Deploy();
            var testGrain = GrainFactory.GetGrain<ITestGrain>(Guid.Empty);
            try
            {
                while (true)
                    await testGrain.GrainMethodAsync();
            }
            catch (OrleansException ex)
            {
                StopAllSilos();
                Assert.Fail(ex.Message);
            }
        }

        private static TestClusterOptions CreateTestClusterOptions()
        {
            var options = new TestClusterOptions();
            options.ClusterConfiguration.AddMemoryStorageProvider("Default");
            return options;
        }
    }

    public interface ITestGrain : IGrainWithGuidKey
    {
        Task GrainMethodAsync();
    }

    [Serializable]
    public class TestState
    {
        public long Version { get; set; }
    }

    public class TestGrain : Grain<TestState>, ITestGrain
    {
        private IDisposable timer;

        public override Task OnActivateAsync()
        {
            var oneTick = TimeSpan.FromTicks(1);
            timer = RegisterTimer(TimerCallbackAsync, null, oneTick, oneTick);
            return base.OnActivateAsync();
        }

        public Task GrainMethodAsync()
        {
            State.Version += 1;
            return WriteStateAsync();
        }

        private Task TimerCallbackAsync(object state)
        {
            State.Version += 1;
            return WriteStateAsync();
        }
    }
}

On my machine the test fails when it is executed using the test runner with an ETag mismatch exception being thrown. Because it is about a race condition the time for the test to fail varies.

What I believe is happening is that GrainMethodAsync executes. While awaiting WriteStateAsync the timer fires and TimerCallbackAsync starts executing like described in the quote above eventually calling WriteStateAsync. Both calls will use the same ETag but the first call to complete will change the stored ETag to a new value. The second call will then fail with an ETag mismatch.

I think this is a correct description of the current behavior with timer callbacks executing in the reentrant manner. I can't find an open issue for making them non-reentrant. Looks like we need one.

Is this a problem specific to MemoryStorage or should I always avoid writing state in a timer callback?

This is a general problem. The workaround it to make a grain call (to itself) from the timer callback, so that the logic will execute in the non-reentrant way.

I opened https://github.com/dotnet/orleans/issues/2574 for the timer callback reentrancy issue.

If it helps anyone, I worked around the issue by using the cumulative write approach from the Chirper sample. Below is that original code lifted onto a WriteReentrantStateAsync() extension method for the IPersistentState<T> class, that caters for reentrancy. This has the benefit of also catering for storage system back-pressure, at the expense of opening a window that allows use of unpersisted state.

Use of ConditionalWeakTable is for lazy convenience. Best to use a member property in the grain if possible.

public static class PersistentStateExtensions
{
    private static readonly ConditionalWeakTable<object, OutstandingTaskHolder> OutstandingWriteOperations =
        new ConditionalWeakTable<object, OutstandingTaskHolder>();

    private class OutstandingTaskHolder
    {
        public Task Task { get; set; }
    }

    public static async Task WriteReentrantStateAsync<T>(this IPersistentState<T> state) where T : new()
    {
        var outstanding = OutstandingWriteOperations.GetOrCreateValue(state);

        // When reentrant grain is doing WriteStateAsync, etag violations are possible due to concurrent writes.
        // The solution is to serialize and batch writes, and make sure only a single write is outstanding at any moment in time.
        var current = outstanding.Task;
        if (current != null)
        {
            try
            {
                // await the outstanding write, but ignore it since it doesn't include our changes
                await current;
            }
            catch
            {
                // Ignore all errors from this in-flight write operation, since the original caller(s) of it will observe it.
            }
            finally
            {
                if (outstanding.Task == current)
                {
                    // only null out the outstanding operation if it's the same one as the one we awaited, otherwise
                    // another request might have already done so.
                    outstanding.Task = null;
                }
            }
        }

        if (outstanding.Task == null)
        {
            // If after the initial write is completed, no other request initiated a new write operation, do it now.
            outstanding.Task = current = state.WriteStateAsync();
        }
        else
        {
            // If there were many requests enqueued to persist state, there is no reason to enqueue a new write 
            // operation for each, since any write (after the initial one that we already awaited) will have cumulative
            // changes including the one requested by our caller. Just await the new outstanding write.
            current = outstanding.Task;
        }

        try
        {
            await current;
        }
        finally
        {
            if (outstanding.Task == current)
            {
                // only null out the outstanding operation if it's the same one as the one we awaited, otherwise
                // another request might have already done so.
                outstanding.Task = null;
            }
        }
    }
}

dotnet / orleans

Inconsistent state problems writing state in a timer callback #2565