Closed Liversage closed 7 years ago
To verify my findings I have created a small test that reproduces the problem. To run this test create a new Unit Test Project in Visual Studio (I did this using .NET Framework 4.5.1). Then add the following NuGet packages to the project:
Finally replace the contents of the UnitTest1.cs
file with the following code:
using System;
using System.Threading.Tasks;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Orleans;
using Orleans.Runtime;
using Orleans.Runtime.Configuration;
using Orleans.TestingHost;
namespace UnitTestProject1
{
[TestClass]
public class UnitTest1 : TestCluster
{
public UnitTest1() : base(CreateTestClusterOptions())
{
}
[TestMethod]
public async Task TestMethod1()
{
Deploy();
var testGrain = GrainFactory.GetGrain<ITestGrain>(Guid.Empty);
try
{
while (true)
await testGrain.GrainMethodAsync();
}
catch (OrleansException ex)
{
StopAllSilos();
Assert.Fail(ex.Message);
}
}
private static TestClusterOptions CreateTestClusterOptions()
{
var options = new TestClusterOptions();
options.ClusterConfiguration.AddMemoryStorageProvider("Default");
return options;
}
}
public interface ITestGrain : IGrainWithGuidKey
{
Task GrainMethodAsync();
}
[Serializable]
public class TestState
{
public long Version { get; set; }
}
public class TestGrain : Grain<TestState>, ITestGrain
{
private IDisposable timer;
public override Task OnActivateAsync()
{
var oneTick = TimeSpan.FromTicks(1);
timer = RegisterTimer(TimerCallbackAsync, null, oneTick, oneTick);
return base.OnActivateAsync();
}
public Task GrainMethodAsync()
{
State.Version += 1;
return WriteStateAsync();
}
private Task TimerCallbackAsync(object state)
{
State.Version += 1;
return WriteStateAsync();
}
}
}
On my machine the test fails when it is executed using the test runner with an ETag mismatch exception being thrown. Because it is about a race condition the time for the test to fail varies.
What I believe is happening is that GrainMethodAsync executes. While awaiting WriteStateAsync the timer fires and TimerCallbackAsync starts executing like described in the quote above eventually calling WriteStateAsync. Both calls will use the same ETag but the first call to complete will change the stored ETag to a new value. The second call will then fail with an ETag mismatch.
I think this is a correct description of the current behavior with timer callbacks executing in the reentrant manner. I can't find an open issue for making them non-reentrant. Looks like we need one.
Is this a problem specific to MemoryStorage or should I always avoid writing state in a timer callback?
This is a general problem. The workaround it to make a grain call (to itself) from the timer callback, so that the logic will execute in the non-reentrant way.
I opened https://github.com/dotnet/orleans/issues/2574 for the timer callback reentrancy issue.
If it helps anyone, I worked around the issue by using the cumulative write approach from the Chirper sample. Below is that original code lifted onto a WriteReentrantStateAsync()
extension method for the IPersistentState<T>
class, that caters for reentrancy. This has the benefit of also catering for storage system back-pressure, at the expense of opening a window that allows use of unpersisted state.
Use of ConditionalWeakTable
is for lazy convenience. Best to use a member property in the grain if possible.
public static class PersistentStateExtensions
{
private static readonly ConditionalWeakTable<object, OutstandingTaskHolder> OutstandingWriteOperations =
new ConditionalWeakTable<object, OutstandingTaskHolder>();
private class OutstandingTaskHolder
{
public Task Task { get; set; }
}
public static async Task WriteReentrantStateAsync<T>(this IPersistentState<T> state) where T : new()
{
var outstanding = OutstandingWriteOperations.GetOrCreateValue(state);
// When reentrant grain is doing WriteStateAsync, etag violations are possible due to concurrent writes.
// The solution is to serialize and batch writes, and make sure only a single write is outstanding at any moment in time.
var current = outstanding.Task;
if (current != null)
{
try
{
// await the outstanding write, but ignore it since it doesn't include our changes
await current;
}
catch
{
// Ignore all errors from this in-flight write operation, since the original caller(s) of it will observe it.
}
finally
{
if (outstanding.Task == current)
{
// only null out the outstanding operation if it's the same one as the one we awaited, otherwise
// another request might have already done so.
outstanding.Task = null;
}
}
}
if (outstanding.Task == null)
{
// If after the initial write is completed, no other request initiated a new write operation, do it now.
outstanding.Task = current = state.WriteStateAsync();
}
else
{
// If there were many requests enqueued to persist state, there is no reason to enqueue a new write
// operation for each, since any write (after the initial one that we already awaited) will have cumulative
// changes including the one requested by our caller. Just await the new outstanding write.
current = outstanding.Task;
}
try
{
await current;
}
finally
{
if (outstanding.Task == current)
{
// only null out the outstanding operation if it's the same one as the one we awaited, otherwise
// another request might have already done so.
outstanding.Task = null;
}
}
}
}
I'm having some problems with
InconsistentStateException
being thrown fromMemoryStorage
(I'm just testing). I believe I have narrowed the problem down to writing state from a timer callback. According to the documentation a timer callback is invoked in a special way:Here is conceptually my code:
What I believe is happening is that
GrainMethodAsync
executes. While awaitingWriteStateAsync
the timer fires andTimerCallbackAsync
starts executing like described in the quote above eventually callingWriteStateAsync
. Both calls will use the sameETag
but the first call to complete will change the storedETag
to a new value. The second call will then fail with anETag
mismatch.Is this a problem specific to
MemoryStorage
or should I always avoid writing state in a timer callback? That a timer callback should be considered as if it executes in a reentrant grain sort of hints at there is a problem with state in the callback. However, if state cannot be persisted in a timer callback it makes timers much less useful.