> I wonder if it's intended as the config is already correct.
Yes, it's intended. Configuration storage is smart enough to determine that there is no proposed configuration to apply, so `ApplyAsync` completes synchronously.
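As a minimal sketch of that behavior, with hypothetical names (this is not the library source, just the shape of the fast path described above):

```csharp
using System.Threading;
using System.Threading.Tasks;

var storage = new ConfigurationStorageSketch();
var task = storage.ApplyAsync(default);

// With no proposed configuration, the returned task is already completed.
System.Console.WriteLine(task.IsCompletedSuccessfully); // True

sealed class ConfigurationStorageSketch
{
    private object? proposed; // hypothetical slot for a proposed configuration

    public ValueTask ApplyAsync(CancellationToken token)
        => proposed is null
            ? ValueTask.CompletedTask      // nothing to apply: synchronous completion
            : ApplyProposedAsync(token);   // hypothetical asynchronous apply path

    private async ValueTask ApplyProposedAsync(CancellationToken token)
    {
        token.ThrowIfCancellationRequested();
        await Task.Yield(); // stand-in for the real apply work
        proposed = null;
    }
}
```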
> Is it possible there is some odd locking issue with this area of the implementation?
Async locks are highly tested, and I know of production code running async locks with uptime greater than 6 months. Also, I tried to repro this with the following stress test, but with no luck:
```csharp
[Fact]
public static async Task StressTest()
{
    using var l = new AsyncExclusiveLock();
    await Task.WhenAll(
        Task.Run(AcquireReleaseMultipleTimes),
        Task.Run(AcquireReleaseMultipleTimes),
        Task.Run(AcquireReleaseMultipleTimes));

    async Task AcquireRelease()
    {
        // DefaultTimeout is a constant defined elsewhere in the test suite
        var cts = new CancellationTokenSource();
        cts.CancelAfter(DefaultTimeout);
        var lockTaken = false;
        try
        {
            await l.AcquireAsync(cts.Token);
            lockTaken = true;
            await Task.Yield(); // force an asynchronous continuation while holding the lock
        }
        finally
        {
            cts.Dispose();
            if (lockTaken)
                l.Release();
        }
    }

    async Task AcquireReleaseMultipleTimes()
    {
        for (var i = 0; i < 100; i++)
            await AcquireRelease();
    }
}
```
However, it is possible to prove that the issue is caused by `AsyncExclusiveLock`. Every async lock exposes metrics, and you can harvest them using OpenTelemetry (`dotnet-counters` can do the same, but I'm not sure that you're ready to track metrics 24/7 manually). The metrics group is `DotNext.Threading.AsyncLock` with the tag `dotnext.asynclock.type` equal to `AsyncExclusiveLock`, and the metric name is `suspension-duration`. It measures lock suspension duration, in milliseconds.
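As a hedged sketch of one way to harvest that metric in-process, here is a variant using the BCL `MeterListener` rather than a full OpenTelemetry pipeline; the `double` measurement type is an assumption:

```csharp
using System;
using System.Diagnostics.Metrics;

// Minimal in-process listener for the metric described above.
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name is "DotNext.Threading.AsyncLock"
        && instrument.Name is "suspension-duration")
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
{
    // The tags span includes dotnext.asynclock.type, which distinguishes lock types.
    Console.WriteLine($"{instrument.Name}: {value} ms");
});
listener.Start();
```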
I pushed a new version with more precise control over measurement tags to correctly distinguish metrics associated with different instances of `AsyncExclusiveLock` (it is used by the `RaftCluster` and `ClusterConfigurationStorage` implementations separately).
I found a cheap way to avoid calling `ApplyAsync` when the configuration remains unchanged. The change is pushed to the develop branch. However, if there is an issue with `AsyncExclusiveLock`, the improvement will hide it. I still need statistics about `suspension-duration`.
I've done some research about the cancellation. There are two possible cases:

1. `ApplyAsync` accepting an already canceled token
2. `ApplyAsync` accepting a normal token that is canceled in the attempt to obtain the exclusive lock asynchronously

An example of the first case:
```csharp
using IClusterConfigurationStorage storage = new InMemoryClusterConfigurationStorage();

// new(true) creates a CancellationToken that is already canceled
await storage.ApplyAsync(new(true));
```
In this case, the stack trace looks exactly as you pointed out:
```
System.Threading.Tasks.TaskCanceledException : A task was canceled.
   at DotNext.Net.Cluster.Consensus.Raft.Membership.ClusterConfigurationStorage`1.DotNext.Net.Cluster.Consensus.Raft.Membership.IClusterConfigurationStorage.ApplyAsync(CancellationToken token) in ~/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Membership/ClusterConfigurationStorage.cs:line 143
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
```
It is happening because `AsyncExclusiveLock` has an optimization path that checks whether the token is canceled. If it is canceled, there is no need to allocate a new task; `ValueTask.FromCanceled`, which is a cached canceled task, is returned instead. In that case, the exception is thrown inside of the `IClusterConfigurationStorage.ApplyAsync` method instead of `AcquireAsync`, because `AcquireAsync` completes synchronously without throwing any exception.
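A minimal sketch of that optimization, with hypothetical names (this is not the actual `AsyncExclusiveLock` source):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Pre-canceled token: the canceled ValueTask is returned synchronously, so the
// exception surfaces only where the caller awaits it (i.e., inside ApplyAsync).
var result = AcquireAsyncSketch(new CancellationToken(canceled: true));
Console.WriteLine(result.IsCanceled); // True

static ValueTask AcquireAsyncSketch(CancellationToken token)
{
    if (token.IsCancellationRequested)
        return ValueTask.FromCanceled(token); // per the explanation above, a cached canceled task

    return AcquireSlowAsyncSketch(token); // hypothetical suspension path
}

static async ValueTask AcquireSlowAsyncSketch(CancellationToken token)
{
    token.ThrowIfCancellationRequested();
    await Task.Yield(); // stand-in for waiting until the lock becomes available
}
```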
In the second case, the stack trace looks very different:
```
System.OperationCanceledException : The operation was canceled.
   at DotNext.Result`1.Validate() in ~/projects/dotnext/src/DotNext/Result.cs:line 186
   at DotNext.Result`1.get_Value() in ~/projects/dotnext/src/DotNext/Result.cs:line 181
   at DotNext.Threading.Tasks.ValueTaskCompletionSource`1.GetResult(Int16 token) in ~/projects/dotnext/src/DotNext.Threading/Threading/Tasks/ValueTaskCompletionSource.T.cs:line 230
   at DotNext.Threading.Tasks.ValueTaskCompletionSource`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) in ~/dotnext/src/DotNext.Threading/Threading/Tasks/ValueTaskCompletionSource.T.cs:line 237
   at DotNext.Net.Cluster.Consensus.Raft.Membership.ClusterConfigurationStorage`1.DotNext.Net.Cluster.Consensus.Raft.Membership.IClusterConfigurationStorage.ApplyAsync(CancellationToken token) in ~/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Membership/ClusterConfigurationStorage.cs:line 143
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
```
This research demonstrates that the token associated with the timeout is already canceled at the time of calling the `ApplyAsync` method.
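To tie the two traces together, here is a hedged repro sketch of the second case using `AsyncExclusiveLock` directly (the 100 ms cancellation delay is an arbitrary choice):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using DotNext.Threading;

using var l = new AsyncExclusiveLock();
await l.AcquireAsync(CancellationToken.None); // first caller holds the lock

using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(100));
try
{
    // The token is valid here, but is canceled while this call is suspended
    // waiting for the lock, producing the second stack trace shape.
    await l.AcquireAsync(cts.Token);
}
catch (OperationCanceledException)
{
    Console.WriteLine("canceled while waiting for the lock");
}
finally
{
    l.Release(); // release the lock taken by the first acquisition
}
```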
Thanks for looking into it. I thought the stack trace was certainly weird, but it did not occur to me why that could be.
By the way, I was trying to collect the data, but early attempts did not reproduce the issue again, and then other stuff got in the way.
When I get back to this I will try to find some better info on what's going on, as AppendEntries in that environment should not have been hitting the large timeouts in any case. It might even be related to an issue we saw in a different cluster with smaller timeouts, where there was a 15-second window in which all AppendEntries retries for a specific node kept timing out (and then resumed normal operation after that).
It could be an I/O issue (e.g., network delays) causing token cancellation inside of AppendEntries.
@sakno we were certainly hit by an issue with EEE (Energy-Efficient Ethernet, https://github.com/raspberrypi/linux/issues/4289) on Raspberry Pi in some clusters, and that certainly applies to the cluster where the 15-second issue happened.
Anyway, we were already planning to gather more stats, including network stats, to be able to better understand these issues. It seems that is the right direction.
You can also measure `broadcast-time` over a long period of time to ensure that the leader has enough time to send heartbeats.
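Under the same assumptions as the earlier `MeterListener` sketch, one way to do that in-process is to match on the instrument name only, since I'm not certain which meter publishes `broadcast-time`:

```csharp
using System;
using System.Diagnostics.Metrics;

// Listen for broadcast-time regardless of which meter publishes it;
// the double measurement type is an assumption.
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Name is "broadcast-time")
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
    Console.WriteLine($"broadcast-time: {value}"));
listener.Start();
```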
Turning on Debug-level logging shows that applying a configuration happens all the time even though there are no changes.
Here is how it looks:

![image](https://github.com/dotnet/dotNext/assets/70888589/356e150a-7533-457c-a1bd-aef026abd02b)

As seen above, the fingerprint is always the same (as there are no changes) and apply config is true. That should always hit the first case below, so I wonder if it's intended, as the config is already correct.
The occasional timeouts are why I found the above. A test cluster has hit 3 timeouts in 8 days (so it is not fast to reproduce). The warning-level logging gives a single entry when the timeout occurs:
Since I am running with a fixed member configuration loaded at start, it is extra odd to see timeouts applying configuration during append entries, as the configuration is always the same and there should be nothing to apply, right?
The other thing that I find odd is that I thought `InMemoryClusterConfigurationStorage` only holds the config in memory, so how/why would it take long enough in `ApplyAsync` to trigger a timeout? Is it possible there is some odd locking issue with this area of the implementation, perhaps related to any events raised while the lock is taken in there?