dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License

very frequent applying configuration + occasional timeout #180

Closed freddyrios closed 11 months ago

freddyrios commented 1 year ago

Turning on Debug level logging shows that a configuration is applied all the time even though there are no changes.

Here is how it looks: [screenshot of the log output]

As seen above, the fingerprint is always the same (as there are no changes) and applyConfig is true. That should always hit the first case below, and I wonder if that is intended, since the config is already correct.

var fingerprint = (ConfigurationStorage.ProposedConfiguration ?? ConfigurationStorage.ActiveConfiguration).Fingerprint;
Logger.IncomingConfiguration(fingerprint, config.Fingerprint, applyConfig);
switch ((config.Fingerprint == fingerprint, applyConfig))
{
    case (true, true):
        await ConfigurationStorage.ApplyAsync(token).ConfigureAwait(false);
        break;
    case (true, false):
        break;
    case (false, false):
        await ConfigurationStorage.ProposeAsync(config).ConfigureAwait(false);
        break;
    case (false, true):
        result = result with { Value = false };
        break;
}

The occasional timeouts are why I found the above. A test cluster has hit 3 timeouts in 8 days (so it is not fast to reproduce). Warning-level logging gives a single entry when the timeout occurs:

2023-06-20 00:21:57.7202|WARN|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Timeout occurred while processing request from 192.168.100.11:40896|System.Threading.Tasks.TaskCanceledException: A task was canceled.                                                                                                                                                                                                                
at DotNext.Net.Cluster.Consensus.Raft.Membership.ClusterConfigurationStorage`1.DotNext.Net.Cluster.Consensus.Raft.Membership.IClusterConfigurationStorage.ApplyAsync(CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Membership/ClusterConfigurationStorage.cs:line 144                                                                                                                 
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)                                                 
at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 629                                              
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)                                        
at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.Server.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/TransportServices/ConnectionOriented/Server.cs:line 121                                                                                                                                                   
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)                                                 
at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 135         

Since I am running with a fixed member configuration loaded at start, it is extra odd to see timeouts applying the configuration during AppendEntries, as the configuration is always the same and there should be nothing to apply, right?

The other thing I find odd is that I thought InMemoryClusterConfigurationStorage only holds the config in memory, so how/why would ApplyAsync take long enough to trigger a timeout? Is it possible there is some odd locking issue in this area of the implementation, perhaps related to events raised while the lock is held?

sakno commented 1 year ago

I wonder if that is intended, since the config is already correct.

Yes, it's intended. The configuration storage is smart enough to determine that there is no proposed configuration to apply, so ApplyAsync completes synchronously.
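
A minimal sketch of that early exit (an assumption based on the description above, not the actual library code): when nothing has been proposed, there is nothing to apply, so the method can complete synchronously without ever touching the async lock.

using System.Threading;
using System.Threading.Tasks;

// Hypothetical storage sketch; the type and member names are made up for illustration.
sealed class ConfigurationStorageSketch
{
    // Mirrors the ProposedConfiguration property seen in the snippet above; null while nothing has been proposed.
    public object? ProposedConfiguration { get; set; }

    public ValueTask ApplyAsync(CancellationToken token)
        => ProposedConfiguration is null
            ? ValueTask.CompletedTask      // nothing proposed => completes synchronously
            : ApplyUnderLockAsync(token);  // otherwise acquire the async lock and apply

    private async ValueTask ApplyUnderLockAsync(CancellationToken token)
    {
        // Placeholder for the real work: acquire the async lock, apply the
        // proposed configuration, raise membership events, release the lock.
        await Task.Yield();
    }
}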

sakno commented 1 year ago

Is it possible there is some odd locking issue in this area of the implementation

Async locks are highly tested, and I know of production code running async locks with an uptime greater than 6 months. Also, I tried to reproduce this with the following stress test, but with no luck:

[Fact]
public static async Task StressTest()
{
    using var l = new AsyncExclusiveLock();

    await Task.WhenAll(
        Task.Run(AcquireReleaseMultipleTimes),
        Task.Run(AcquireReleaseMultipleTimes),
        Task.Run(AcquireReleaseMultipleTimes));

    async Task AcquireRelease()
    {
        var cts = new CancellationTokenSource();
        cts.CancelAfter(DefaultTimeout);

        var lockTaken = false;
        try
        {
            await l.AcquireAsync(cts.Token);
            lockTaken = true;

            await Task.Yield();
        }
        finally
        {
            cts.Dispose();

            if (lockTaken)
                l.Release();
        }
    }

    async Task AcquireReleaseMultipleTimes()
    {
        for (var i = 0; i < 100; i++)
            await AcquireRelease();
    }
}

However, it is possible to prove whether the issue is caused by AsyncExclusiveLock. Every async lock exposes metrics; you can harvest them using OpenTelemetry (dotnet-counters can do the same, but I'm not sure that you're ready to track metrics 24/7 manually). The metrics group is DotNext.Threading.AsyncLock with the tag dotnext.asynclock.type equal to AsyncExclusiveLock; the metric name is suspension-duration. It measures lock duration, in milliseconds.
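
If you want to collect that metric in-process without a full OpenTelemetry pipeline, a MeterListener from System.Diagnostics.Metrics can be pointed at the names mentioned above. The meter, instrument, and tag names below are taken from this comment; the double measurement type is an assumption.

using System;
using System.Diagnostics.Metrics;

// Minimal in-process listener for the suspension-duration metric described above.
using var listener = new MeterListener();

listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "DotNext.Threading.AsyncLock"
        && instrument.Name == "suspension-duration")
        l.EnableMeasurementEvents(instrument);
};

// Assumes the histogram records double values.
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
{
    foreach (var tag in tags)
    {
        if (tag.Key == "dotnext.asynclock.type" && Equals(tag.Value, "AsyncExclusiveLock"))
        {
            Console.WriteLine($"{instrument.Name}: {value} ms");
            break;
        }
    }
});

listener.Start();
// ... keep the process running; measurements are reported via the callback.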

sakno commented 1 year ago

I pushed a new version with more precise control over measurement tags to correctly distinguish metrics associated with different instances of AsyncExclusiveLock (it is used by RaftCluster and ClusterConfigurationStorage implementations separately).

sakno commented 1 year ago

I found a cheap way to avoid calling ApplyAsync when the configuration remains unchanged. The change is pushed to the develop branch. However, if there is an issue with AsyncExclusiveLock, the improvement will hide it. I still need statistics about suspension-duration.

sakno commented 1 year ago

I've done some research about the cancellation. There are two possible cases: either the token is already canceled when ApplyAsync is called, or cancellation happens while waiting for the lock inside ApplyAsync.

An example of the first case:

using IClusterConfigurationStorage storage = new InMemoryClusterConfigurationStorage();
await storage.ApplyAsync(new(true)); // new(true) is a CancellationToken that is already canceled

In this case, the stack trace looks exactly as you pointed out:

System.Threading.Tasks.TaskCanceledException : A task was canceled.
   at DotNext.Net.Cluster.Consensus.Raft.Membership.ClusterConfigurationStorage`1.DotNext.Net.Cluster.Consensus.Raft.Membership.IClusterConfigurationStorage.ApplyAsync(CancellationToken token) in ~/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Membership/ClusterConfigurationStorage.cs:line 143
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)

It is happening because AsyncExclusiveLock has an optimization path that checks whether the token is already canceled. If it is, there is no need to allocate a new task: ValueTask.FromCanceled is returned, which is a cached canceled task. In that case, the exception is thrown inside the IClusterConfigurationStorage.ApplyAsync method instead of AcquireAsync, because AcquireAsync completes synchronously without throwing any exception.
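
A minimal sketch of that fast path (illustration only, not the library code): because the canceled ValueTask is returned synchronously, the cancellation is only observed at the await inside the caller, which is why ApplyAsync is the top frame of the trace instead of AcquireAsync.

using System.Threading;
using System.Threading.Tasks;

// Illustration only; the method names mirror the discussion above, but the bodies are made up.
static class CancellationFastPathSketch
{
    // A token that is already canceled takes the synchronous path:
    // a canceled ValueTask is returned, and no exception is thrown here.
    public static ValueTask AcquireAsync(CancellationToken token)
        => token.IsCancellationRequested
            ? ValueTask.FromCanceled(token)
            : AcquireSlowAsync(token);

    public static async ValueTask ApplyAsync(CancellationToken token)
    {
        // The TaskCanceledException surfaces at this await, i.e. in ApplyAsync's frame,
        // because AcquireAsync itself completed synchronously without throwing.
        await AcquireAsync(token);
    }

    private static async ValueTask AcquireSlowAsync(CancellationToken token)
        => await Task.Yield(); // placeholder for the real lock suspension logic
}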

In the second case, the stack trace looks very different:

System.OperationCanceledException : The operation was canceled.
   at DotNext.Result`1.Validate() in ~/projects/dotnext/src/DotNext/Result.cs:line 186
   at DotNext.Result`1.get_Value() in ~/projects/dotnext/src/DotNext/Result.cs:line 181
   at DotNext.Threading.Tasks.ValueTaskCompletionSource`1.GetResult(Int16 token) in ~/projects/dotnext/src/DotNext.Threading/Threading/Tasks/ValueTaskCompletionSource.T.cs:line 230
   at DotNext.Threading.Tasks.ValueTaskCompletionSource`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) in ~/dotnext/src/DotNext.Threading/Threading/Tasks/ValueTaskCompletionSource.T.cs:line 237
   at DotNext.Net.Cluster.Consensus.Raft.Membership.ClusterConfigurationStorage`1.DotNext.Net.Cluster.Consensus.Raft.Membership.IClusterConfigurationStorage.ApplyAsync(CancellationToken token) in ~/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Membership/ClusterConfigurationStorage.cs:line 143
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)

This research demonstrates that the token associated with the timeout is already canceled at the time the ApplyAsync method is called.

freddyrios commented 11 months ago

Thanks for looking into it. I thought the stack trace was certainly weird, but it did not occur to me why that could be.

By the way, I was trying to collect the data, but early attempts did not reproduce the issue, and then other stuff got in the way.

When I get back to this I will try to find some better info on what's going on, as AppendEntries in that environment should not have been hitting the large timeouts in any case. It might even be related to an issue we saw in a different cluster with smaller timeouts, where there was a 15-second window in which all AppendEntries retries for a specific node kept timing out (and then normal operation resumed after that).

sakno commented 11 months ago

It could be an I/O issue (e.g. network delays) causing token cancellation inside of AppendEntries.

freddyrios commented 11 months ago

@sakno we were certainly hit by an issue with EEE (Energy-Efficient Ethernet) on Raspberry Pi in some clusters (https://github.com/raspberrypi/linux/issues/4289), and that certainly applies to the cluster where the 15-second issue happened.

Anyway, we were already planning to gather more stats, including network stats, to better understand these issues. It seems that is the right direction.

sakno commented 11 months ago

You can also measure broadcast-time over a long period of time to ensure that the leader has enough time to send heartbeats.