dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License

very frequent applying configuration + occasional timeout #180

Closed freddyrios closed 11 months ago

freddyrios commented 1 year ago

Turning on Debug level logging shows that a configuration is applied all the time even though there are no changes.

Here is how it looks: [screenshot of the log output]

As seen above, the fingerprint is always the same (as there are no changes) and applyConfig is true. That should always hit the first case below, and I wonder if that is intended, since the config is already correct.

var fingerprint = (ConfigurationStorage.ProposedConfiguration ?? ConfigurationStorage.ActiveConfiguration).Fingerprint;
Logger.IncomingConfiguration(fingerprint, config.Fingerprint, applyConfig);
switch ((config.Fingerprint == fingerprint, applyConfig))
{
    case (true, true):
        await ConfigurationStorage.ApplyAsync(token).ConfigureAwait(false);
        break;
    case (true, false):
        break;
    case (false, false):
        await ConfigurationStorage.ProposeAsync(config).ConfigureAwait(false);
        break;
    case (false, true):
        result = result with { Value = false };
        break;
}

The occasional timeouts are why I found the above. A test cluster has hit 3 timeouts in 8 days (so it is not fast to reproduce). Warning-level logging gives a single entry when the timeout occurs:

2023-06-20 00:21:57.7202|WARN|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Timeout occurred while processing request from 192.168.100.11:40896|System.Threading.Tasks.TaskCanceledException: A task was canceled.                                                                                                                                                                                                                
at DotNext.Net.Cluster.Consensus.Raft.Membership.ClusterConfigurationStorage`1.DotNext.Net.Cluster.Consensus.Raft.Membership.IClusterConfigurationStorage.ApplyAsync(CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Membership/ClusterConfigurationStorage.cs:line 144                                                                                                                 
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)                                                 
at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 629                                              
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)                                        
at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.Server.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/TransportServices/ConnectionOriented/Server.cs:line 121                                                                                                                                                   
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)                                                 
at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 135         

Since I am running with a fixed member configuration loaded at start, it is extra odd to see timeouts applying the configuration during AppendEntries, as the configuration is always the same and there should be nothing to apply, right?

The other thing I find odd is that I thought InMemoryClusterConfigurationStorage only holds the config in memory, so how/why would ApplyAsync take long enough to trigger a timeout? Is it possible there is some odd locking issue in this area of the implementation, perhaps related to events raised while the lock is held?

sakno commented 1 year ago

I wonder if that is intended, since the config is already correct.

Yes, it's intended. The configuration storage is smart enough to determine that there is no proposed configuration to apply, so ApplyAsync completes synchronously.
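
A minimal sketch of that early exit (an assumption based on the description above, not the actual library code): when nothing has been proposed, there is nothing to apply, so the method can complete synchronously without ever touching the async lock.

using System.Threading;
using System.Threading.Tasks;

// Hypothetical storage sketch; the type and member names are made up for illustration.
sealed class ConfigurationStorageSketch
{
    // Mirrors the ProposedConfiguration property seen in the snippet above; null while nothing has been proposed.
    public object? ProposedConfiguration { get; set; }

    public ValueTask ApplyAsync(CancellationToken token)
        => ProposedConfiguration is null
            ? ValueTask.CompletedTask      // nothing proposed => completes synchronously
            : ApplyUnderLockAsync(token);  // otherwise acquire the async lock and apply

    private async ValueTask ApplyUnderLockAsync(CancellationToken token)
    {
        // Placeholder for the real work: acquire the async lock, apply the
        // proposed configuration, raise membership events, release the lock.
        await Task.Yield();
    }
}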

sakno commented 1 year ago

Is it possible there is some odd locking issue in this area of the implementation

Async locks are highly tested, and I know of production code running async locks with an uptime greater than 6 months. Also, I tried to reproduce this with the following stress test, but with no luck:

[Fact]
public static async Task StressTest()
{
    using var l = new AsyncExclusiveLock();

    await Task.WhenAll(
        Task.Run(AcquireReleaseMultipleTimes),
        Task.Run(AcquireReleaseMultipleTimes),
        Task.Run(AcquireReleaseMultipleTimes));

    async Task AcquireRelease()
    {
        var cts = new CancellationTokenSource();
        cts.CancelAfter(DefaultTimeout);

        var lockTaken = false;
        try
        {
            await l.AcquireAsync(cts.Token);
            lockTaken = true;

            await Task.Yield();
        }
        finally
        {
            cts.Dispose();

            if (lockTaken)
                l.Release();
        }
    }

    async Task AcquireReleaseMultipleTimes()
    {
        for (var i = 0; i < 100; i++)
            await AcquireRelease();
    }
}

However, it is possible to prove whether the issue is caused by AsyncExclusiveLock. Every async lock exposes metrics; you can harvest them using OpenTelemetry (dotnet-counters can do the same, but I'm not sure that you're ready to track metrics 24/7 manually). The metrics group is DotNext.Threading.AsyncLock with the tag dotnext.asynclock.type equal to AsyncExclusiveLock; the metric name is suspension-duration. It measures lock duration, in milliseconds.
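
If you want to collect that metric in-process without a full OpenTelemetry pipeline, a MeterListener from System.Diagnostics.Metrics can be pointed at the names mentioned above. The meter, instrument, and tag names below are taken from this comment; the double measurement type is an assumption.

using System;
using System.Diagnostics.Metrics;

// Minimal in-process listener for the suspension-duration metric described above.
using var listener = new MeterListener();

listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "DotNext.Threading.AsyncLock"
        && instrument.Name == "suspension-duration")
        l.EnableMeasurementEvents(instrument);
};

// Assumes the histogram records double values.
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
{
    foreach (var tag in tags)
    {
        if (tag.Key == "dotnext.asynclock.type" && Equals(tag.Value, "AsyncExclusiveLock"))
        {
            Console.WriteLine($"{instrument.Name}: {value} ms");
            break;
        }
    }
});

listener.Start();
// ... keep the process running; measurements are reported via the callback.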

sakno commented 1 year ago

I pushed a new version with more precise control over measurement tags to correctly distinguish metrics associated with different instances of AsyncExclusiveLock (it is used by RaftCluster and ClusterConfigurationStorage implementations separately).

sakno commented 1 year ago

I found a cheap way to avoid calling ApplyAsync when the configuration remains unchanged. The change is pushed to the develop branch. However, if there is an issue with AsyncExclusiveLock, the improvement will hide it. I still need statistics about suspension-duration.

sakno commented 1 year ago

I've done some research about the cancellation. There are two possible cases: either the token is already canceled when ApplyAsync is called, or cancellation happens while waiting for the lock inside ApplyAsync.

An example of the first case:

using IClusterConfigurationStorage storage = new InMemoryClusterConfigurationStorage();
await storage.ApplyAsync(new(true)); // new(true) is a CancellationToken that is already canceled

In this case, the stack trace looks exactly as you pointed out:

System.Threading.Tasks.TaskCanceledException : A task was canceled.
   at DotNext.Net.Cluster.Consensus.Raft.Membership.ClusterConfigurationStorage`1.DotNext.Net.Cluster.Consensus.Raft.Membership.IClusterConfigurationStorage.ApplyAsync(CancellationToken token) in ~/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Membership/ClusterConfigurationStorage.cs:line 143
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)

It is happening because AsyncExclusiveLock has an optimization path that checks whether the token is already canceled. If it is, there is no need to allocate a new task: ValueTask.FromCanceled is returned, which is a cached canceled task. In that case, the exception is thrown inside the IClusterConfigurationStorage.ApplyAsync method instead of AcquireAsync, because AcquireAsync completes synchronously without throwing any exception.
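
A minimal sketch of that fast path (illustration only, not the library code): because the canceled ValueTask is returned synchronously, the cancellation is only observed at the await inside the caller, which is why ApplyAsync is the top frame of the trace instead of AcquireAsync.

using System.Threading;
using System.Threading.Tasks;

// Illustration only; the method names mirror the discussion above, but the bodies are made up.
static class CancellationFastPathSketch
{
    // A token that is already canceled takes the synchronous path:
    // a canceled ValueTask is returned, and no exception is thrown here.
    public static ValueTask AcquireAsync(CancellationToken token)
        => token.IsCancellationRequested
            ? ValueTask.FromCanceled(token)
            : AcquireSlowAsync(token);

    public static async ValueTask ApplyAsync(CancellationToken token)
    {
        // The TaskCanceledException surfaces at this await, i.e. in ApplyAsync's frame,
        // because AcquireAsync itself completed synchronously without throwing.
        await AcquireAsync(token);
    }

    private static async ValueTask AcquireSlowAsync(CancellationToken token)
        => await Task.Yield(); // placeholder for the real lock suspension logic
}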

In the second case, the stack trace looks very different:

System.OperationCanceledException : The operation was canceled.
   at DotNext.Result`1.Validate() in ~/projects/dotnext/src/DotNext/Result.cs:line 186
   at DotNext.Result`1.get_Value() in ~/projects/dotnext/src/DotNext/Result.cs:line 181
   at DotNext.Threading.Tasks.ValueTaskCompletionSource`1.GetResult(Int16 token) in ~/projects/dotnext/src/DotNext.Threading/Threading/Tasks/ValueTaskCompletionSource.T.cs:line 230
   at DotNext.Threading.Tasks.ValueTaskCompletionSource`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) in ~/dotnext/src/DotNext.Threading/Threading/Tasks/ValueTaskCompletionSource.T.cs:line 237
   at DotNext.Net.Cluster.Consensus.Raft.Membership.ClusterConfigurationStorage`1.DotNext.Net.Cluster.Consensus.Raft.Membership.IClusterConfigurationStorage.ApplyAsync(CancellationToken token) in ~/projects/dotnext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Membership/ClusterConfigurationStorage.cs:line 143
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)

This research demonstrates that the token associated with the timeout is already canceled at the time the ApplyAsync method is called.

freddyrios commented 11 months ago

Thanks for looking into it. I thought the stack trace was certainly weird, but it did not occur to me why that could be.

By the way, I was trying to collect the data, but early attempts did not reproduce the issue, and then other stuff got in the way.

When I get back to this I will try to find some better info on what's going on, as AppendEntries in that environment should not have been hitting the large timeouts in any case. It might even be related to an issue we saw in a different cluster with smaller timeouts, where there was a 15-second window in which all AppendEntries retries for a specific node kept timing out (and then normal operation resumed after that).

sakno commented 11 months ago

It could be an I/O issue (e.g. network delays) causing token cancellation inside of AppendEntries.

freddyrios commented 11 months ago

@sakno we were certainly hit by an issue with EEE (Energy-Efficient Ethernet) on Raspberry Pi in some clusters (https://github.com/raspberrypi/linux/issues/4289), and that certainly applies to the cluster where the 15-second issue happened.

Anyway, we were already planning to gather more stats, including network stats, to better understand these issues. It seems that is the right direction.

sakno commented 11 months ago

You can also measure broadcast-time over a long period of time to ensure that the leader has enough time to send heartbeats.