dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License
1.62k stars 122 forks source link

Sometimes when starting multiple cluster members fast, it causes infinite loop in leadership #149

Closed davhdavh closed 1 year ago

davhdavh commented 1 year ago

Occasionally when starting up the cluster, as the example in https://github.com/davhdavh/Raft3DockerClusterExample the result is that it works, but it also causes an endless loop of leadership reelection messages. It seems related to https://github.com/dotnet/dotNext/issues/108

It might also be because I just don't understand how AddMemberAsync is suppose to be used, the docs you added in https://github.com/dotnet/dotNext/issues/146 are clear as mud. Or ColdStart? How am I supposed to known which cluster member is the "first" to start, or rather if I also have cluster aware decision making, why would I need to use Raft?

davhdavh commented 1 year ago

Sometimes there is also partial error, such as this example where it elects a leader, but then immediately after sending a message trough the messaging system decides that there is no consensus anymore and start election:

info: Raft3DockerClusterExample.ClusterConfigurator[0] Peer discovered: https://raft3dockerclusterexample_7098/ Raft3DockerClusterExample.ClusterConfigurator: Information: Peer discovered: https://raft3dockerclusterexample_7098/ info: Raft3DockerClusterExample.ClusterConfigurator[0] Peer discovered: https://raft3dockerclusterexample_7097/ Raft3DockerClusterExample.ClusterConfigurator: Information: Peer discovered: https://raft3dockerclusterexample_7097/ info: Raft3DockerClusterExample.ClusterConfigurator[0] Peer discovered: https://raft3dockerclusterexample_7096/ Raft3DockerClusterExample.ClusterConfigurator: Information: Peer discovered: https://raft3dockerclusterexample_7096/ warn: Microsoft.AspNetCore.DataProtection.Repositories.FileSystemXmlRepository[60] info: Raft3DockerClusterExample.ClusterConfigurator[0] Peer discovered: https://raft3dockerclusterexample_7097/ Raft3DockerClusterExample.ClusterConfigurator: Information: Peer discovered: https://raft3dockerclusterexample_7097/ info: Raft3DockerClusterExample.ClusterConfigurator[0] Peer discovered: https://raft3dockerclusterexample_7096/ Raft3DockerClusterExample.ClusterConfigurator: Information: Peer discovered: https://raft3dockerclusterexample_7096/ info: Raft3DockerClusterExample.ClusterConfigurator[0] Peer discovered: https://raft3dockerclusterexample_7098/ Raft3DockerClusterExample.ClusterConfigurator: Information: Peer discovered: https://raft3dockerclusterexample_7098/ info: Raft3DockerClusterExample.ClusterConfigurator[0] Peer discovered: https://raft3dockerclusterexample_7097/ Raft3DockerClusterExample.ClusterConfigurator: Information: Peer discovered: https://raft3dockerclusterexample_7097/ info: Raft3DockerClusterExample.ClusterConfigurator[0] Peer discovered: https://raft3dockerclusterexample_7098/ Raft3DockerClusterExample.ClusterConfigurator: Information: Peer discovered: https://raft3dockerclusterexample_7098/ info: Raft3DockerClusterExample.ClusterConfigurator[0] Peer discovered: https://raft3dockerclusterexample_7096/ Raft3DockerClusterExample.ClusterConfigurator: Information: Peer discovered: https://raft3dockerclusterexample_7096/ info: Microsoft.Hosting.Lifetime[14] Now listening on: https://[::]:443 Microsoft.Hosting.Lifetime: Information: Now listening on: https://[::]:443 info: Microsoft.Hosting.Lifetime[0] Application started. Press Ctrl+C to shut down. Microsoft.Hosting.Lifetime: Information: Application started. Press Ctrl+C to shut down. info: Microsoft.Hosting.Lifetime[0] Hosting environment: Development Microsoft.Hosting.Lifetime: Information: Hosting environment: Development info: Microsoft.Hosting.Lifetime[0] Content root path: C:\app Microsoft.Hosting.Lifetime: Information: Content root path: C:\app info: Microsoft.Hosting.Lifetime[14] Now listening on: https://[::]:443 Microsoft.Hosting.Lifetime: Information: Now listening on: https://[::]:443 info: Microsoft.Hosting.Lifetime[0] Application started. Press Ctrl+C to shut down. Microsoft.Hosting.Lifetime: Information: Application started. Press Ctrl+C to shut down. info: Microsoft.Hosting.Lifetime[0] Hosting environment: Development Microsoft.Hosting.Lifetime: Information: Hosting environment: Development info: Microsoft.Hosting.Lifetime[0] Content root path: C:\app Microsoft.Hosting.Lifetime: Information: Content root path: C:\app info: Microsoft.Hosting.Lifetime[14] Now listening on: https://[::]:443 Microsoft.Hosting.Lifetime: Information: Now listening on: https://[::]:443 info: Microsoft.Hosting.Lifetime[0] Application started. Press Ctrl+C to shut down. Microsoft.Hosting.Lifetime: Information: Application started. Press Ctrl+C to shut down. info: Microsoft.Hosting.Lifetime[0] Hosting environment: Development Microsoft.Hosting.Lifetime: Information: Hosting environment: Development info: Microsoft.Hosting.Lifetime[0] Content root path: C:\app Microsoft.Hosting.Lifetime: Information: Content root path: C:\app warn: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75001] Cluster member https://raft3dockerclusterexample_7098/ is unavailable System.Threading.Tasks.TaskCanceledException: A task was canceled. at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts) at System.Net.Http.HttpClient.g__Core|830(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) at DotNext.Net.Cluster.Consensus.Raft.Http.RaftClusterMember.SendAsync[TResponse,TMessage](TMessage message, CancellationToken token) in //src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/RaftClusterMember.cs:line 89 DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Warning: Cluster member https://raft3dockerclusterexample_7098/ is unavailable

System.Threading.Tasks.TaskCanceledException: A task was canceled. at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts) at System.Net.Http.HttpClient.g__Core|830(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) at DotNext.Net.Cluster.Consensus.Raft.Http.RaftClusterMember.SendAsync[TResponse,TMessage](TMessage message, CancellationToken token) in //src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/RaftClusterMember.cs:line 89 warn: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75001] Cluster member https://raft3dockerclusterexample_7097/ is unavailable System.TimeoutException: A connection could not be established within the configured ConnectTimeout. DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Warning: Cluster member https://raft3dockerclusterexample_7097/ is unavailable

System.TimeoutException: A connection could not be established within the configured ConnectTimeout. info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 0 DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Information: Transition to Candidate state has started with term 0 warn: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75001] Cluster member https://raft3dockerclusterexample_7096/ is unavailable System.Threading.Tasks.TaskCanceledException: The operation was canceled. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request.. ---> System.Net.Sockets.SocketException (995): The I/O operation has been aborted because of either a thread exit or an application request. --- End of inner exception stack trace --- at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken) at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](CancellationToken cancellationToken) at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder1.StateMachineBox1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](Memory1 buffer, CancellationToken cancellationToken) at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) at System.Net.Http.HttpConnection.InitialFillAsync(Boolean async) at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) --- End of inner exception stack trace --- at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts) at System.Net.Http.HttpClient.g__Core|830(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) at DotNext.Net.Cluster.Consensus.Raft.Http.RaftClusterMember.SendAsync[TResponse,TMessage](TMessage message, CancellationToken token) in //src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/RaftClusterMember.cs:line 89 DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Warning: Cluster member https://raft3dockerclusterexample_7096/ is unavailable

System.Threading.Tasks.TaskCanceledException: The operation was canceled. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request.. ---> System.Net.Sockets.SocketException (995): The I/O operation has been aborted because of either a thread exit or an application request. --- End of inner exception stack trace --- at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken) at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](CancellationToken cancellationToken) at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder1.StateMachineBox1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](Memory1 buffer, CancellationToken cancellationToken) at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder1.StateMachineBox1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token) at System.Net.Http.HttpConnection.InitialFillAsync(Boolean async) at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) --- End of inner exception stack trace --- at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts) at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) at DotNext.Net.Cluster.Consensus.Raft.Http.RaftClusterMember.SendAsync[TResponse,TMessage](TMessage message, CancellationToken token) in /_/src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/RaftClusterMember.cs:line 89 fail: Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddleware[1] An unhandled exception has occurred while executing the request. System.OperationCanceledException: The operation was canceled. at System.Threading.CancellationToken.ThrowOperationCanceledException() at System.Threading.CancellationToken.ThrowIfCancellationRequested() at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.Microsoft.AspNetCore.Http.Features.IHttpResponseBodyFeature.StartAsync(CancellationToken cancellationToken) at Microsoft.AspNetCore.Http.HttpResponseWritingExtensions.WriteAsync(HttpResponse response, String text, Encoding encoding, CancellationToken cancellationToken) at DotNext.Net.Cluster.Consensus.Raft.Http.HttpMessage.SaveResponseAsync[T](HttpResponse response, T result, CancellationToken token) in /_/src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/HttpMessage.cs:line 92 at DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpMessage.SaveResponseAsync[T](HttpResponse response, Result1& result, CancellationToken token) in //src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/RaftHttpMessage.cs:line 66 at DotNext.Net.Cluster.Consensus.Raft.Http.PreVoteMessage.SaveResponseAsync(HttpResponse response, Result`1 result, CancellationToken token) in //src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/PreVoteMessage.cs:line 46 at DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster.PreVoteAsync(PreVoteMessage request, HttpResponse response, CancellationToken token) in /_/src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/RaftHttpCluster.Messaging.cs:line 283 at Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddlewareImpl.g__Awaited|8_0(ExceptionHandlerMiddlewareImpl middleware, HttpContext context, Task task) Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddleware: Error: An unhandled exception has occurred while executing the request.

System.OperationCanceledException: The operation was canceled. at System.Threading.CancellationToken.ThrowOperationCanceledException() at System.Threading.CancellationToken.ThrowIfCancellationRequested() at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.Microsoft.AspNetCore.Http.Features.IHttpResponseBodyFeature.StartAsync(CancellationToken cancellationToken) at Microsoft.AspNetCore.Http.HttpResponseWritingExtensions.WriteAsync(HttpResponse response, String text, Encoding encoding, CancellationToken cancellationToken) at DotNext.Net.Cluster.Consensus.Raft.Http.HttpMessage.SaveResponseAsync[T](HttpResponse response, T result, CancellationToken token) in //src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/HttpMessage.cs:line 92 at DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpMessage.SaveResponseAsync[T](HttpResponse response, Result`1& result, CancellationToken token) in //src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/RaftHttpMessage.cs:line 66 at DotNext.Net.Cluster.Consensus.Raft.Http.PreVoteMessage.SaveResponseAsync(HttpResponse response, Result1 result, CancellationToken token) in /_/src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/PreVoteMessage.cs:line 46 at DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster.PreVoteAsync(PreVoteMessage request, HttpResponse response, CancellationToken token) in /_/src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/RaftHttpCluster.Messaging.cs:line 283 at Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddlewareImpl.<Invoke>g__Awaited|8_0(ExceptionHandlerMiddlewareImpl middleware, HttpContext context, Task task) info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 0 DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Information: Transition to Candidate state has started with term 0 warn: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75001] Cluster member https://raft3dockerclusterexample_7097/ is unavailable System.Threading.Tasks.TaskCanceledException: The operation was canceled. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request.. ---> System.Net.Sockets.SocketException (995): The I/O operation has been aborted because of either a thread exit or an application request. --- End of inner exception stack trace --- at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken) at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource<System.Int32>.GetResult(Int16 token) at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](CancellationToken cancellationToken) at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder1.StateMachineBox1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token) at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](Memory1 buffer, CancellationToken cancellationToken) at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder1.StateMachineBox1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) at System.Net.Http.HttpConnection.InitialFillAsync(Boolean async) at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) --- End of inner exception stack trace --- at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts) at System.Net.Http.HttpClient.g__Core|830(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) at DotNext.Net.Cluster.Consensus.Raft.Http.RaftClusterMember.SendAsync[TResponse,TMessage](TMessage message, CancellationToken token) in //src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/RaftClusterMember.cs:line 89 DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Warning: Cluster member https://raft3dockerclusterexample_7097/ is unavailable

System.Threading.Tasks.TaskCanceledException: The operation was canceled. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request.. ---> System.Net.Sockets.SocketException (995): The I/O operation has been aborted because of either a thread exit or an application request. --- End of inner exception stack trace --- at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken) at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](CancellationToken cancellationToken) at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder1.StateMachineBox1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](Memory1 buffer, CancellationToken cancellationToken) at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) at System.Net.Http.HttpConnection.InitialFillAsync(Boolean async) at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) --- End of inner exception stack trace --- at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts) at System.Net.Http.HttpClient.g__Core|830(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) at DotNext.Net.Cluster.Consensus.Raft.Http.RaftClusterMember.SendAsync[TResponse,TMessage](TMessage message, CancellationToken token) in //src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/RaftClusterMember.cs:line 89 info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 0 DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Information: Transition to Candidate state has started with term 0 info: Raft3DockerClusterExample.ValueController[0] FullInfo Raft3DockerClusterExample.ValueController: Information: FullInfo info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74003] Transition to Candidate state has completed with term 1 DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Information: Transition to Candidate state has completed with term 1 info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74003] Transition to Candidate state has completed with term 1 DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Information: Transition to Candidate state has completed with term 1 info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 1 DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Information: Transition to Candidate state has started with term 1 info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74004] Transition to Leader state has started with term 1 DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Information: Transition to Leader state has started with term 1 info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7098/. Raft3DockerClusterExample.ClusterConfigurator: Information: New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7098/. info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 1. Election timeout 00:00:00.2350000 Raft3DockerClusterExample.ClusterConfigurator: Information: Term of local cluster member is 1. Election timeout 00:00:00.2350000 info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74005] Transition to Leader state has completed with term 1 DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Information: Transition to Leader state has completed with term 1 info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7098/. Raft3DockerClusterExample.ClusterConfigurator: Information: New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7098/. info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 1. Election timeout 00:00:00.1880000 Raft3DockerClusterExample.ClusterConfigurator: Information: Term of local cluster member is 1. Election timeout 00:00:00.1880000 info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7098/. Raft3DockerClusterExample.ClusterConfigurator: Information: New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7098/. info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 1. Election timeout 00:00:00.1530000 warn: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74015] Replication of https://raft3dockerclusterexample_7096/ is failed. Retry replication from entry 1 DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Warning: Replication of https://raft3dockerclusterexample_7096/ is failed. Retry replication from entry 1 Raft3DockerClusterExample.ClusterConfigurator: Information: Term of local cluster member is 1. Election timeout 00:00:00.1530000 warn: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74015] Replication of https://raft3dockerclusterexample_7097/ is failed. Retry replication from entry 1 DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Warning: Replication of https://raft3dockerclusterexample_7097/ is failed. Retry replication from entry 1 info: Raft3DockerClusterExample.TryMeHandler[0] Got hest Raft3DockerClusterExample.TryMeHandler: Information: Got hest info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 1 DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster: Information: Transition to Candidate state has started with term 1 info: Raft3DockerClusterExample.ClusterConfigurator[0] Consensus cannot be reached Raft3DockerClusterExample.ClusterConfigurator: Information: Consensus cannot be reached info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 1. Election timeout 00:00:00.1530000 Raft3DockerClusterExample.ClusterConfigurator: Information: Term of local cluster member is 1. Election timeout 00:00:00.1530000 info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7098/. Raft3DockerClusterExample.ClusterConfigurator: Information: New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7098/. info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 1. Election timeout 00:00:00.1530000 Raft3DockerClusterExample.ClusterConfigurator: Information: Term of local cluster member is 1. Election timeout 00:00:00.1530000

sakno commented 1 year ago

ColdStart parameter is used to solve chicken or the egg problem. Let's assume that we starting a new cluster in an empty environment. In this point in time, we have no nodes at all. The first node (and the only at that point in time) launched in that environment must become a leader. Because only the leader can accept and commit modifications in membership list. This why you need ColdStart=true for that node. Otherwise, all nodes will be started in Follower state, and no one is able to accept necessary modifications.

AddMemberAsync can be called on the leader node only, because the leader is responsible for distribution and consensus of membership list.

From other side, you can ignore ColdStart configuration property at all if you can pre-populate membership list at startup time consistently across all launching nodes.

sakno commented 1 year ago

According to https://github.com/davhdavh/Raft3DockerClusterExample/blob/master/Program.cs#L24, I see that you using an approach with the pre-populated configuration.

Okay, could you please share JSON configuration for each of three nodes. It is better to have report code without installing Docker and other complex environmental components. #108 was fixed and the repo has a regression test for it.

davhdavh commented 1 year ago

https://github.com/davhdavh/Raft3DockerClusterExample/blob/master/appsettings.json

For the info of all members and https://github.com/davhdavh/Raft3DockerClusterExample/blob/master/docker-compose.override.yml for the individual nodes to know who they are.

On Fri, Feb 10, 2023, 23:32 SRV @.***> wrote:

According to https://github.com/davhdavh/Raft3DockerClusterExample/blob/master/Program.cs#L24, I see that you using an approach with the pre-populated configuration.

Okay, could you please share JSON configuration for each of three nodes. It is better to have report code without installing Docker and other complex environmental components. #108 https://github.com/dotnet/dotNext/issues/108 was fixed and the repo has a regression test for it.

— Reply to this email directly, view it on GitHub https://github.com/dotnet/dotNext/issues/149#issuecomment-1426054786, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJFIDK77OLOJZVL22S4NWLWWZURLANCNFSM6AAAAAAUXLWDGA . You are receiving this because you authored the thread.Message ID: @.***>

sakno commented 1 year ago

Did you try to reproduce the same without Docker? Ideally, in the same process. Or dump logs with Debug severity for each container separately. It is very hard to reproduce the environment (Windows+Docker) on my side.

davhdavh commented 1 year ago

I can only repro the problem when VS debugger is attaching to process (IE, when start from inside VS), which probably means it is a problem with latency and/or stalls. If just starting docker with 'docker compose up' it will gladly start just fine every single time (out of the 50+ times I tried)

sakno commented 1 year ago

See my answer here. To check that assumption, you can measure response time from the node. You can register HttpMetricsCollector in DI as singleton and assign ResponseTimeCounter counter. This counter will be available through dotnet-counters tool from SDK. You can use examples/RaftNode within the repository as a a sample of perf counters registration. Or you can just override HttpMetricsCollector.ReportResponseTime method.

davhdavh commented 1 year ago
   public override void ReportResponseTime(TimeSpan value)
   {
      _logger.LogInformation($"ResponseTime {value.TotalMilliseconds:F2}ms");
      base.ReportResponseTime(value);
   }

If I run it inside VS, I get single digit "Metrics: Information: ResponseTime 1.14ms" delays. But it seems just adding this logging is enough to eliminate the problem.

However, I can relatively consistently recreate the problem if I pause execution and start it again. One machine will resume fast, figure out there is a problem, start a reelection which will complete very fast, only for the others to somehow detect that there is no consensus again even though it has been just milliseconds since last reelection.

Ignore, the log just got ordered in such a way that it looks bad.

davhdavh commented 1 year ago

Double checked... Adding Metrics is enough to eliminate the problem. Without services.AddSingleton<MetricsCollector, Metrics>(); I get problems about 20% of the time. With, I havent been able to reproduce the problem even once.

12:11:03.412 7096:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 3 12:11:03.412 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Consensus cannot be reached 12:11:03.412 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 3. Election timeout 00:00:00.2250000 12:11:03.413 7096:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74003] Transition to Candidate state has completed with term 4 12:11:03.412 7097:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 3 12:11:03.412 7097:info: Raft3DockerClusterExample.ClusterConfigurator[0] Consensus cannot be reached 12:11:03.412 7097:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 3. Election timeout 00:00:00.1530000 12:11:03.414 7097:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74003] Transition to Candidate state has completed with term 4 12:11:03.518 7098:warn: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74015] Replication of https://raft3dockerclusterexample_7097/ is failed. Retry replication from entry 2 12:11:03.518 7098:warn: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74015] Replication of https://raft3dockerclusterexample_7096/ is failed. Retry replication from entry 2 12:11:03.587 7096:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 4 12:11:03.713 7097:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 4 12:11:03.713 7097:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74003] Transition to Candidate state has completed with term 5 12:11:03.713 7098:info: Raft3DockerClusterExample.ClusterConfigurator[0] Consensus cannot be reached 12:11:03.713 7098:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 4. Election timeout 00:00:00.2150000 12:11:03.714 7097:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74004] Transition to Leader state has started with term 5 12:11:03.714 7097:info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7097/. 12:11:03.714 7097:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.2920000 12:11:03.714 7097:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74005] Transition to Leader state has completed with term 5 12:11:03.715 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7097/. 12:11:03.716 7097:warn: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74015] Replication of https://raft3dockerclusterexample_7098/ is failed. Retry replication from entry 3 12:11:03.715 7098:info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7097/. 12:11:03.715 7098:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.2150000 12:11:03.715 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:11:03.716 7097:warn: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74015] Replication of https://raft3dockerclusterexample_7096/ is failed. Retry replication from entry 3 12:11:06.112 7096:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 5 12:11:06.112 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Consensus cannot be reached 12:11:06.112 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:11:06.113 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7097/. 12:11:06.113 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:11:13.081 7096:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 5 12:11:13.081 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Consensus cannot be reached 12:11:13.081 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:11:13.237 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7097/. 12:11:13.237 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:11:16.669 7096:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 5 12:11:16.669 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Consensus cannot be reached 12:11:16.669 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:11:16.826 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7097/. 12:11:16.826 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:11:17.301 7096:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 5 12:11:17.301 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Consensus cannot be reached 12:11:17.301 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:11:17.450 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7097/. 12:11:17.450 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:11:24.049 7096:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 5 12:11:24.049 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Consensus cannot be reached 12:11:24.049 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:11:24.205 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7097/. 12:11:24.205 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:11:27.090 7096:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 5 12:11:27.090 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Consensus cannot be reached 12:11:27.090 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:11:27.245 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7097/. 12:11:27.245 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 ...

12:13:44.561 7096:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 5 12:13:44.561 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Consensus cannot be reached 12:13:44.561 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:13:44.562 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7097/. 12:13:44.562 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:13:47.084 7096:info: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74002] Transition to Candidate state has started with term 5 12:13:47.084 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Consensus cannot be reached 12:13:47.084 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000 12:13:47.241 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] New cluster leader is elected. Leader address is https://raft3dockerclusterexample_7097/. 12:13:47.241 7096:info: Raft3DockerClusterExample.ClusterConfigurator[0] Term of local cluster member is 5. Election timeout 00:00:00.1690000

sakno commented 1 year ago

VS debugger may freeze the threads, causing sockets to terminate with a timeout. If a frequency of the error is changed because of extra code (reporting of metrics, as mentioned by you) then it's just an evidence of floating error. It's better to have stable repro scenario without side effects such as VS debugger. Otherwise, it's very hard to determine a root cause. Try to enable Debug severity for logs and obtain separate logs for each container. It will be very helpful.

davhdavh commented 1 year ago

Is it supposed to do this much work when it is in a steady state? Trying to replicate something (and failing?) every 100msec seems extreme.

03:07:12.334 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74038] Incoming configuration with 8168623821148766304 fingerprint from leader. Local fingerprint is 8168623821148766304, apply True 03:07:12.338 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74000] Member is downgrading to follower state with term 9 03:07:12.338 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74011] Election timeout is refreshed 03:07:12.339 7097:dbug: Metrics[0] ResponseTime 4.30ms 03:07:12.339 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74014] Member https://raft3dockerclusterexample_7098/ is replicated successfully starting from index 7 03:07:12.339 7097:dbug: Metrics[0] ResponseTime 2.80ms 03:07:12.339 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74014] Member https://raft3dockerclusterexample_7096/ is replicated successfully starting from index 7 03:07:12.340 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74016] Unable to commit local changes. Quorum is 3, last committed entry is 6 03:07:12.338 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74001] Member is downgraded to follower state with term 9 03:07:12.338 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74038] Incoming configuration with 8168623821148766304 fingerprint from leader. Local fingerprint is 8168623821148766304, apply True 03:07:12.444 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74012] Replication of member https://raft3dockerclusterexample_7098/ started at log index 7 03:07:12.444 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74013] Replication of https://raft3dockerclusterexample_7098/ contains 0 entries. Preceding entry has index 6 and term 9. Local fingerprint is 8168623821148766304, remote is 8168623821148766304, apply True 03:07:12.444 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75000] Sending request of type AppendEntries to member https://raft3dockerclusterexample_7098/ 03:07:12.444 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74012] Replication of member https://raft3dockerclusterexample_7096/ started at log index 7 03:07:12.444 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74013] Replication of https://raft3dockerclusterexample_7096/ contains 0 entries. Preceding entry has index 6 and term 9. Local fingerprint is 8168623821148766304, remote is 8168623821148766304, apply True 03:07:12.444 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75000] Sending request of type AppendEntries to member https://raft3dockerclusterexample_7096/ 03:07:12.445 7097:dbug: Metrics[0] ResponseTime 1.34ms 03:07:12.445 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74014] Member https://raft3dockerclusterexample_7098/ is replicated successfully starting from index 7 03:07:12.445 7097:dbug: Metrics[0] ResponseTime 1.36ms 03:07:12.445 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74014] Member https://raft3dockerclusterexample_7096/ is replicated successfully starting from index 7 03:07:12.445 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74016] Unable to commit local changes. Quorum is 3, last committed entry is 6 03:07:12.441 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74000] Member is downgrading to follower state with term 9 03:07:12.441 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74011] Election timeout is refreshed 03:07:12.441 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74001] Member is downgraded to follower state with term 9 03:07:12.441 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74038] Incoming configuration with 8168623821148766304 fingerprint from leader. Local fingerprint is 8168623821148766304, apply True 03:07:12.445 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74000] Member is downgrading to follower state with term 9 03:07:12.445 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74011] Election timeout is refreshed 03:07:12.445 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74001] Member is downgraded to follower state with term 9 03:07:12.445 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74038] Incoming configuration with 8168623821148766304 fingerprint from leader. Local fingerprint is 8168623821148766304, apply True 03:07:12.553 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74012] Replication of member https://raft3dockerclusterexample_7098/ started at log index 7 03:07:12.553 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74012] Replication of member https://raft3dockerclusterexample_7096/ started at log index 7 03:07:12.553 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74013] Replication of https://raft3dockerclusterexample_7096/ contains 0 entries. Preceding entry has index 6 and term 9. Local fingerprint is 8168623821148766304, remote is 8168623821148766304, apply True 03:07:12.553 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75000] Sending request of type AppendEntries to member https://raft3dockerclusterexample_7096/ 03:07:12.553 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74013] Replication of https://raft3dockerclusterexample_7098/ contains 0 entries. Preceding entry has index 6 and term 9. Local fingerprint is 8168623821148766304, remote is 8168623821148766304, apply True 03:07:12.553 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75000] Sending request of type AppendEntries to member https://raft3dockerclusterexample_7098/ 03:07:12.555 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74000] Member is downgrading to follower state with term 9 03:07:12.551 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74000] Member is downgrading to follower state with term 9 03:07:12.551 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74011] Election timeout is refreshed 03:07:12.551 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74001] Member is downgraded to follower state with term 9 03:07:12.551 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74038] Incoming configuration with 8168623821148766304 fingerprint from leader. Local fingerprint is 8168623821148766304, apply True 03:07:12.555 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74011] Election timeout is refreshed 03:07:12.555 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74001] Member is downgraded to follower state with term 9 03:07:12.555 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74038] Incoming configuration with 8168623821148766304 fingerprint from leader. Local fingerprint is 8168623821148766304, apply True 03:07:12.555 7097:dbug: Metrics[0] ResponseTime 2.00ms 03:07:12.555 7097:dbug: Metrics[0] ResponseTime 2.38ms 03:07:12.556 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74014] Member https://raft3dockerclusterexample_7096/ is replicated successfully starting from index 7 03:07:12.556 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74014] Member https://raft3dockerclusterexample_7098/ is replicated successfully starting from index 7 03:07:12.556 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74016] Unable to commit local changes. Quorum is 3, last committed entry is 6 03:07:12.662 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74012] Replication of member https://raft3dockerclusterexample_7098/ started at log index 7 03:07:12.662 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74012] Replication of member https://raft3dockerclusterexample_7096/ started at log index 7 03:07:12.662 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74013] Replication of https://raft3dockerclusterexample_7096/ contains 0 entries. Preceding entry has index 6 and term 9. Local fingerprint is 8168623821148766304, remote is 8168623821148766304, apply True 03:07:12.662 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75000] Sending request of type AppendEntries to member https://raft3dockerclusterexample_7096/ 03:07:12.663 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74013] Replication of https://raft3dockerclusterexample_7098/ contains 0 entries. Preceding entry has index 6 and term 9. Local fingerprint is 8168623821148766304, remote is 8168623821148766304, apply True 03:07:12.663 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75000] Sending request of type AppendEntries to member https://raft3dockerclusterexample_7098/ 03:07:12.660 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74000] Member is downgrading to follower state with term 9 03:07:12.660 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74011] Election timeout is refreshed 03:07:12.660 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74001] Member is downgraded to follower state with term 9 03:07:12.660 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74038] Incoming configuration with 8168623821148766304 fingerprint from leader. Local fingerprint is 8168623821148766304, apply True 03:07:12.663 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74000] Member is downgrading to follower state with term 9 03:07:12.663 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74011] Election timeout is refreshed 03:07:12.663 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74001] Member is downgraded to follower state with term 9 03:07:12.663 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74038] Incoming configuration with 8168623821148766304 fingerprint from leader. Local fingerprint is 8168623821148766304, apply True 03:07:12.664 7097:dbug: Metrics[0] ResponseTime 1.24ms 03:07:12.664 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74014] Member https://raft3dockerclusterexample_7096/ is replicated successfully starting from index 7 03:07:12.664 7097:dbug: Metrics[0] ResponseTime 1.28ms 03:07:12.664 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74014] Member https://raft3dockerclusterexample_7098/ is replicated successfully starting from index 7 03:07:12.664 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74016] Unable to commit local changes. Quorum is 3, last committed entry is 6 03:07:12.772 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74012] Replication of member https://raft3dockerclusterexample_7098/ started at log index 7 03:07:12.772 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74012] Replication of member https://raft3dockerclusterexample_7096/ started at log index 7 03:07:12.772 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74013] Replication of https://raft3dockerclusterexample_7096/ contains 0 entries. Preceding entry has index 6 and term 9. Local fingerprint is 8168623821148766304, remote is 8168623821148766304, apply True 03:07:12.772 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75000] Sending request of type AppendEntries to member https://raft3dockerclusterexample_7096/ 03:07:12.772 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74013] Replication of https://raft3dockerclusterexample_7098/ contains 0 entries. Preceding entry has index 6 and term 9. Local fingerprint is 8168623821148766304, remote is 8168623821148766304, apply True 03:07:12.772 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75000] Sending request of type AppendEntries to member https://raft3dockerclusterexample_7098/ 03:07:12.773 7097:dbug: Metrics[0] ResponseTime 1.13ms 03:07:12.773 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74014] Member https://raft3dockerclusterexample_7098/ is replicated successfully starting from index 7 03:07:12.774 7097:dbug: Metrics[0] ResponseTime 1.56ms 03:07:12.774 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74014] Member https://raft3dockerclusterexample_7096/ is replicated successfully starting from index 7 03:07:12.774 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74016] Unable to commit local changes. Quorum is 3, last committed entry is 6 03:07:12.769 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74000] Member is downgrading to follower state with term 9 03:07:12.769 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74011] Election timeout is refreshed 03:07:12.770 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74001] Member is downgraded to follower state with term 9 03:07:12.770 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74038] Incoming configuration with 8168623821148766304 fingerprint from leader. Local fingerprint is 8168623821148766304, apply True 03:07:12.773 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74000] Member is downgrading to follower state with term 9 03:07:12.773 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74011] Election timeout is refreshed 03:07:12.773 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74001] Member is downgraded to follower state with term 9 03:07:12.773 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74038] Incoming configuration with 8168623821148766304 fingerprint from leader. Local fingerprint is 8168623821148766304, apply True 03:07:12.881 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74012] Replication of member https://raft3dockerclusterexample_7098/ started at log index 7 03:07:12.881 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74012] Replication of member https://raft3dockerclusterexample_7096/ started at log index 7 03:07:12.881 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74013] Replication of https://raft3dockerclusterexample_7096/ contains 0 entries. Preceding entry has index 6 and term 9. Local fingerprint is 8168623821148766304, remote is 8168623821148766304, apply True 03:07:12.881 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75000] Sending request of type AppendEntries to member https://raft3dockerclusterexample_7096/ 03:07:12.881 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74013] Replication of https://raft3dockerclusterexample_7098/ contains 0 entries. Preceding entry has index 6 and term 9. Local fingerprint is 8168623821148766304, remote is 8168623821148766304, apply True 03:07:12.881 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75000] Sending request of type AppendEntries to member https://raft3dockerclusterexample_7098/ 03:07:12.882 7097:dbug: Metrics[0] ResponseTime 1.21ms 03:07:12.882 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74014] Member https://raft3dockerclusterexample_7096/ is replicated successfully starting from index 7 03:07:12.882 7097:dbug: Metrics[0] ResponseTime 1.04ms 03:07:12.882 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74014] Member https://raft3dockerclusterexample_7098/ is replicated successfully starting from index 7 03:07:12.883 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74016] Unable to commit local changes. Quorum is 3, last committed entry is 6 03:07:12.878 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74000] Member is downgrading to follower state with term 9 03:07:12.879 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74011] Election timeout is refreshed 03:07:12.879 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74001] Member is downgraded to follower state with term 9 03:07:12.879 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74038] Incoming configuration with 8168623821148766304 fingerprint from leader. Local fingerprint is 8168623821148766304, apply True 03:07:12.882 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74000] Member is downgrading to follower state with term 9 03:07:12.882 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74011] Election timeout is refreshed 03:07:12.882 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74001] Member is downgraded to follower state with term 9 03:07:12.882 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74038] Incoming configuration with 8168623821148766304 fingerprint from leader. Local fingerprint is 8168623821148766304, apply True 03:07:12.991 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74012] Replication of member https://raft3dockerclusterexample_7098/ started at log index 7 03:07:12.991 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74013] Replication of https://raft3dockerclusterexample_7098/ contains 0 entries. Preceding entry has index 6 and term 9. Local fingerprint is 8168623821148766304, remote is 8168623821148766304, apply True 03:07:12.991 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75000] Sending request of type AppendEntries to member https://raft3dockerclusterexample_7098/ 03:07:12.991 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74012] Replication of member https://raft3dockerclusterexample_7096/ started at log index 7 03:07:12.992 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74013] Replication of https://raft3dockerclusterexample_7096/ contains 0 entries. Preceding entry has index 6 and term 9. Local fingerprint is 8168623821148766304, remote is 8168623821148766304, apply True 03:07:12.992 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[75000] Sending request of type AppendEntries to member https://raft3dockerclusterexample_7096/ 03:07:12.993 7097:dbug: Metrics[0] ResponseTime 1.15ms 03:07:12.993 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74014] Member https://raft3dockerclusterexample_7096/ is replicated successfully starting from index 7 03:07:12.993 7097:dbug: Metrics[0] ResponseTime 1.64ms 03:07:12.993 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74014] Member https://raft3dockerclusterexample_7098/ is replicated successfully starting from index 7 03:07:12.993 7097:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74016] Unable to commit local changes. Quorum is 3, last committed entry is 6 03:07:12.989 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74000] Member is downgrading to follower state with term 9 03:07:12.989 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74011] Election timeout is refreshed 03:07:12.989 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74001] Member is downgraded to follower state with term 9 03:07:12.989 7098:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74038] Incoming configuration with 8168623821148766304 fingerprint from leader. Local fingerprint is 8168623821148766304, apply True 03:07:12.992 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74000] Member is downgrading to follower state with term 9 03:07:12.992 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74011] Election timeout is refreshed 03:07:12.992 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74001] Member is downgraded to follower state with term 9 03:07:12.993 7096:dbug: DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster[74038] Incoming configuration with 8168623821148766304 fingerprint from leader. Local fingerprint is 8168623821148766304, apply True

sakno commented 1 year ago

The last log doesn't have any exceptions. Frequent (every 100ms) heartbeats are fine even if there is nothing to commit. Heartbeats prevent other nodes to detect faulty leader and transit to Candidate state. This behavior is fully compliant with Raft paper.

sakno commented 1 year ago

One thing that can improve your debugging experience: use standby mode. Attached VS debugger is probably a root cause of the problems mentioned above. Standby mode gives you determinism of cluster behavior under debugging. Do the following: start nodes B and C in this mode. The mode prevents tracking of election timeout and transitions between Raft states. Then start node A in regular mode. Node A will be turned into leader state eventually. This happens because B and C cannot transit to Candidate state. But they can respond to any Raft-specific RPC messages from node A. Now you have a stable cluster for debugging purposes (leader is deterministic and cannot be changed).