copenhagenatomics / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License

crash due to unhandled exception #6

Closed freddyrios closed 2 years ago

freddyrios commented 2 years ago

This happened on a test system that is not a proper/full cluster (only 2 nodes), so any single failed node takes down the cluster (which is acceptable for this test system).

There are 2 problems:

  1. a NullReferenceException (see the stack trace further below)
  2. unhandled exceptions queued on background threads crash the process (taking down more than the dotnext raft code with it); see the sketch after this list. Related: 1, 2, 3, 4.
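A minimal, self-contained sketch of problem 2, assuming a plain console host (all names below are illustrative, not dotNext code): an exception escaping an async void method is re-thrown on a thread-pool thread (the Task.ThrowAsync / QueueUserWorkItemCallbackDefaultContext frames in the trace below are that mechanism), and an unhandled exception on a thread-pool thread terminates the whole .NET process. AppDomain.CurrentDomain.UnhandledException can observe it for logging, which is how the trace below was captured, but it cannot keep the process alive.

using System;
using System.Threading.Tasks;

class BackgroundCrashSketch
{
    static async Task Main()
    {
        // Logging hook only; it runs just before the runtime tears the process down.
        AppDomain.CurrentDomain.UnhandledException += (_, e) =>
            Console.WriteLine($"Unhandled exception detected: {e.ExceptionObject}");

        FireAndForget(); // nothing observes the fault of an async void method

        await Task.Delay(TimeSpan.FromSeconds(1));
        Console.WriteLine("never reached - the process has already crashed");
    }

    // Stand-in for any background continuation whose exception nobody awaits.
    static async void FireAndForget()
    {
        await Task.Yield(); // hop onto a thread-pool thread
        throw new NullReferenceException("simulated fault on a background thread");
    }
}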

The exception below is from A.log, captured via AppDomain.CurrentDomain.UnhandledException:

04.17 20:59:50.185 - 04.17 20:59:50.168 - Unhandled exception detected
System.NullReferenceException: Object reference not set to an instance of an object.
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Partition.Read(Int32 sessionId, Int64 absoluteIndex, LogEntryReadOptimizationHint hint) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs:line 212
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.UnsafeReadAsync[TResult](LogEntryConsumer`2& reader, MemoryOwner`1& list, Int32 sessionId, Int64 startIndex, Int64 endIndex, Int32 listIndex, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 175
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.UnsafeReadAsync[TResult](LogEntryConsumer`2 reader, Int32 sessionId, Int64 startIndex, Int64 endIndex, Int32 length, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 168
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.ReadUnbufferedAsync[TResult](LogEntryConsumer`2 reader, Int64 startIndex, Nullable`1 endIndex, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 279
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.PreVoteAsync(Int64 currentTerm) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 535
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.DotNext.Net.Cluster.Consensus.Raft.IRaftStateMachine.MoveToCandidateState() in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 782
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__128_1(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

raft.log (timestamps are shifted 1 hr from the log above, so these entries are from the same moments):

2022-04-17 21:57:23.4282|WARN|leaderchange|New cluster leader is elected - Leader address 169.254.170.47:3450 - Term 5753 - Election timeout 00:00:00.2870000
2022-04-17 21:59:21.4900|WARN|leaderchange|Consensus cannot be reached - Term 5753 - Election timeout 00:00:00.2870000
2022-04-17 21:59:21.8211|WARN|leaderchange|New cluster leader is elected - Leader address 169.254.170.47:3450 - Term 5755 - Election timeout 00:00:00.1860000
2022-04-17 21:59:49.9832|ERROR|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Failed to process request from 169.254.170.47:52582

C.log (just another of our logs with events just before the crash):

04.17 20:59:21.492 - no leader detected, waiting for leader election before resuming decisions
04.17 20:59:21.821 - new leader detected, resuming decisions
04.17 20:59:49.989 - full vector was rejected by the leader, retrying
04.17 20:59:50.035 - full vector was rejected by the leader, retrying
04.17 20:59:50.086 - full vector was rejected by the leader, retrying
04.17 20:59:50.134 - full vector was rejected by the leader, retrying
04.17 20:59:50.187 - full vector was rejected by the leader, retrying

node 2 C.log (same time range as above):

04.17 20:59:50.034 - rejecting full vector request - no longer leader
04.17 20:59:50.084 - rejecting full vector request - no longer leader
04.17 20:59:50.133 - rejecting full vector request - no longer leader
04.17 20:59:50.186 - rejecting full vector request - no longer leader

node 2 raft.log:

2022-04-17 21:57:23.4189|WARN|leaderchange|New cluster leader is elected - Leader address 169.254.170.47:3450 - Term 5753 - Election timeout 00:00:00.2980000
2022-04-17 21:59:21.8022|WARN|leaderchange|Consensus cannot be reached - Term 5754 - Election timeout 00:00:00.2980000
2022-04-17 21:59:21.8193|WARN|leaderchange|New cluster leader is elected - Leader address 169.254.170.47:3450 - Term 5755 - Election timeout 00:00:00.2980000
2022-04-17 21:59:50.2901|WARN|leaderchange|Consensus cannot be reached - Term 5755 - Election timeout 00:00:00.2980000
freddyrios commented 2 years ago

Roman has stated this was already fixed in a release later than the one we used. Since this is a low-reproduction issue, the work here is to upgrade to the latest version (4.4.1) while keeping an eye on cluster stability in some of the failure scenarios we have tried in the past.
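For reference, and assuming the library is consumed as the DotNext.Net.Cluster NuGet package (the namespace seen in the trace above), the upgrade itself is a plain package bump, e.g. from the project directory:

dotnet add package DotNext.Net.Cluster --version 4.4.1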

freddyrios commented 2 years ago

Our latest internal release is now using 4.4.1.

Although we don't have reproduction steps to validate that the issue is fully gone, input from Roman plus code review makes it likely to be fixed now.

We'll roll out the release and follow up if we ever run into these symptoms again.