dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License

(DotNext.Net.Cluster) Leader does not receive IRaftCluster.LeaderChanged event when downgrading to follower #98

Closed (issue by RyanTT, closed 2 years ago)

RyanTT commented 2 years ago

Hello,

I am currently trying out version 4.2.0-beta.1 and have come across an issue with writing log entries via Raft and with the events associated with it. Steps to reproduce:

  1. Have a single node start a standalone cluster (first node)
  2. Add a second member with AddMember (second node)
  3. Kill the process of node 2
  4. Start the process of node 2 again

After step 3, my logs indicate that the leader steps down to follower but does not fire the IRaftCluster.LeaderChanged event. On node 1, IRaftCluster.Members still reports the local node 1 (Remote == false) as leader, but attempting to write to the log now results in:

      System.InvalidOperationException: The local cluster member is not a leader
         at DotNext.Threading.Tasks.ValueTaskCompletionSource`1.GetResult(Int16 token) in /_/src/DotNext.Threading/Threading/Tasks/ValueTaskCompletionSource.T.cs:line 272
         at DotNext.Threading.Tasks.ValueTaskCompletionSource`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) in /_/src/DotNext.Threading/Threading/Tasks/ValueTaskCompletionSource.T.cs:line 279
         at DotNext.Net.Cluster.Consensus.Raft.LeaderState.ReplicationCallback.Invoke() in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/LeaderState.Replication.cs:line 177
      --- End of stack trace from previous location ---
         at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.ReplicateAsync[TEntry](TEntry entry, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 818

IClusterMember.MemberStatusChanged will correctly fire and set node 2 to Unavailable during step 3. This behavior was not present before the upgrade to 4.2.0-beta.1.

After step 4, IClusterMember.MemberStatusChanged correctly fires on node 1 and marks node 2 as available again. However, node 1 is still unable to write to the log, as it apparently is no longer the leader. Only some time after this step does node 1 fire IRaftCluster.LeaderChanged, first with no leader and then again with a leader.

sakno commented 2 years ago

@RyanTT, what kind of transport are you using: HTTP, UDP, or TCP?

RyanTT commented 2 years ago

I am using HTTP transport.

sakno commented 2 years ago

@RyanTT, the regression has been fixed.

RyanTT commented 2 years ago

I believe the issue is still present. I've updated my projects to beta.2 and the behavior has remained the same:

After step 3, node 1 is unable to write to the log; the attempt fails with System.InvalidOperationException: The local cluster member is not a leader.

sakno commented 2 years ago

@RyanTT, do you have repro code for that?

RyanTT commented 2 years ago

Sorry for the late response. I'll try to write a test that reproduces this issue in the next few days.

sakno commented 2 years ago

Hi @RyanTT, did you have a chance to reproduce the issue?

sakno commented 2 years ago

Closing this issue due to lack of activity.