dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/

Cluster fails to elect new leader after nodes leave and rejoin cluster #153

Closed aley1 closed 1 year ago

aley1 commented 1 year ago

Hi @sakno

I need your advice on whether this is a bug or whether I am doing something incorrectly. I've attached the VS solution for the console app used to reproduce the error.

3 nodes: RaftNode1 - port 3261 (coldstart = true), RaftNode2 - port 3262, RaftNode3 - port 3263

Cluster setup steps:

  1. Open 3 terminals and run each of the following on separate terminals:

    raftnode.exe 3261 true
    raftnode.exe 3262 false
    raftnode.exe 3263 false

  2. Copy the Local Member Id displayed on console for RaftNode2

  3. In console for RaftNode1, enter 'a', enter the local member Id and port number for RaftNode2

  4. Copy the Local Member Id displayed on console for RaftNode3

  5. In console for RaftNode1, enter 'a', enter the local member Id and port number for RaftNode3

  6. All 3 nodes are now part of the cluster, and a new leader is elected (the leader should be RaftNode1)

Steps to reproduce error:

  1. Terminate / stop RaftNode1 (leader)
  2. New leader should be elected between RaftNode 2 / 3
  3. Open a new terminal and run raftnode.exe 3261 false (to simulate RaftNode1 rejoining the cluster - this time with coldstart = false)
  4. RaftNode1 rejoins cluster and is now a follower
  5. Find out which node is the current leader, then terminate it. In the steps below we assume RaftNode2 is the leader and we stop it.
  6. New leader should be elected between RaftNode1 / 3 (let's assume it's RaftNode3)
  7. Run RaftNode2 again and it will rejoin the cluster
  8. Terminate RaftNode3 (current leader)
  9. Error here: No new leader is elected

This error can be reproduced each time.

RaftTest.zip

sakno commented 1 year ago

Every node in the cluster is actually identified by its ID. If a node has a different ID but the same address, it is considered a different node. This fact explains the behavior you mentioned:

Cluster starts with the following configuration:

A:1, State = Leader, Config = {A:1} // because of coldStart
B:2, State = Standby, Config = ∅
C:3, State = Standby, Config = ∅

where letters are node addresses and numbers are identifiers.

Now advertise B and C to A:

A:1, State = Leader, Config = {A:1, B:2, C:3}
B:2, State = Standby, Config =  ∅
C:3, State = Standby, Config =  ∅

In the next round of heartbeats (replication), the config will be the same on each node:

A:1, State = Leader, Config = {A:1, B:2, C:3}
B:2, State = Follower, Config =  {A:1, B:2, C:3}
C:3, State = Follower, Config =  {A:1, B:2, C:3}

Now restart node A. Note that the identifier is generated randomly when a node starts unless specified explicitly.

A:4, State = Standby, Config = ∅
B:2, State = Follower, Config =  {A:1, B:2, C:3}
C:3, State = Follower, Config =  {A:1, B:2, C:3}

Assume that B is a new leader:

A:4, State = Standby, Config = {A:1, B:2, C:3}
B:2, State = Leader, Config =  {A:1, B:2, C:3}
C:3, State = Follower, Config =  {A:1, B:2, C:3}

Restart B:

A:4, State = Standby, Config = {A:1, B:2, C:3}
B:5, State = Standby, Config =  ∅
C:3, State = Follower, Config =  {A:1, B:2, C:3}

C is a new leader:

A:4, State = Standby, Config = {A:1, B:2, C:3}
B:5, State = Standby, Config =  {A:1, B:2, C:3}
C:3, State = Leader, Config =  {A:1, B:2, C:3}

If we restart C, all three nodes will be in the Standby state. This happens because a node turns from Standby to Follower only if it is able to recognize itself in the replicated configuration. But A:1 ≠ A:4, so node A can't join the cluster.
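A simplified sketch of that rule (illustration only, not dotNext's actual code): a node promotes itself from Standby to Follower only when the replicated configuration contains its own address/ID pair.

// Illustration only, not the dotNext implementation.
static bool CanLeaveStandby((string Address, int Id) self, IReadOnlySet<(string Address, int Id)> replicatedConfig)
    => replicatedConfig.Contains(self); // {A:1, B:2, C:3} does not contain A:4, so A stays in Standby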

aley1 commented 1 year ago

Thanks very much for explaining the issue. "Note that the identifier is generated randomly when a node starts unless specified explicitly." - how do I specify a fixed identifier for each node? I believe that will solve the problem when a node restarts and rejoins the cluster with the same identifier.

sakno commented 1 year ago

Use the link from my previous post.

sakno commented 1 year ago

However, I think that there is room for improvement. The current implementation is too restrictive about members with unmatched IDs.

sakno commented 1 year ago

> Thanks very much for explaining the issue. "Note that the identifier is generated randomly when a node starts unless specified explicitly." - how do I specify a fixed identifier for each node? I believe that will solve the problem when a node restarts and rejoins the cluster with the same identifier.

Also, the cluster behaves just as I described previously because you are using in-memory configuration storage. Change it to persistent storage and everything will be fine.
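For example, a minimal sketch of choosing the configuration storage when building the node (the configuration type and the UseInMemoryConfigurationStorage/UsePersistentConfigurationStorage helpers are assumptions based on the dotNext docs; verify them against the version you use):

// Sketch only; the configuration type and storage helpers are assumptions, not taken from this thread.
var configuration = new RaftCluster.TcpConfiguration(new IPEndPoint(IPAddress.Loopback, 3261));

// In-memory storage forgets the membership list when the process restarts:
// configuration.UseInMemoryConfigurationStorage();

// Persistent storage survives restarts, so a rejoining node can still
// recognize itself in the replicated configuration:
configuration.UsePersistentConfigurationStorage("membership-config");

var cluster = new RaftCluster(configuration);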

sakno commented 1 year ago

@aley1 , could you please try the latest version from develop? Now the problem is gone even if IDs are not specified explicitly. Also, I removed the Id property from the IClusterMemberConfiguration interface.

aley1 commented 1 year ago

I have tried the latest version and it works without specifying IDs. A question on how this new approach works: 3 nodes - A (IP: 10.1.1.1), B (10.1.1.2), C (10.1.1.3) - form a cluster. A leaves the cluster. A rejoins again but with a new IP address (10.1.1.4). I presume the cluster membership list is now: A (IP: 10.1.1.1), B (10.1.1.2), C (10.1.1.3), A (IP: 10.1.1.4), which is now a 4-node cluster. So I need to trigger RemoveMemberAsync to remove (IP: 10.1.1.1) to change it back to a 3-node cluster.

sakno commented 1 year ago

Correct. Because the new A is not actually the old A. They have different addresses, so they are treated as different nodes. The ID is no longer generated randomly and just acts as a shortcut for the network address: the same address produces the same ID. An unavailable cluster node can be removed automatically by the leader without an explicit call to RemoveMemberAsync. Check this article for more information.
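Conceptually (illustration only, not dotNext's actual code), an address-derived member ID looks like this:

// Conceptual illustration only, not the dotNext implementation: a member ID derived
// deterministically from the node's endpoint, so a node that restarts with the same
// address keeps the same identity.
static Guid MemberIdFor(IPEndPoint endpoint)
{
    using var md5 = System.Security.Cryptography.MD5.Create();
    return new Guid(md5.ComputeHash(System.Text.Encoding.UTF8.GetBytes(endpoint.ToString())));
}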

aley1 commented 1 year ago

Thanks. I've read the article and intend to use the PhiAccrualFailureDetector, but I am having trouble writing the code to set this failure detector for the cluster. I tried looking up how to set init properties and the Func delegate, but I'm not sure of the syntax. Any tips here?

sakno commented 1 year ago

In case of DI, you need to register a Func<IRaftClusterMember, IFailureDetector> delegate implementation as a singleton. Without DI, just pass the factory to the FailureDetectorFactory property when instantiating a new RaftCluster class.

Inside of the factory, just return a new instance of PhiAccrualFailureDetector with the appropriate configuration.
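For the DI route, a minimal sketch (the IServiceCollection registration style and required usings are assumptions; the delegate signature and detector configuration come from this thread):

// Sketch only; assumes Microsoft.Extensions.DependencyInjection plus the relevant DotNext namespaces.
Func<IRaftClusterMember, IFailureDetector> factory =
    static member => new PhiAccrualFailureDetector { Threshold = 3D };
services.AddSingleton(factory); // registered as a singleton, as described above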

aley1 commented 1 year ago

FailureDetectorFactory is expecting Func<RaftClusterMember, IFailureDetector>. My lame attempt (sorry, I am not an expert in C#):

var cluster = new RaftCluster(configuration)
{
    FailureDetectorFactory = CreateDetector(),
};

static Func<RaftClusterMember, PhiAccrualFailureDetector> CreateDetector()
{
    var detector = new PhiAccrualFailureDetector(); // Sorry this part I am not sure how to code it.
    return detector;
}

sakno commented 1 year ago

https://github.com/dotnet/dotNext/blob/eea292409a58619def77442b50a95033bab12d8a/src/DotNext.Tests/Net/Cluster/Consensus/Raft/Http/RaftHttpClusterTests.cs#L400-L401

Or just use lambda syntax:

var cluster = new RaftCluster(configuration)
{
    FailureDetectorFactory = static member => new PhiAccrualFailureDetector { Threshold = 3D }
};

aley1 commented 1 year ago

Thanks! It works now. I'll be testing the failure detection and might have some questions about it.

aley1 commented 1 year ago

Tested with these scenarios. 4 nodes: 127.0.0.1:3261 (coldstart), 3262, 3263, 3264

  1. Start all 4 nodes and join them
  2. cluster has 4 members
  3. Close node 4. cluster has 3 members
  4. Add back node 4 - cluster has 4 members
  5. Close node 4
  6. Close node 3.
  7. Add node 3 back - cluster has 3 members
  8. Add node 4 back - cluster has 4 members
  9. Close node 1 (leader and coldstart node)
  10. Cluster state shows it still has 4 members <-- should be 3 members
  11. Close node 2 - the cluster fails even though a 3-member cluster should still be available with 2 members

It looks like the coldstart node is not removed by the FailureDetector

RaftTest.zip

sakno commented 1 year ago

Here is the root cause: https://github.com/dotnet/dotNext/blob/4065800ef5fcab54539e0144d071032ada2e7961/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/LeaderState.cs#L140-L143

GetResult throws MemberUnavailableException at the first request to the ex-leader. At the same time, ClusterFailureDetector uses lazy initialization of its internal dictionary of failure detectors for each member. Lazy initialization is performed by the ReportHeartbeat method, which is never called due to the exception mentioned previously. As a result, the ex-leader never participates in the calculations implemented by the detector :disappointed:

To be more precise, this problem is not related to the node's status in the past.

sakno commented 1 year ago

I see no way to resolve this issue at the moment. Without the initial heartbeat, the phi accrual failure detector treats the node as alive. For instance, we start two nodes A and B, but the configuration consists of three nodes: A, B, C. Assume that A is chosen as the leader. What should it do with C? It has never answered a request. Should we consider it dead immediately? The phi accrual detector can't provide the answer.
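To make the dilemma concrete, here is a self-contained sketch of the idea (PhiSketch and its members are hypothetical, not the dotNext API): a phi accrual detector needs at least one recorded heartbeat to estimate arrival intervals, so a member that has never reported a heartbeat always looks healthy.

// Hypothetical sketch, not the dotNext implementation; requires System,
// System.Collections.Generic and System.Linq.
sealed class PhiSketch
{
    private readonly List<double> intervalsMs = new();
    private DateTimeOffset? lastHeartbeat;

    public void ReportHeartbeat(DateTimeOffset now)
    {
        if (lastHeartbeat is { } previous)
            intervalsMs.Add((now - previous).TotalMilliseconds);
        lastHeartbeat = now;
    }

    public double Phi(DateTimeOffset now)
    {
        // Never heard from the member: no data to accuse it of being dead.
        if (lastHeartbeat is null || intervalsMs.Count == 0)
            return 0D;

        // Simplified exponential model instead of the usual normal distribution:
        // phi = -log10(P(silence lasts at least 'elapsed' ms)) = elapsed / (mean * ln 10).
        double mean = intervalsMs.Average();
        double elapsed = (now - lastHeartbeat.Value).TotalMilliseconds;
        return elapsed / (mean * Math.Log(10));
    }
}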

sakno commented 1 year ago

Now for the same configuration you can try the following:

var cluster = new RaftCluster(configuration)
{
    FailureDetectorFactory = static (estimate, member) => new PhiAccrualFailureDetector(estimate) { Threshold = 3D, TreatUnknownValueAsUnhealthy = true }
};

TreatUnknownValueAsUnhealthy = true is suitable for situations when the cluster is bootstrapping using the coldStart mechanism. If you have a pre-populated list of nodes, the failure detector will remove all unstarted nodes at the time of the first heartbeat round.

aley1 commented 1 year ago

I have tested the changes, and they work as expected. Thanks for fixing this. I plan to use the coldStart mechanism, so this is suitable for me.