Every node in the cluster is actually identified by its ID. If a node has a different ID but the same address, it is considered a different node. This fact explains the behavior you mentioned:
Cluster starts with the following configuration:
A:1, State = Leader, Config = {A:1}, // because of coldStart
B:2, State = Standby, Config = ∅
C:3, State = Standby, Config = ∅
where the letters are node addresses and the numbers are identifiers.
Now advertise B and C to A:
A:1, State = Leader, Config = {A:1, B:2, C:3}
B:2, State = Standby, Config = ∅
C:3, State = Standby, Config = ∅
In the next round of heartbeats (replication), the config will be the same on each node:
A:1, State = Leader, Config = {A:1, B:2, C:3}
B:2, State = Follower, Config = {A:1, B:2, C:3}
C:3, State = Follower, Config = {A:1, B:2, C:3}
Now restart node A. Note that the identifier is generated randomly when the node starts unless specified explicitly.
A:4, State = Standby, Config = ∅
B:2, State = Follower, Config = {A:1, B:2, C:3}
C:3, State = Follower, Config = {A:1, B:2, C:3}
Assume that B is a new leader:
A:4, State = Standby, Config = {A:1, B:2, C:3}
B:2, State = Leader, Config = {A:1, B:2, C:3}
C:3, State = Follower, Config = {A:1, B:2, C:3}
Restart B:
A:4, State = Standby, Config = {A:1, B:2, C:3}
B:5, State = Standby, Config = ∅
C:3, State = Follower, Config = {A:1, B:2, C:3}
C is a new leader:
A:4, State = Standby, Config = {A:1, B:2, C:3}
B:5, State = Standby, Config = {A:1, B:2, C:3}
C:3, State = Leader, Config = {A:1, B:2, C:3}
If we restart C, all three nodes will be in the Standby state. This happens because a node transitions from Standby to Follower only if it can recognize itself in the replicated configuration. But A:1 ≠ A:4, so node A can't join the cluster.
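To make the failing check concrete, here is a minimal illustrative sketch (with hypothetical types and names, not the library's actual code) of the membership test behind the Standby-to-Follower transition:

// Illustrative sketch only (hypothetical types, not dotNext's actual code):
// a restarted node checks whether its freshly generated ID appears in the replicated member list.
static bool CanBecomeFollower(Guid localId, ISet<Guid> replicatedMemberIds)
    => replicatedMemberIds.Contains(localId); // A:4 is not in {A:1, B:2, C:3}, so A stays in Standby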
Thanks very much for explaining the issue. "Note that identifier is generated randomly when node starts unless specified explicitly." - how do I specify a fixed identifier for each node? I believe that will solve the problem when a node restarts and rejoin the cluster with the same Identifier.
Use the link from my previous post.
However, I think that there is room for improvement. The current implementation is too restrictive about members with unmatched IDs.
Also, the cluster behaves just as I described previously because you are using in-memory configuration storage. Change it to persistent storage and everything will be fine.
@aley1, could you please try the latest version from develop? Now the problem is gone even if IDs are not specified explicitly. Also, I removed the Id property from the IClusterMemberConfiguration interface.
I have tried the latest version and it works without specifying IDs. A question on how this new approach works: 3 nodes - A (IP: 10.1.1.1), B (10.1.1.2), C (10.1.1.3) - form a cluster. A leaves the cluster. A rejoins again but with a new IP address (10.1.1.4). I presume the cluster membership list is now: A (IP: 10.1.1.1), B (10.1.1.2), C (10.1.1.3), A (IP: 10.1.1.4), which is now a 4-node cluster. So I need to trigger RemoveMemberAsync to remove (IP: 10.1.1.1) to change it back to a 3-node cluster.
Correct. The new A is not actually the old A: they have different addresses, so they are treated as different nodes. ID is now not generated randomly and just acts as a shortcut for the network address. The same address produces the same ID. A cluster node, if unavailable, can be removed automatically by the leader without an explicit call of RemoveMemberAsync. Check this article for more information.
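As an illustration of the address-derived ID, here is a hypothetical sketch (not dotNext's real derivation) that hashes the endpoint deterministically, so a node restarted at the same address gets the same ID:

// Hypothetical sketch: a deterministic ID derived from the address.
static Guid IdFromAddress(System.Net.EndPoint address)
{
    using var md5 = System.Security.Cryptography.MD5.Create();
    var hash = md5.ComputeHash(System.Text.Encoding.UTF8.GetBytes(address.ToString()!));
    return new Guid(hash); // same endpoint string, same 16-byte hash, same ID
}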
Thanks. I've read the article and intend to use the PhiAccrualFailureDetector, but I'm having trouble writing the code to set this failure detector for the cluster. I tried looking up how to set init properties and the Func delegate, but I'm not sure of the syntax. Any tips here?
In case of DI, you need to register a Func<IRaftClusterMember, IFailureDetector> delegate implementation as a singleton. Without DI, just pass the factory to the FailureDetectorFactory property when instantiating a new RaftCluster class. Inside the factory, just return a new instance of PhiAccrualFailureDetector with the appropriate configuration.
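For the DI case, a minimal sketch (assuming the standard Microsoft.Extensions.DependencyInjection container; services is the host's IServiceCollection, and the Threshold value is just an example borrowed from the snippet below):

// Sketch: register the factory delegate as a singleton so it can be resolved by the cluster.
services.AddSingleton<Func<IRaftClusterMember, IFailureDetector>>(
    static member => new PhiAccrualFailureDetector { Threshold = 3D });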
FailureDetectorFactory is expecting Func<RaftClusterMember, IFailureDetector>. My lame attempt (sorry, I am not an expert in C#):
var cluster = new RaftCluster(configuration) { FailureDetectorFactory = CreateDetector(), };

static Func<RaftClusterMember, PhiAccrualFailureDetector> CreateDetector()
{
    var detector = new PhiAccrualFailureDetector();
    // Sorry this part I am not sure how to code it.
    return detector;
}
Or just use lambda syntax:
var cluster = new RaftCluster(configuration)
{
    FailureDetectorFactory = static member => new PhiAccrualFailureDetector { Threshold = 3D }
};
Thanks! It works now. I'll be testing the failure detection and might have some questions about it.
Tested with these scenarios: 4 nodes - 127.0.0.1:3261 (coldstart), 3262, 3263, 3264.
It looks like the coldstart node is not removed by the FailureDetector.
Here is the root cause: https://github.com/dotnet/dotNext/blob/4065800ef5fcab54539e0144d071032ada2e7961/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/LeaderState.cs#L140-L143
GetResult throws MemberUnavailableException at the first request to the ex-leader. At the same time, ClusterFailureDetector uses lazy initialization of its internal dictionary of failure detectors for each member. Lazy initialization is implemented by the ReportHeartbeat method, which is never called due to the exception mentioned previously. As a result, the ex-leader never participates in the calculations implemented by the detector :disappointed:
To be more precise, this problem is not related to the status of the node in the past.
I see no way to resolve this issue at the moment. Without the initial heartbeat, the phi accrual failure detector treats the node as alive. For instance, we start two nodes A and B, but the configuration consists of three nodes A, B, C. Assume that A is chosen as the leader. What should it do with C? It has never answered a request. Should we consider it dead immediately? The phi accrual detector can't provide the answer.
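For context, here is a minimal sketch of the phi calculation (the Akka-style logistic approximation of the normal CDF; not dotNext's exact code). It makes the problem visible: without at least one recorded heartbeat there is no mean or standard deviation of inter-arrival times, so phi is simply undefined for such a node:

// phi = -log10(P(a heartbeat still arrives after timeSinceLast)), given past inter-arrival stats.
static double Phi(double timeSinceLast, double mean, double stdDev)
{
    var y = (timeSinceLast - mean) / stdDev;
    var e = Math.Exp(-y * (1.5976 + 0.070566 * y * y)); // logistic approximation of the normal CDF
    var pLater = timeSinceLast > mean ? e / (1.0 + e) : 1.0 - 1.0 / (1.0 + e);
    return -Math.Log10(pLater); // larger phi means the member is more likely dead
}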
Now for the same configuration you can try the following:
var cluster = new RaftCluster(configuration)
{
    FailureDetectorFactory = static (estimate, member) => new PhiAccrualFailureDetector(estimate) { Threshold = 3D, TreatUnknownValueAsUnhealthy = true }
};
TreatUnknownValueAsUnhealthy = true is suitable for situations when the cluster is bootstrapped using the coldStart mechanism. If you have a pre-populated list of nodes, the failure detector will remove all unstarted nodes at the time of the first heartbeat round.
I have tested the changes, and it works as expected. Thanks for fixing this. I plan to use the coldStart mechanism, so this is suitable for me.
Hi @sakno
Need your advice on whether this is a bug or whether I am doing something incorrectly. I've attached the VS solution for the console app used to reproduce the error.
3 nodes: RaftNode1 - port 3261 (coldstart = true), RaftNode2 - port 3262, RaftNode3 - port 3263
Cluster setup steps:
Open 3 terminals and run each of the following in a separate terminal:
Copy the Local Member Id displayed on the console for RaftNode2.
In the console for RaftNode1, enter 'a', then enter the local member Id and port number for RaftNode2.
Copy the Local Member Id displayed on the console for RaftNode3.
In the console for RaftNode1, enter 'a', then enter the local member Id and port number for RaftNode3.
All 3 nodes are now part of the cluster, and a new leader is elected (the leader should be RaftNode1).
Steps to reproduce the error:
This error can be reproduced each time.
RaftTest.zip