dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/

short lived leader election reported when starting a raft cluster #168

Closed: freddyrios closed this issue 1 year ago

freddyrios commented 1 year ago

The issue can be reproduced with the example project by building it locally and adding a cluster.bat to the output folder with the following lines:

START /B .\RaftNode.exe tcp 3262 node2 > node2.log
START /B .\RaftNode.exe tcp 3263 node3 > node3.log
START /B .\RaftNode.exe tcp 3264 node4 > node4.log

Run: del -r node* && .\cluster.bat

I can reproduce it within 2-6 attempts on a Windows x64 machine. The issue was originally found on a Raspberry Pi (ARM, Linux).

The leader prints this in its log:

New cluster leader is elected. Leader address is 127.0.0.1:3260
Term of local cluster member is 1. Election timeout 00:00:00.1590000
Consensus cannot be reached
Term of local cluster member is 1. Election timeout 00:00:00.1590000
New cluster leader is elected. Leader address is 127.0.0.1:3267
Term of local cluster member is 2. Election timeout 00:00:00.1590000
Accepting value 500
Accepting value 1000
Accepting value 1500
Accepting value 2000
...

The other nodes print this (the leader additionally prints the save messages):

New cluster leader is elected. Leader address is 127.0.0.1:3267
Term of local cluster member is 2. Election timeout 00:00:00.1700000
Accepting value 500
Accepting value 1000
Accepting value 1500
Accepting value 2000
...

When done with the run, the RaftNode processes need to be killed via Task Manager since they are running in the background.
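
If it is easier than hunting them down in Task Manager, a single command should also do it (assuming no other RaftNode.exe instances are running):

taskkill /F /IM RaftNode.exe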

Originally posted by @freddyrios in https://github.com/dotnet/dotNext/discussions/167#discussioncomment-6062222

sakno commented 1 year ago

Leader address is 127.0.0.1:3260

How is it possible if there are only 3262, 3263, 3264 nodes?

freddyrios commented 1 year ago

@sakno sorry for the noise, I pasted the log messages from the 10-node test run because they looked the same and forgot about it. Here are the actual entries for the 3-node run:

New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 2. Election timeout 00:00:00.2510000
Accepting value 500
Accepting value 1000
Accepting value 1500
Accepting value 2000

New cluster leader is elected. Leader address is 127.0.0.1:3263
Term of local cluster member is 1. Election timeout 00:00:00.1820000
Consensus cannot be reached
Term of local cluster member is 1. Election timeout 00:00:00.1820000
New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 2. Election timeout 00:00:00.1820000
Accepting value 500
Accepting value 1000
Accepting value 1500
Accepting value 2000

New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 2. Election timeout 00:00:00.2560000
Saving value 500 generated by the leader node
Accepting value 500
Saving value 1000 generated by the leader node
Accepting value 1000
Saving value 1500 generated by the leader node
Accepting value 1500
Saving value 2000 generated by the leader node
Accepting value 2000

sakno commented 1 year ago

So the problem with 3263 node, right? It was elected as a leader, and that fact was not reflected by other nodes.

freddyrios commented 1 year ago

Yes, that is the issue.


sakno commented 1 year ago

In theory, it's possible (and correct from the Raft perspective). But it should be a very rare situation. I'll take a look.

sakno commented 1 year ago

Maybe related PR #170

sakno commented 1 year ago

Can you turn on Debug severity and attach logs for each node separately?

freddyrios commented 1 year ago

Switched to file logging that also shows timestamps, and added an extra log message just before calling StartAsync. node4.log node2.log node3.log

freddyrios commented 1 year ago

Forgot to add that I set logging to Trace and did not get anything extra.

sakno commented 1 year ago

The example code uses Console.WriteLine instead of the logging infrastructure. It seems like the logs were not written at all (or were written to another source, such as a file instead of stdio).

freddyrios commented 1 year ago

It turned out a lot of the logging is only enabled for HTTP.

Here are the details of one of the runs after enabling it for TCP.

node2.log node3.log node4.log

Some things that I found interesting:

  1. term 0 is skipped, as nodes transitioning to leader change to term 1
  2. at 41.8886 node 3 logs that, in term 1, node 2 rejected the vote but node 4 approved it
  3. node 2 accepts the loss by downgrading to follower at 41.9006, while node 3 says it became a leader at 41.9008
  4. node 4's first raft log entry is at 41.8900, which is .0014 after it supposedly voted for node 3.
  5. at 41.8900 node 4 says twice that it downgrades to follower in term 1 and that "Election timeout is refreshed", but then incorrectly immediately transitions to candidate.
  6. later node 4 starts and wins a vote in term 2 as normal.
  7. there are still some funny artifacts at this point though, for example:
    1. node 4 reports node 3's rejection 29 ms after node 2 approved the vote
    2. (maybe expected) node 3 reports a leader change to no leader in term 1, when we have already entered the term 2 election
    3. node 3 reports "Leader state reverted with error" ... a task was canceled.

sakno commented 1 year ago

term 0 is skipped, as nodes transitioning to leader change to term 1

This is fine. Term 0 is an initial term for every fresh node. Transition to Candidate increases the term.
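
A minimal sketch of that rule (hypothetical names, not the actual dotNext types):

// Sketch only: a fresh node starts at term 0 and the Follower -> Candidate
// transition increments the term before votes are requested, so the first
// leader is always elected in term 1 or later.
class TermSketch
{
    public long CurrentTerm { get; private set; }   // 0 for a fresh node

    public long MoveToCandidate() => ++CurrentTerm; // 0 -> 1 on the first election
}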

freddyrios commented 1 year ago
  1. at 41.8900 node 4 says twice it downgrades to follower in term1 and that "Election timeout is refreshed". But then incorrectly inmediately transitions to candidate.

I missed that it says "Member is downgrading" vs. "Member is downgraded", so it's not a double message of that. The timeout message does show twice though.

sakno commented 1 year ago

I found a way to repro the issue with an isolated unit test (see commit above). Test fails and reports that some nodes in the cluster see different leaders immediately after start.

freddyrios commented 1 year ago

I also tracked down some of it on my side.

At least in the last capture I made, the election timer triggers 35 ms before the node transitions to Candidate. In between those two steps, the node can vote for others.

sakno commented 1 year ago

I forgot to add ColdStart=false to the configuration. Now the issue is not reproducible. You can try the same using the TcpTransportTests.ConcurrentElection() test.

freddyrios commented 1 year ago

At least in the last capture I made, the election timer triggers 35 ms before the node transitions to Candidate. In between those two steps, the node can vote for others.

I did not have time to post the details yesterday.

It seems the root cause is that, in the process of becoming a candidate, the node can still grant votes, and when it does, it still becomes a candidate.

One potential solution is in RaftCluster.MoveToCandidateState, after taking the transitionSync lock, to abort becoming a candidate if FollowerState.Refresh was called after the timeout was set. PreVote already seems to work similarly there. It seems the lock would prevent votes while we are checking this, which should remove the ambiguity between voting and becoming a candidate. Additionally, it seems FollowerState.Refresh and the related caller logic are called under the lock too, so it could help avoid similar issues in other uses that expect the timeout to have been refreshed if the node was still a follower.

This is the capture from the node that I used to track it down:

2023-06-05 16:24:35.8528|INFO|Program|calling StartAsync
2023-06-05 16:24:36.1754|INFO|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|timer timed out, transitioning to candidate state!
2023-06-05 16:24:36.1840|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member is downgrading to follower state with term 1
2023-06-05 16:24:36.1840|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Election timeout is refreshed
2023-06-05 16:24:36.1840|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member is downgraded to follower state with term 1
2023-06-05 16:24:36.2010|INFO|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Transition to Candidate state has started with term 1
2023-06-05 16:24:36.2010|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Voting is started with timeout 258 and term 2

The "timer timed out" log entry is an entry I added to FollowerState.Track when it exits the loop and sets timedOut = true.

Here is my interpretation of it:

  1. at 36.1754 FollowerState.Track times out and begins the move to candidate state
  2. The move to candidate state involves these actions: queue the move to candidate on the ThreadPool, the work is picked up and PreVote runs (at least one other node needs to prevote), take the transitionSync lock, check FollowerState { IsExpired: true }, log "Transition to Candidate state has started with term 1", increment the term & actually become a candidate, start voting
  3. notice that between "timer timed out" and "Transition to Candidate state has started" at 36.2010 there are 25 ms (35 ms was a mistake in my message yesterday)
  4. The log of the short lived leader (not included above) logs at 36.1953 that the problem node granted it the vote.
  5. The other messages at 36.1840 must be when the problem node was granting the above vote
  6. The above means the problem node was granting votes when it had already decided to become a candidate. When granting votes the node is supposed to reset its timer, which it tries to do via FollowerState.Refresh, but that is just ignored since the timeout had already triggered, as sketched below.
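
A rough sketch of that race, with hypothetical names (this is not the actual dotNext code, just the shape of the problem as I understand it):

using System;
using System.Threading;

class FollowerRaceSketch
{
    private volatile bool timedOut;          // set by the election timer
    private volatile bool refreshRequested;  // set when a vote is granted

    // Election timer callback: mark the follower state as expired and only
    // queue the Candidate transition; it does not run immediately.
    public void OnElectionTimeout()
    {
        timedOut = true;
        ThreadPool.QueueUserWorkItem(_ => MoveToCandidateState());
    }

    // Vote handler: granting a vote is supposed to reset the election timer,
    // but once timedOut is set the refresh no longer has any effect.
    public void GrantVote()
    {
        refreshRequested = true;
        if (timedOut)
            Console.WriteLine("vote granted after timeout: refresh is lost");
    }

    private void MoveToCandidateState() =>
        Console.WriteLine($"becoming Candidate even though refreshRequested={refreshRequested}");
}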
sakno commented 1 year ago

One potential solution is in RaftCluster.MoveToCandidateState, after taking the transitionSync lock, to abort becoming a candidate if FollowerState.Refresh was called after the timeout was set. PreVote already seems to work similarly there. It seems the lock would prevent votes while we are checking this, which should remove the ambiguity between voting and becoming a candidate. Additionally, it seems FollowerState.Refresh and the related caller logic are called under the lock too, so it could help avoid similar issues in other uses that expect the timeout to have been refreshed if the node was still a follower.

I'll check this out. However, I can tell that this part of the algorithm has remained unchanged for a long time. Also, I recommend you check the develop branch because of #170, which is probably a source of competing log entries.

freddyrios commented 1 year ago

FYI I pulled the latest develop after my last message and reproduced it again.

freddyrios commented 1 year ago

By the way, the issue is present at least since version 4.4.1, which is the one we have been running, so it makes sense that it relates to code that has not changed over time. We had some issues on our side that made it harder to notice.

sakno commented 1 year ago

Fixed as follows: check whether the refresh has been requested after taking the lock inside MoveToCandidateState. If so, resume Follower state instead of moving to Candidate.
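
Roughly, the shape of the check is as follows (hypothetical names here; the real change lives in RaftCluster.MoveToCandidateState and differs in detail):

using System.Threading;
using System.Threading.Tasks;

class CandidateTransitionSketch
{
    private readonly SemaphoreSlim transitionLock = new(1, 1); // stands in for transitionSync
    private volatile bool refreshRequested;                    // set when a vote is granted
    private long currentTerm;

    public async Task MoveToCandidateStateAsync()
    {
        await transitionLock.WaitAsync();
        try
        {
            if (refreshRequested)
            {
                // A vote was granted between the timeout and this point:
                // resume Follower state with the current term instead of campaigning.
                ResumeFollower();
                return;
            }

            currentTerm++;                 // no competing vote: campaign as usual
            await StartVotingAsync(currentTerm);
        }
        finally
        {
            transitionLock.Release();
        }
    }

    private void ResumeFollower() { /* restart the election timer */ }
    private Task StartVotingAsync(long term) => Task.CompletedTask;
}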

freddyrios commented 1 year ago

works great, thanks!

Ran the example reproduction at least 15 times, and in most cases it properly elects the leader in term 1. The other case was when all nodes became candidates close to each other, rejected each other's votes, and then elected a leader in term 2 (as expected in Raft).

sakno commented 1 year ago

Perfect, I'll publish a new release today.

sakno commented 1 year ago

Fixed in 4.12.2