Leader address is 127.0.0.1:3260
How is it possible if there are only 3262, 3263, 3264 nodes?
@sakno sorry for the noise, I used the log messages from the 10-node test run because they looked the same and forgot about it. Here are the actual entries for the 3-node run:
New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 2. Election timeout 00:00:00.2510000
Accepting value 500
Accepting value 1000
Accepting value 1500
Accepting value 2000
New cluster leader is elected. Leader address is 127.0.0.1:3263
Term of local cluster member is 1. Election timeout 00:00:00.1820000
Consensus cannot be reached
Term of local cluster member is 1. Election timeout 00:00:00.1820000
New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 2. Election timeout 00:00:00.1820000
Accepting value 500
Accepting value 1000
Accepting value 1500
Accepting value 2000
New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 2. Election timeout 00:00:00.2560000
Saving value 500 generated by the leader node
Accepting value 500
Saving value 1000 generated by the leader node
Accepting value 1000
Saving value 1500 generated by the leader node
Accepting value 1500
Saving value 2000 generated by the leader node
Accepting value 2000
So the problem is with node 3263, right? It was elected as a leader, and that fact was not reflected by the other nodes.
Yes, that is the issue.
In theory, it's possible (and correct from the Raft perspective), but it should be a very rare situation. I'll take a look.
Maybe related: PR #170
Can you turn on Debug severity and attach logs for each node separately?
Forgot to add that I set logging to Trace and did not get anything extra.
Example code uses Console.WriteLine instead of the logging infrastructure. It seems like the logs were not written at all (or were written to another target, such as a file instead of stdout).
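For reference, a minimal sketch of getting Debug/Trace output to the console with Microsoft.Extensions.Logging; how the resulting ILoggerFactory is handed to the cluster depends on the transport configuration used in the example project and is not shown here:

```csharp
using Microsoft.Extensions.Logging;

// Console logger factory with Trace as the minimum severity.
// AddConsole() comes from the Microsoft.Extensions.Logging.Console package.
using var loggerFactory = LoggerFactory.Create(builder =>
    builder.AddConsole().SetMinimumLevel(LogLevel.Trace));

// Sanity check that the pipeline actually writes to stdout.
var logger = loggerFactory.CreateLogger("RaftNode");
logger.LogDebug("logging pipeline is alive");
```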
It turned out that a lot of the logging is only enabled for HTTP. Here are the details of one of the runs after enabling it for TCP.
Some things that I found interesting:
- term 0 is skipped, as nodes transitioning to leader change to term 1
This is fine. Term 0 is the initial term for every fresh node. Transition to Candidate increases the term.
- at 41.8900, node 4 says twice that it downgrades to follower in term 1 and that "Election timeout is refreshed", but then incorrectly immediately transitions to candidate.
I missed that it says "Member is downgrading" vs. "Member is downgraded", so it's not a duplicate of that message. The timeout message does show up twice, though.
I found a way to repro the issue with an isolated unit test (see the commit above). The test fails and reports that some nodes in the cluster see different leaders immediately after start.
I also tracked down some of it on my side.
At least in the last capture I made, the election timer triggers 35 ms before the node transitions to Candidate. In between those two steps, the node can still vote for others.
I forgot to add ColdStart=false to the configuration. Now the issue is not reproducible. You can try the same using the TcpTransportTests.ConcurrentElection() test.
At least in the last capture I made, the election timer triggers 35 ms before the node transitions to Candidate. In between those two steps, the node can still vote for others.
I did not have time to post the details yesterday.
It seems the root cause is that, in the process of becoming a candidate, the node can still grant votes, and when it does, it still becomes a candidate anyway.
One potential solution is, in RaftCluster.MoveToCandidateState, after taking the transitionSync lock, to abort becoming a candidate if FollowerState.Refresh was called after the timeout was set. PreVote already seems to work similarly there. The lock would prevent votes while we are checking this, which should remove the ambiguity between voting and becoming a candidate. Additionally, FollowerState.Refresh and the related caller logic seem to run under the lock too, so this could help avoid similar issues in other places that expect the timeout to have been refreshed if the node was still a follower.
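Roughly the idea (this is only a toy model of the proposal, not the actual dotNext code; apart from the "check for a refresh after acquiring the transition lock" shape, all names below are made up for illustration):

```csharp
using System.Threading;
using System.Threading.Tasks;

// Toy model of the proposed check; NOT the dotNext implementation.
sealed class FollowerModel
{
    private readonly SemaphoreSlim transitionLock = new(1, 1); // stands in for transitionSync
    private volatile bool refreshed; // set by Refresh(), cleared whenever the election timer is armed

    // Called when the node grants a vote or accepts a heartbeat from a leader.
    public void Refresh() => refreshed = true;

    // Called whenever the election timeout is (re)started.
    public void ArmElectionTimer() => refreshed = false;

    // Called when the election timer fires.
    public async Task<bool> TryMoveToCandidateAsync()
    {
        await transitionLock.WaitAsync();
        try
        {
            // Key check: if Refresh() ran after the timeout was set (for example,
            // because this node just granted a vote), abort the transition and
            // resume Follower state instead of starting an election.
            if (refreshed)
                return false;

            // ...increment the term, transition to Candidate, request votes...
            return true;
        }
        finally
        {
            transitionLock.Release();
        }
    }
}
```

In this model, a vote granted between the timer firing and the lock acquisition flips the flag, so the transition is aborted and the node stays a follower.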
This is the capture from the node that I used to track it down:
2023-06-05 16:24:35.8528|INFO|Program|calling StartAsync
2023-06-05 16:24:36.1754|INFO|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|timer timed out, transitioning to candidate state!
2023-06-05 16:24:36.1840|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member is downgrading to follower state with term 1
2023-06-05 16:24:36.1840|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Election timeout is refreshed
2023-06-05 16:24:36.1840|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member is downgraded to follower state with term 1
2023-06-05 16:24:36.2010|INFO|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Transition to Candidate state has started with term 1
2023-06-05 16:24:36.2010|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Voting is started with timeout 258 and term 2
The "timer timed out" log entry is an entry I added to FollowerState
Here is my interpretation of it:
One potential solution is, in RaftCluster.MoveToCandidateState, after taking the transitionSync lock, to abort becoming a candidate if FollowerState.Refresh was called after the timeout was set. PreVote already seems to work similarly there. The lock would prevent votes while we are checking this, which should remove the ambiguity between voting and becoming a candidate. Additionally, FollowerState.Refresh and the related caller logic seem to run under the lock too, so this could help avoid similar issues in other places that expect the timeout to have been refreshed if the node was still a follower.
I'll check this out. However, I can tell that this part of the algorithm has remained unchanged for a long time. Also, I recommend you check the develop branch because of #170, which is probably a source of competing log entries.
FYI I pulled the latest develop after my last message and reproduced it again.
By the way, the issue has been present at least since version 4.4.1, which is the one we have been running, so it makes sense that it relates to code that has not changed over time. We had some issues on our side that made it harder to notice this one.
Fixed as follows: check whether a refresh has been requested after taking the lock inside MoveToCandidateState. If so, resume Follower state instead of moving to Candidate.
works great, thanks!
I ran the example reproduction at least 15 times, and in most cases it properly elects the leader in term 1. The other case was when all nodes became candidates close to each other, rejected each other's votes, and then elected a leader in term 2 (as expected in Raft).
Perfect, I'll publish a new release today.
Fixed in 4.12.2
The issue can be reproduced with the example project by building it locally and adding a cluster.bat to the output folder that has the following lines:
Run:
del -r node* && .\cluster.bat
I can reproduce it within 2-6 attempts on a Windows x64 machine. The issue was originally found on a Raspberry Pi (ARM, Linux).
The leader prints this in its log:
Other nodes print the following (and the leader additionally prints the save messages):
When done with the run, the RaftNode processes need to be killed via Task Manager since they keep running in the background.
Originally posted by @freddyrios in https://github.com/dotnet/dotNext/discussions/167#discussioncomment-6062222