unexpected election timeouts when rejecting vote requests in raft

freddyrios commented 2 years ago

When handling vote requests in the following method, the election timeout when rejecting vote requests (when line 544 is false): https://github.com/dotnet/dotNext/blob/6b01f63db0201f6a0303027a3ca03e99ff22ef15/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs#L516-L552

Note that this happens both explicitely in line 533, and indirectly in line 529. It seems to be problematic in both cases, although in the later when it is a leader it does need to start a new election timeout (it is supposed to change though, instead of remaining the same which seems to be the case following the implementation of StepDown).

Some of this is not explicit at all in the raft papers, not even the phd dissertation. However, one can simulate various scenarios in the raft visualization by sending requests/stopping/resuming/timing out nodes and seeing how exactly it deals with those cases https://raft.github.io/

More info on some of the scenarios related to the votes being rejected with the checks in line 529:

(low impact) we already voted for a node and a second node timed out (goes to line 533). The node first saw the election when it first voted, so it resets its election timeout accordingly (so it can propose itself as a leader if the candidate fails to be established as a leader). By resetting it on the second vote, we are extending our election timeout a second time potentially beyond a single election time window. It reduces our chances of becoming a candidate if the election fails.
(high impact) we have committed entries the vote requester does not have. Without the reset it is now more likely to time out first before the sender (which will reset its timeout after losing the election) + it is also more likely to time out before other nodes that don't have the committed entries (as the accepted vote would reset their election timeout). By resetting here we are removing the advante of nodes with committed entries, so more rounds are needed until by chance one of the nodes with the committed entries becomes the leader.

Consider how the later scenario can play when there are only 2 out of 3 nodes remaining. One of them has an entry the other one does not. If one is very unlucky with timings then the node without the extra entries keeps becoming a candidate first on many rounds. This alone does not match the many rounds I have seen it take, but this might combine with other implementation details in dotnext raft, as maybe an unlucky election timeout might keep getting reused.

freddyrios commented 2 years ago

@sakno I will likely be trying this change locally, so I could turn it into a PR. A follow up I might be doing depending on the results is exploring how the reuse of timeouts plays with this and other raft behaviors (but that should likely be a topic for a separate time).

sakno commented 2 years ago

@freddyrios , thanks for your work! As for indicated potential issue, there are few protections that are implemented already:

PreVote message which is not covered in original Raft paper. See this doc for more details. This type of RPC call allows to prevent term inflation.
Leader stickiness: https://github.com/dotnet/dotNext/blob/6b01f63db0201f6a0303027a3ca03e99ff22ef15/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs#L494-L498

These two extensions prevent the cluster from re-elections in case of unavailable leader for a short time. Candidate state has this additional step called pre-voting: https://github.com/dotnet/dotNext/blob/6b01f63db0201f6a0303027a3ca03e99ff22ef15/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs#L695-L729

So I expect that the situations described previously by you can be mitigated by this extra step. Anyway, PRs are welcome! You can submit changes and we can take a look at them together.

freddyrios commented 2 years ago

@sakno thanks for the reply.

I agree prevote prevents the second scenario, since a node can't become a candidate (and send related vote requests) if the majority has more committed entries than the node.

I guess the first scenario could still happen if the leader does go down. So some nodes do not have a fair chance to win a new election if the current election ends with a split vote.

That said, we should probably close this issue, as I don't currently have a strong scenario where that is shown to be the root cause of issues. More below.

I found more about this: This alone does not match the many rounds I have seen it take, but this might combine with other implementation details in dotnext raft, as maybe an unlucky election timeout might keep getting reused.

Apparently depending on the size and frequency of entries being saved vs. some of the settings raft dotnext supports, the implementation is affected by some of the unstability issues we were seeing. This includes problems holding an elected leader.

Specifically, we found that tuning these greatly helped the stability of some of our test deployments:

TcpConfiguration.LowerElectionTimeout
TcpConfiguration.UpperElectionTimeout
TcpConfiguration.TransmissionBlockSize
MemoryBasedStateMachine.Options.BufferSize
MemoryBasedStateMachine.Options.InitialParitionSize

Note that increasing the election timeouts alone was not enough to achieve a cluster that is unlikely to go into recurrent elections and/or recurrent request timeouts. We see more stable behavior even when saving 8 KB entries when the parameters are turned compared to sending just the 8 bytes the example sends without changing those configurations. When we are sending 8 KB entries we increased the values of transmisssion block size, buffer size and initial partition size by 1000 compared to those in the example (as our data is that much larger, but no extra exploration has been done in a smaller increment being suitable). For the lower and upper election timeouts doubling the values was enough.

So the only thing I am left wondering about the unstability with smaller transmission block size, buffer size and initial partition size is if the cluster would still have stuck in recurrent elections in the same way without the behavior I mentioned in the second scenario of the ticket or if all nodes would have faced the same fate.

ps. with the current configuration killing only one node out of 3 tends to consistently recover, while killing 2 nodes out of 3 + restarting 1 killed one does not always recover succesfully (extra program restarts if it fails to start does tend to recover it).

sakno commented 2 years ago

while killing 2 nodes out of 3 + restarting 1 killed one does not always recover succesfully

It would be great if you have a working example that reproduces this situation.

sakno commented 2 years ago

One more question - is this behavior specific to TCP transport? Did you observe the same for UDP/HTTP?

sakno commented 2 years ago

Another potential reason is suboptimal implementation of TCP transport for Raft. UDP/TCP both shares the same transmission flow control procedure. Historically, my primary focus as HTTP implementation. TCP/UDP transports were added later. To reduce development efforts, I decided to share the same logic for both transports.

Transmission flow control is based on Exchange concept. Exchange is a dialogue between the client and the server. During the exchange, the client and the server send packets (a logical portion of data) to each other using Request-Reply pattern. These packets are then translated down to the transport layer. For instance, AppendEntries as the most complex RPC call has the following flow: https://github.com/dotnet/dotNext/blob/30e4188ef7cf7482a1eaa0872ef1ebfae2e79b57/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/TransportServices/EntriesExchange.cs#L13-L26

That was absolutely reasonable for UDP because it has no connection concept and packet ordering. As a result, packet ordering is guaranteed through a sequence of requests/acknowledgments. For example, transmission of small log entry that fits to single UDP packet requires REQ packet and REP packet The situation is much worse if a log entries doesn't fit to single UDP packet. In this situation, the transmission of a single log entry requires multiple REQ/REP packets. As a result, a single call of AppendEntries causes a bunch of packets to be transferred over the wire in both directions.

This architecture is not very efficient for TCP. Probably, it's time to rethink the current implementation.

freddyrios commented 2 years ago

while killing 2 nodes out of 3 + restarting 1 killed one does not always recover succesfully

It would be great if you have a working example that reproduces this situation.

I don't have a test, but it happens when running in a 3 cluster node with a modification of the example in this branch: https://github.com/freddyrios/dotNext/tree/feature/bigstruct-example. It only has one commit with notes on what changed, but the modified example is more of a fairly busy system, which could play a part.

One more question - is this behavior specific to TCP transport? Did you observe the same for UDP/HTTP?

I have not tried it with other transports.

freddyrios commented 2 years ago

thanks for the fixes, working great now

dotnet / dotNext

unexpected election timeouts when rejecting vote requests in raft #89