atomix / copycat

A novel implementation of the Raft consensus algorithm
http://atomix.io/copycat
Apache License 2.0

Support for a Leader to leave the cluster cleanly #20

Closed by madjam 9 years ago

madjam commented 9 years ago

When a node that is the current leader leaves the cluster, it does so without properly updating the cluster configuration. This happens because the state transition from LEADER to LEAVE prevents the updated cluster configuration from being committed.

kuujo commented 9 years ago

So, you're obviously right and this is a bug. Really, I suppose the leader needs to step down and wait for a new leader to be elected before it can try to leave the cluster. There are definitely some behavioral concerns here. What if, e.g., the leader is one of 2/3 nodes? Presumably, no new leader will be elected and so the leader can't remove itself from the cluster. That's correct behavior, but I wonder how that should be exposed to the user. It's not currently clear whether a node successfully left a cluster (the cluster was reconfigured) or just timed out and shut down, and that's important information since the reconfiguration affects the quorum size. Should the close() future never complete if the cluster can't be reconfigured? Should the future be completed exceptionally? If the future is completed exceptionally, should the server still be shut down or should it remain active?

I think we can do a rudimentary implementation for this issue that simply extends the current algorithm to step down and wait for a leader to be elected or time out after n election timeouts, but there are certainly a lot of concerns around reconfiguration that need to be addressed.

kuujo commented 9 years ago

Ugh. I gotta stop writing in issues on my phone. That first word is "so." Auto-correct is the bane of my existence!

madjam commented 9 years ago

Fully agree that this is an issue with a lot of interesting scenarios.

My understanding of the Raft protocol's behavior around cluster configuration is that when the current leader wants to leave, it can first commit the updated configuration and then step down. Here is the relevant snippet from the paper: "The second issue is that the cluster leader may not be part of the new configuration. In this case, the leader steps down (returns to follower state) once it has committed the Cnew log entry."
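As an illustration of that rule, here is a minimal, self-contained sketch; the names (Configuration, commitConfiguration, stepDown, leaveAsLeader) are hypothetical stand-ins, not Copycat's actual API:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Sketch of the rule quoted above: a leaving leader first commits the new
// configuration (which excludes itself), and only then steps down.
class LeavingLeaderSketch {

  static class Configuration {
    final Set<String> members;
    Configuration(Set<String> members) { this.members = members; }

    Configuration without(String member) {
      Set<String> rest = new HashSet<>(members);
      rest.remove(member);
      return new Configuration(rest);
    }
  }

  // Stand-in for replicating a Cnew entry and waiting for it to commit.
  CompletableFuture<Void> commitConfiguration(Configuration cNew) {
    return CompletableFuture.completedFuture(null);
  }

  void stepDown() {
    // Transition back to follower; this node is no longer in Cnew.
  }

  CompletableFuture<Void> leaveAsLeader(String self, Configuration current) {
    Configuration cNew = current.without(self);  // Cnew excludes the leaving leader
    return commitConfiguration(cNew)             // commit Cnew as a log entry first...
        .thenRun(this::stepDown);                // ...and only then step down
  }
}
```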

My fear is that exiting no matter what after a timeout can result in a cluster that can no longer reach consensus. For example, if we have a 2 node cluster and one node decides to leave, then the remaining node cannot do anything useful. For clusters with more than 2 nodes we have slightly improved tolerance for this behavior: in a 3 node cluster the remaining 2 can still do useful work. But the moment another node decides to just leave, we are in trouble.

My preference would be to block the close() until a node can successfully leave. At least that way we are safe. Additionally, we could support another entry point to cluster reconfiguration. This can be useful in a 3 node scenario where one node has crashed and died and the admin wants to reconfigure the cluster so that it now only has 2 nodes.

kuujo commented 9 years ago

Alright, here's what I came up with. I modified the LeaveState to attempt to leave the cluster over up to three election timeouts. This ensures that in most cases a leader will be elected and the leaving node will be able to reconfigure the cluster. In the event that a new leader cannot be elected quickly enough (unlikely, particularly with the pre-vote protocol) or the server transitioning to the leave state results in a loss of quorum, the server will time out and the close() future will be completed exceptionally with a new ConfigurationException. I think this gives us the benefit of ensuring servers can leave when possible and notifying users when the reconfiguration fails.
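For illustration, a minimal sketch of that retry behavior; the names (attemptLeave, requestReconfiguration) and the timeout value are hypothetical, and only the ConfigurationException corresponds to something described above:

```java
import java.util.concurrent.*;

// Sketch: the leaving server retries its reconfiguration request once per
// election timeout, up to three times, and completes the close() future
// exceptionally if all attempts fail.
class LeaveRetrySketch {
  static final int MAX_ATTEMPTS = 3;
  final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  final long electionTimeoutMillis = 500; // assumed value for the sketch

  CompletableFuture<Void> leave() {
    CompletableFuture<Void> closeFuture = new CompletableFuture<>();
    attemptLeave(1, closeFuture);
    return closeFuture;
  }

  void attemptLeave(int attempt, CompletableFuture<Void> closeFuture) {
    requestReconfiguration().whenComplete((result, error) -> {
      if (error == null) {
        closeFuture.complete(null);  // cluster committed the new configuration
      } else if (attempt < MAX_ATTEMPTS) {
        // Wait one election timeout for a (new) leader, then retry.
        scheduler.schedule(() -> attemptLeave(attempt + 1, closeFuture),
            electionTimeoutMillis, TimeUnit.MILLISECONDS);
      } else {
        closeFuture.completeExceptionally(
            new ConfigurationException("failed to leave the cluster"));
      }
    });
  }

  // Stand-in for sending the leave/reconfigure request to the leader.
  CompletableFuture<Void> requestReconfiguration() {
    return CompletableFuture.completedFuture(null);
  }

  static class ConfigurationException extends RuntimeException {
    ConfigurationException(String message) { super(message); }
  }
}
```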

There are still some cases where this is not ideal. For instance, if 2/3 nodes in a cluster are alive and one of the servers leaves, it can't participate in the commitment of the configuration that removes it from the cluster, since it's in the leave state, which doesn't accept AppendEntries RPCs or vote in elections. However, perhaps it's not such a bad thing that this algorithm will effectively disallow removing a quorum from the cluster. That is, if 2/3 nodes are alive and one is removed, an exception will be thrown.

kuujo commented 9 years ago

Bah, I missed your comment referring to exactly the scenario I mentioned: a server leaving a cluster where 2/3 nodes are alive.

The use of LeaveState definitely prevents some reconfigurations. I think the issue of a server needing to participate in the quorum to remove itself should definitely be handled. I'll probably commit the changes I have and then work on improving upon that to handle all configuration cases. It might be the case that LeaveState needs to be replaced by some state-specific method to allow a leader to leave the cluster differently than a follower. In that case, essentially every server would commit the configuration before transitioning at all. That probably makes more sense and ensures servers contribute to quorums until the reconfiguration is complete. The alternative to a leader committing its own configuration change is transitioning to follower instead of the LeaveState and committing the reconfiguration through a new leader, but I don't see much benefit to that.

With respect to blocking on close(): because of the limitations of the current implementation, even with the commit I'm about to push, I think it should not block, since closing a server can result in a loss of quorum from the transition to LeaveState, and that could essentially mean a deadlock. Instead, I'll leave the ConfigurationException to handle this case for the time being and work on replacing LeaveState with a more elegant solution that allows servers to commit their configuration change before dropping out of the quorum.

kuujo commented 9 years ago

I suppose we could say that the goal for this feature should be to allow clusters to scale from 1 to 3 servers and back down, with failures taken into account. Currently, a cluster can scale from 3 nodes to 2 only if all three nodes are alive and healthy, and a cluster cannot scale down from 2 nodes to 1 node, since removing the leaving server from the quorum will prevent the reconfiguration.
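To make the quorum arithmetic behind those limits concrete, a tiny standalone example (plain Java, not project code): a 2-node cluster needs both votes to commit anything, so a configuration entry removing one member can never commit once that member stops participating in the quorum.

```java
// Quorum size for an n-member Raft cluster is floor(n / 2) + 1.
class QuorumMath {
  static int quorum(int members) {
    return members / 2 + 1;
  }

  public static void main(String[] args) {
    System.out.println(quorum(2)); // 2: both members must vote, so the leaver must stay active
    System.out.println(quorum(3)); // 2: a 3-node cluster tolerates one silent member
  }
}
```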

I'll take a stab at moving the reconfiguration logic into passive/follower/candidate/leader methods tomorrow.

kuujo commented 9 years ago

@madjam I decided to take a quick stab at it tonight. I haven't put 0594601ec48057fe543883e9d4d2f80f619cc3a8 through much testing other than running the existing join/leave tests. Basically, what I did was replace JoinState and LeaveState with methods in ServerState. The join() method behaves like JoinState did previously, but the leave() method essentially passes the leave call directly to the internal state to allow it to be handled by e.g. a follower, the leader, etc. Can you check this out and see what you think? I'm too tired to do any more hacking on this tonight, but I would be interested in your feedback.
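Roughly, the shape of that change might look like the following sketch; apart from ServerState, join(), and leave(), the class and method names here are illustrative, not the actual code in the commit:

```java
import java.util.concurrent.CompletableFuture;

// Sketch: instead of a dedicated LeaveState, the server delegates leave() to
// whatever state it is currently in, so a leader and a follower can handle
// leaving differently while staying in the quorum.
abstract class AbstractState {
  // Each concrete state decides how to remove this server from the cluster.
  abstract CompletableFuture<Void> leave();
}

class FollowerSketch extends AbstractState {
  @Override CompletableFuture<Void> leave() {
    // Forward a reconfiguration request to the current leader and keep
    // voting/accepting appends until the new configuration commits.
    return CompletableFuture.completedFuture(null);
  }
}

class LeaderSketch extends AbstractState {
  @Override CompletableFuture<Void> leave() {
    // Commit the configuration that excludes this server, then step down.
    return CompletableFuture.completedFuture(null);
  }
}

class ServerStateSketch {
  private AbstractState state = new FollowerSketch();

  // leave() is no longer a state transition; it is handled by the current
  // state so the server keeps contributing to quorums while it leaves.
  CompletableFuture<Void> leave() {
    return state.leave();
  }
}
```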

madjam commented 9 years ago

I have to take a closer look. But after a quick glance and running a couple of tests, I feel we'll have trouble shrinking a 2 node cluster to 1 node. The reason being: when a node (which is the leader) wants to leave, it transitions to the leave state, which is a descendant of the inactive state. In this state it fails all requests (poll, vote, append), so a new leader cannot be elected.

madjam commented 9 years ago

@kuujo Here are a couple of changes I tried yesterday that do manage to resize the cluster correctly in my testing. The idea is as follows: if the leaving node is the leader, it transitions into a special RENOUNCE state from which it will try to commit the cluster config (RenounceState is a descendant of LeaderState). On the other hand, if the leaving node is a follower, it will transition into LeaveState as before. The only difference now is that LeaveState is an ActiveState, meaning the node can still participate in commits. That way, in a 2 node cluster where the follower is leaving, the leader will still be able to commit the config change.

I felt this was the smallest diff that would get us to a working state. There is still some scope for failures to interfere, but I feel this should be safe. Just thought I would update the thread in case you find this direction useful.
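A minimal sketch of the hierarchy described above (class bodies elided; the real classes carry the request-handling logic):

```java
// The key points are the superclasses: RenounceState keeps leader behavior
// while committing the config change, and LeaveState becomes active so the
// leaving node keeps voting.
class ActiveState { /* handles poll, vote, and append requests */ }
class LeaderState extends ActiveState { /* replicates and commits entries */ }

// A leaving leader: still a leader, so it can commit the new configuration
// itself before shutting down.
class RenounceState extends LeaderState { }

// A leaving follower: now active rather than inactive, so the remaining
// leader can still reach a quorum that includes this node while committing
// the configuration change.
class LeaveState extends ActiveState { }
```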

kuujo commented 9 years ago

I mentioned that the master branch does indeed have the problem wherein a cluster cannot be downsized from 2 nodes to 1 node, for the LeaveState-related reasons you mentioned.

But the cluster-configuration branch removes the LEAVE state and replaces it with similar retry behavior that operates on existing states. So, if close() is called on a LEADER the LeaderState will handle the reconfiguration, and if close() is called on a FOLLOWER the FollowerState will handle it. This is the proper/cleaner pattern for handling state-specific logic.

I added a test to the cluster-configuration branch that resizes the cluster from 1 to 3 nodes and back down. The reconfiguration test actually exposed a bug when servers are joining and leaving the cluster concurrently. That was fixed in the last commit.

kuujo commented 9 years ago

Tests were broken due to the fix to configuration changes.

One of the existing tests submitted commands to a 3/5 node cluster. The test cleanup close()es servers once the tests are complete. But in the case of a 3/5 node cluster, once the first server leaves the cluster it becomes a 2/4 node cluster, and so a new configuration cannot be committed to remove the second node from the cluster. This is good behavior. I'm updating the tests to encapsulate reconfiguration only in the appropriate test methods.

kuujo commented 9 years ago

Also, while I'm at it, #23 fixes the bug recently found in single-member configuration changes.

kuujo commented 9 years ago

All tests pass. I'm closing this and we should open new issues for bugs in the implementation.

We're working on more advanced Jepsen tests that include reconfiguration. This will hopefully help expose any potential flaws in the implementation that might e.g. allow a dual majority.