atomix / copycat

A novel implementation of the Raft consensus algorithm
http://atomix.io/copycat
Apache License 2.0

Node cannot (re)join cluster if restarted (quickly) #73

Closed: bgloeckle closed this issue 8 years ago

bgloeckle commented 8 years ago

Assume a cluster with 2 nodes. Node 1 is the leader and node 2 is a follower or passive. Now node 2 is shut down (= kill -9), but restarted quickly. This means that node 2 lost all of its in-memory state and no longer knows about the leader of the cluster. The CopycatServer will then be #open()ed again and will try to re-join the cluster: it will send a JoinRequest to node 1, receive a JoinResponse and then directly transition to passive or follower again (in ServerState#join(Iterator, CompletableFuture)). The problem now is that CopycatServer#open() on node 2 waits until it receives information on who the leader is, and only then completes its CompletableFuture. But it will never receive the leader's address. This is set (if I see this correctly) in *State#append (LeaderState, ActiveState, PassiveState), but that method won't be called, because node 1 (the leader) decides not to send a new ConfigurationEntry, as it thinks the node already belongs to the cluster (see LeaderState#join(JoinRequest), the if where the cluster's members are checked).
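
Paraphrased, the branch I mean looks roughly like this (illustrative types and names only, not the actual Copycat source; only the class and method names mentioned above are real):

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Sketch of the leader's join handling as described above; not Copycat code.
class LeaderJoinSketch {
    record JoinRequest(String member) {}
    record JoinResponse(boolean ok) {}

    Set<String> members; // the leader's view of the current configuration

    CompletableFuture<JoinResponse> join(JoinRequest request) {
        if (members.contains(request.member())) {
            // Node is already a member: reply OK, but no new ConfigurationEntry
            // is appended, so nothing on this path tells the rejoining node
            // who the current leader is.
            return CompletableFuture.completedFuture(new JoinResponse(true));
        }
        // New member: append and replicate a ConfigurationEntry, which would
        // carry the leader's identity to the joining node.
        return appendConfigurationEntry(request);
    }

    CompletableFuture<JoinResponse> appendConfigurationEntry(JoinRequest request) {
        return CompletableFuture.completedFuture(new JoinResponse(true)); // elided
    }
}
```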

Am I using the CopycatServer wrong somehow? AFAIK there cannot be a "timeout" after which a node is assumed to have left the cluster without sending a LeaveRequest (as that would effectively allow split-brain), so the above problem would occur not only if a node is restarted quickly, but also if it is restarted after some time, e.g. after it has been killed (kill -9, a network partition where the ops people decided to restart the servers, power loss on the machines, ...). Are there any integration tests that cover such behavior? This should be fairly simple to test: start a cluster, kill one node that is not the leader, restart the node and check whether it connected to the cluster successfully again; see the sketch below.
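
In outline, such a test could look like the following (the abstract helpers are hypothetical test-harness methods, not real Copycat APIs; only CopycatServer#open() is from the actual class):

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.junit.Test;
import io.atomix.copycat.server.CopycatServer; // adjust to your Copycat version

// Outline only: startCluster, anyNonLeader, kill and restart are hypothetical
// placeholders used to sketch the scenario.
public abstract class RejoinTest {
    abstract List<CopycatServer> startCluster(int size);
    abstract CopycatServer anyNonLeader(List<CopycatServer> cluster);
    abstract void kill(CopycatServer server);             // simulate kill -9
    abstract CopycatServer restart(CopycatServer server); // same address, empty memory

    @Test
    public void restartedFollowerRejoinsCluster() throws Exception {
        List<CopycatServer> cluster = startCluster(3);
        CopycatServer follower = anyNonLeader(cluster);

        kill(follower);
        CopycatServer restarted = restart(follower);

        // open() completes only once the server knows the leader, which is
        // exactly the condition this issue is about.
        restarted.open().get(30, TimeUnit.SECONDS);
    }
}
```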

I can see that this would not be that big a problem in a cluster where there's a lot of activity (= a lot of AppendRequests issued), as the node will reconnect on the first AppendRequest being sent. But there might be clusters without much activity, where the application (which uses Copycat) wants to wait for the local Copycat server to be initialized and then send some Commands to the cluster...

Perhaps the leader should append a NoOpEntry?

kuujo commented 8 years ago

The reason this will not happen is that, when server 2 starts and rejoins the cluster, the leader will indeed see that it's already a member. But because server 2 is a member, the leader will still send periodic AppendRequests to it. Even if no commands are committed, leaders must still send empty AppendRequests to maintain their leadership. These are sent every few hundred milliseconds (by default). So, after server 2 rejoins, it will set an election timeout; within that timeout, the leader will send an AppendRequest and server 2 will learn about the leader.
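
As a toy model of that timing (plain Java, not Copycat code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Toy model of why the rejoined server still learns the leader: even with no
// commands, the leader heartbeats with empty AppendRequests, and any
// AppendRequest tells the receiver who the leader is.
public class HeartbeatSketch {
    record AppendRequest(long term, String leader) {} // empty: no log entries

    static class RestartedServer {
        volatile String knownLeader; // null right after the restart

        void handleAppend(AppendRequest request) {
            knownLeader = request.leader(); // leader learned from the heartbeat
        }
    }

    public static void main(String[] args) throws InterruptedException {
        RestartedServer server2 = new RestartedServer();
        ScheduledExecutorService leader = Executors.newSingleThreadScheduledExecutor();

        // Leader heartbeats every 150 ms; the follower's election timeout
        // (say 500 ms) is deliberately longer, so a heartbeat always arrives
        // before the follower would start its own election.
        leader.scheduleAtFixedRate(
            () -> server2.handleAppend(new AppendRequest(3, "server-1")),
            0, 150, TimeUnit.MILLISECONDS);

        Thread.sleep(500);
        System.out.println("server 2 learned leader: " + server2.knownLeader);
        leader.shutdownNow();
    }
}
```

Because the heartbeat interval is, by design, well below the election timeout, a freshly rejoined follower hears from the leader before it would ever start an election of its own.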

We could also add the leader to JoinResponse so restarting servers learn about the leader more quickly.
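
Sketched with hypothetical fields (not the actual protocol types), that could look like:

```java
// Hypothetical shape, just to illustrate the idea: if JoinResponse carried
// the current leader, the rejoining server could record it immediately
// instead of waiting for the first heartbeat.
class JoinLeaderHintSketch {
    record JoinResponse(boolean ok, String leader, long term) {}

    volatile String leader; // null until known; open() would complete once set

    void handleJoinResponse(JoinResponse response) {
        if (response.ok() && response.leader() != null) {
            // Learn the leader straight from the join round-trip.
            leader = response.leader();
        }
    }
}
```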

BTW as for testing, this is all tested very thoroughly in real networks with Jepsen, which arbitrarily partitions and crashes servers while submitting commands to the cluster. Jepsen tests force a number of different partition and failure scenarios that ultimately force many leader changes, require servers to converge on the correct leader, and then validate that the resulting state is linearizable.

bgloeckle commented 8 years ago

Ah, thanks for the explanation. I'm quite sure I saw this in my local diqube setup yesterday: node 2 did not learn about the leader. But to be honest, I'm not 100% sure anymore about the keep-alives and what Duration I had set for those. And after I debugged and found that source location, I was pretty sure I had found a bug. Sorry, my mistake :/