Cluster membership changes

keathley commented 6 years ago

We need to support adding and removing peers to the raft cluster. While this is described briefly in section 6 of the raft paper I feel like the explanation is underspecified specifically with regards to the rejection of request vote rpcs and the explanation of "catching up" new peers before initiating the joint consensus.

I'm reading through the raft mailing list, and other resources that I'll link here in order to get a better feel for potential improvements to the solution. Right now my intuition is that we should look at using the AddServer and RemoveServer RPCs from the "ongaro thesis" which I've linked below.

Research / Links

bitwalker commented 6 years ago

I'm certainly interested in working on this, so I'll set aside some time to read the paper again and catch up on some of the other links you mentioned here. My thought would be to review how etcd or Consul do it, and borrow their approach, since it is relatively battle-tested, rather than implementing our own interpretation of it (assuming we can't find a concrete "this is exactly how you do it" process).

keathley commented 6 years ago

Most real implementations use the "pre-vote rpc" along with the AddServer and RemoveServer Rpcs. Both are defined in that "modifications for Raft consensus" document. Based on my research thats what etcd does as well. I'm not sure hashicorps raft has incorporated these changes at this point.

keathley commented 6 years ago

So after reading through all this I think it makes the most sense to use the explicit steps for initializing the cluster, adding peers, and removing peers. I think we can use something like:

@type peer :: module() | {module(), node()}

@spec initialize_cluster(peer()) :: :ok | {:error, term()}
@spec add_peer(peer(), peer()) :: :ok | {:error, {:redirect, peer()}} | {:error, :not_leader}
@spec remove_peer(peer(), peer()) :: :ok | {:error, {:redirect, peer()}} | {:error, :not_leader}

I think I have a pretty good handle on how all of this works based on the thesis paper so I'll try to lay that out here when I get a chance.

bitwalker commented 6 years ago

So my take on this is that while an explicit API is nice to have (particularly for testing), it is possible to make this automatic using node monitoring (where we effectively get the equivalent of add_peer and remove_peer events). This is how Swarm handles automatically forming a cluster, although it does support a way to blacklist/whitelist nodes so that only those you want participating in the cluster are able to.

Reading through the paper and it's section on cluster membership changes, I didn't see anything that indicated this would be a problem, but I haven't dug into the internals of this library yet, so I don't know if there is a constraint based on the implementation.

I spent some time yesterday putting together a distributed process registry based on raft, and by far the thing that stood out to me was that initializing the cluster was a manual process, and there isn't a great place to put that in our own applications. It would be preferable if the implementation of the state machine server was working to form a cluster on it's own. This behaviour would need to be configurable, because there are obviously times where you may want to manually manage the configuration (testing again is an example), but I think for a library like this to be successful, it needs to be easy to use by default, and offer escape hatches for things which may need tweaking in some situations.

Thoughts?

keathley commented 6 years ago

I totally agree that we need to make cluster membership changes automatic. I can work on implementing the underlying functions such as initialize_cluster, add_peer, etc. so that we can leverage them from a higher level abstraction that can automatically call those based on node monitoring.

tzumby commented 5 years ago

Hey guys, this may be a dumb question but wouldn't it be possible to just enumerate Node.list() and as nodes join the cluster have the already existing AppendEntries RPC get them to catch up ? I could see this being a problem if we have lots of nodes but from what I gather that's not the case with Raft clusters.

elixir-toniq / raft

Cluster membership changes #4

Research / Links