NicolasT / kontiki

An implementation of the Raft consensus protocol
BSD 3-Clause "New" or "Revised" License

Implement cluster membership change support #3

Open NicolasT opened 10 years ago

NicolasT commented 10 years ago

It'd be useful to implement the 'correct' protocol to change cluster membership/topologies, as discussed (IIRC) in the Raft paper.

drchaos commented 10 years ago

Hi, I'm really interested in this feature and ready to implement it. Could you please help me with an implementation road map? I have already studied the Raft paper and the LogCabin membership change implementation, so I feel well prepared to implement it in kontiki. :)

ongardie commented 10 years ago

Hey, the notification for this caught my eye. I think you'll both be interested in the newly revised cluster membership changes chapter in my thesis: https://ramcloud.stanford.edu/~ongaro/thesis.pdf . That chapter now describes a simplified form of membership changes that's restricted to single-server removals and additions. I'd recommend implementing that version instead of the full-blown joint consensus approach, unless your use case somehow requires arbitrary configuration changes.

drchaos commented 10 years ago

Thanks a lot, Diego. I will read your thesis.

NicolasT commented 10 years ago

@drchaos, there's no real 'roadmap' for Kontiki... If you'd like to take a stab at cluster membership support, please do! I have no design in mind for now, so do share your thoughts.

sseveran commented 10 years ago

I am also working on single-node membership changes. I hope to have something to show by the end of the weekend. I am also going to be adding the states needed to catch up the logs of nodes that are joining.

NicolasT commented 10 years ago

@sseveran That's some great news. Is some of your work or a design overview available already, so @drchaos could jump in?

drchaos commented 10 years ago

I would be glad to join @sseveran in his work.

sseveran commented 10 years ago

So first a little background on what I am working on. I built a non-replicated Chubby last year as a test and implemented a medium-sized application using it. The file/sequencer model works great, probably even better in Haskell than in C++. I made a monad transformer that made it super easy to deal with locks. My long-term goal is to support the full Chubby feature set as well as larger-than-memory data, using something like a heavily modified LevelDB or RocksDB, or maybe something else. I also want to support the ZooKeeper wire protocol. I have spoken with several people who would be very interested in something better for use at their companies. Early hackery (not my prototype) is at http://www.github.com/alphaHeavy/hinu

Several generalizations need to be made to kontiki:

  1. Config needs to be updatable
  2. runTransitionT needs to return an updated config
  3. MonadLog should support snapshotting and log replay. (This is if we want to put the maximum amount of functionality into kontiki as opposed to the applications that use it, which is probably preferable for correctness.) It might also be a single user-supplied function that gets called by kontiki.
  4. We need a new non-voting follower state. During this phase a new follower will receive the leader's snapshot as well as a stream of all applied updates. It will buffer the updates until the snapshot is complete and then apply all the buffered updates (see the sketch after this list).
  5. The node configuration will be stored at a special reserved key in the transaction log.
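
For point 4, here is a minimal sketch of the buffering behaviour, assuming nothing about kontiki's actual types; CatchUp, receiveEntry and finishSnapshot are made-up names for illustration only:

    -- Hypothetical types/functions; kontiki's real state machine differs.
    -- A joining node buffers live updates until the snapshot transfer is done,
    -- then drains the buffer in order.
    data CatchUp entry = CatchUp
      { cuSnapshotDone :: Bool
      , cuBuffer       :: [entry]  -- updates received while the snapshot is in flight
      } deriving (Show)

    -- Buffer entries while the snapshot is incomplete; apply them directly afterwards.
    receiveEntry :: Monad m => (entry -> m ()) -> CatchUp entry -> entry -> m (CatchUp entry)
    receiveEntry apply st e
      | cuSnapshotDone st = apply e >> return st
      | otherwise         = return st { cuBuffer = cuBuffer st ++ [e] }

    -- Called once the snapshot is fully installed: replay the buffer, then switch
    -- to applying entries as they arrive.
    finishSnapshot :: Monad m => (entry -> m ()) -> CatchUp entry -> m (CatchUp entry)
    finishSnapshot apply st = do
      mapM_ apply (cuBuffer st)
      return st { cuSnapshotDone = True, cuBuffer = [] }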

I envision a node join looking like the following (possible message types are sketched after the list):

  1. Send message to leader to begin join.
  2. Leader adds a record for the node in the NonVotingFollower state. Leader changes to some state to indicate that a node is joining. No further nodes may join in this state.
  3. Leader creates a snapshot of its data at its current index. Leader sends the current index to NonVotingFollower. Leader begins sending the snapshot to NonVotingFollower.
  4. Leader sends all AppendEntries to the NonVotingFollower.
  5. NonVotingFollower completes receiving snapshot. NonVotingFollower begins applying all new transactions to its log.
  6. When NonVotingFollower has no transactions in its buffer it sends a message to the leader to complete the join.
  7. NonVotingFollower continues to apply any received transactions without voting until it receives the AppendEntries message that marks it as in the config. This means that the NonVotingFollower needs to inspect the content of the messages for the reserved config key and verify that it is the node being added. Once it sees this, it will begin voting.
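
As a rough illustration of the handshake above, a possible set of wire messages could look like this; none of these names exist in kontiki today:

    import Data.ByteString (ByteString)

    type NodeId = String
    type Index  = Int

    -- Hypothetical messages for the join handshake sketched above.
    data JoinMessage
      = BeginJoin NodeId               -- step 1: candidate asks the leader to join
      | SnapshotChunk Index ByteString -- step 3: leader streams its snapshot
      | SnapshotComplete Index         -- end of the snapshot stream, taken at this index
      | CompleteJoin NodeId            -- step 6: follower has drained its buffer
      deriving (Eq, Show)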

A few more notes: We may want the network state to be parameterized so applications can supply their own types for addressing, e.g. IPv4 vs IPv6. The join process needs a timeout so that if a joining node dies the leader is not stuck in a joining state. The log copy should probably be a function the user passes in somewhere. I don't think we want the network state to be too tightly coupled to the core implementation. If the leader fails during the join process, the new node will need to start over from the beginning. It's not worth trying to be too fancy here.

I have thought a lot less about removal, but here are a few thoughts.

  1. There needs to be a way of marking a node as down. This should probably be a function the user passes in. In the UDP example it might be if the queue gets too big.
  2. We may want to have a maintenance mode so we can continue to buffer entries for a node for much longer, to accommodate things like OS upgrades. For now it's probably fine to just say that this is supported by removing a node, performing maintenance and then rejoining it.

ongardie commented 10 years ago

After a quick read, that sounds at least a little bit buggy. For example, suppose we have a leader of a 1-server cluster:

  1. node1 is leader.
  2. node1 catches up new node2.
  3. node1 adds a configuration including node2 to its log (at this point, it needs node2 to form a quorum).
  4. node1 sends AppendEntries to node2, but this message never makes it.
  5. node1 restarts.
  6. node1 needs node2's vote to become leader, but node2 won't vote because it's not in its current configuration.

So no server becomes leader ever again. Sad cluster.

I don't mean to take the fun out of this, but it might be a good idea to use the membership change algorithm described in my thesis as a baseline, then discuss well-motivated changes and extensions to better suit your needs.

sseveran commented 10 years ago

@ongardie Is the algorithm specified anywhere? Proof language is fine. I started by simply trying to reverse engineer the text. I don't want to do any novel work if I can avoid it.

NathanHowell commented 10 years ago

@sseveran it's a bit further back in the thread: https://ramcloud.stanford.edu/~ongaro/thesis.pdf

sseveran commented 10 years ago

So I missed a step. For the leader to commit the quorum change into the log, it needs to involve the new quorum, not the old quorum. So both the Followers and our NonVotingFollower would need to vote to commit it. The NonVotingFollower logic would be the same, except it would begin voting in the round where it sees the cluster configuration being modified.

A couple more thoughts on the boxes on page 54.

For number 2 in the SetConfiguration RPC, I am not sure a specified number of rounds is right, since our database may be arbitrarily large and the number of rounds could depend on server load. My thought was to have an adjustable timeout for this. For instance, we may wish to join a replica from a remote data center, which may have very different throughput for replicating the existing state.
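
For what it's worth, the deadline check itself could be as small as this sketch, where joinTimeout is an assumed application-level setting rather than anything kontiki provides:

    import Data.Time.Clock (UTCTime, NominalDiffTime, diffUTCTime)

    -- Abort (or restart) the join if the catch-up phase runs past an adjustable
    -- deadline, instead of counting a fixed number of rounds.
    joinExpired :: NominalDiffTime -> UTCTime -> UTCTime -> Bool
    joinExpired joinTimeout startedAt now = now `diffUTCTime` startedAt > joinTimeout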

My larger question about both those boxes is who calls them. I was anticipating that the node would call the leader when it starts up. I don't think I have missed anything that would make this impossible, but I may have.

@ongardie Also, don't worry about taking the fun out of it. I take correctness here very seriously, so if you have any more comments or thoughts, please share them.

ongardie commented 10 years ago

@sseveran Cool. You can definitely have servers automatically call in to add themselves to the cluster, it just depends on how you want the system to work. It's a bit more troubling if the system could automatically reduce its size when servers fail, though, since it would also be reducing its fault tolerance.

NicolasT commented 10 years ago

Some thoughts:

Re 'Config needs to be updatable': I think this whole cluster membership change can be implemented on top of the existing API. The existing API is polymorphic over the type of values for which consensus is reached, so the whole cluster management stuff could be implemented using a type like

data Entry a = ConfigurationChange Config | UserEntry a

Re 'runTransitionT needs to return an updated config': when using the system as depicted above, there's no need for this, and some library code which offers configuration update mechanisms can do this by interpreting the current output of runTransitionT. This 'library code' can most certainly be a module in kontiki, of course.
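
A standalone sketch of what such library code might do with committed entries, using the Entry type above and a placeholder Config (none of this is kontiki's actual API):

    type NodeId = String
    newtype Config = Config { configNodes :: [NodeId] }  -- placeholder definition

    data Entry a = ConfigurationChange Config | UserEntry a

    -- Fold over committed entries: pick up any configuration change along the
    -- way and hand the user entries to the application, so the core never needs
    -- to return an updated Config itself.
    applyCommitted :: Config -> [Entry a] -> (Config, [a])
    applyCommitted cfg []       = (cfg, [])
    applyCommitted cfg (e : es) = case e of
      ConfigurationChange cfg' -> applyCommitted cfg' es
      UserEntry x              -> let (cfg'', xs) = applyCommitted cfg es
                                  in (cfg'', x : xs)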

Re 'MonadLog should support snapshotting and log replay': I'd rather keep Kontiki abstracted from any actual log implementation, hence I don't see why snapshotting and replay should be part of it?

Re 'We need a new non-voting follower state': yeah, that's a great idea, similar to Paxos' learner nodes. We added a similar feature to Arakoon in order to increase reliability without incurring more latency impact.

Re 'The node configuration will be stored at a special reserved key in the transaction log': If the polymorphism of what goes into the log is used, this could live in the cluster-membership-as-a-library code as well, without any 'special key'. Actually, a log doesn't have any keys at all, does it?

Re 'We may want the network state to be parameterized so applications can supply their own types for addressing, e.g. IPv4 vs IPv6': I think any network state should be kept and managed by the application, not the library: there are tons of different use cases and setups which one can't foresee. You mention the IP version; another example (also from Arakoon) is multiple addresses per cluster node (only one of which, whichever is available, is used for message transport over TCP). IMHO a cluster node should only be specified by its unique identifier. How this is mapped to some network address is up to the application, and could (in the case of a key-value application) be kept under some specific key.
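
For example, purely hypothetical application-side code, nothing kontiki itself would ship:

    import           Data.Map.Strict (Map)
    import qualified Data.Map.Strict as Map

    newtype NodeId = NodeId String deriving (Eq, Ord, Show)

    -- The application decides how an identifier maps to one or more transport
    -- addresses; the consensus core only ever sees NodeIds.
    data Address = IPv4 String Int | IPv6 String Int
      deriving (Eq, Show)

    type AddressBook = Map NodeId [Address]

    resolve :: AddressBook -> NodeId -> [Address]
    resolve book nid = Map.findWithDefault [] nid book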

Re 'There needs to be a way of marking a node as down. This should probably be a function the user passes in. In the UDP example it might be if the queue gets too big.': I'm afraid I don't understand the rationale. Why would this be required? IMHO removing a node should be done on the application level, or (more likely) at some higher-level management layer.

Re 'We may want to have a maintenance mode so we can continue to buffer entries for a node for much longer to accommodate things like OS upgrades. For now it's probably fine to just say that this is supported by removing a node, performing maintenance and then rejoining it.': I'm not sure I follow here either... The system as-is should handle node outages, network splits, etc. gracefully, and an OS upgrade is as much a node outage as any other.

Basically, through experience with Arakoon, I think it's important Kontiki retains the following principles:

Note I didn't read @ongardie's thesis yet (sorry Diego, I should do that sometime soon!), so I'm not familiar with the config change system being discussed. I am familiar with the original 2-stage, 2-quorum mechanism, and designed Kontiki so that it remains possible, given that the type of _configNodes is changed to [NodeSet], isMajority is updated accordingly and a CUpdateNodeSet [NodeSet] Command is added.
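
Roughly, and assuming a NodeSet is simply a set of node identifiers (these are not kontiki's current definitions):

    import           Data.Set (Set)
    import qualified Data.Set as Set

    type NodeId  = String
    type NodeSet = Set NodeId

    -- With _configNodes :: [NodeSet], a quorum is required in every active node
    -- set, so during a change both the old and the new configuration must agree.
    isMajority :: [NodeSet] -> Set NodeId -> Bool
    isMajority nodeSets votes =
      all (\ns -> 2 * Set.size (votes `Set.intersection` ns) > Set.size ns) nodeSets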