Hi, I'm really interested in this feature and ready to implement it. Could you please help me with an implementation roadmap? I have already studied the Raft paper and the LogCabin membership change implementation. I feel ready to implement it in kontiki. :)
Hey, the notification for this caught my eye. I think you'll both be interested in the newly revised cluster membership changes chapter in my thesis: https://ramcloud.stanford.edu/~ongaro/thesis.pdf . That chapter now describes a simplified form of membership changes that's restricted to single-server removals and additions. I'd recommend implementing that version instead of the full-blown joint consensus approach, unless your use case somehow requires arbitrary configuration changes.
Thanks a lot, Diego. I will read your thesis.
@drchaos, there's no real 'roadmap' for Kontiki... If you'd like to take a stab at cluster membership support, please do! I have no design in mind for now, so do share your thoughts.
I am also working on single node member changes. I hope to have something to show by the end of the weekend. I am also going to be adding the states needed to catch up the logs of nodes that are joining.
@sseveran That's some great news. Is some of your work or a design overview available already, so @drchaos could jump in?
I would gladly join @sseveran in his work.
So first, a little background on what I am working on. I built a non-replicated Chubby last year as a test and implemented a medium-sized application using it. The file/sequencer model works great, probably even better in Haskell than in C++. I made a monad transformer that made it super easy to deal with locks. My long-term goal is to support the full Chubby feature set as well as larger-than-memory data, using something like a heavily modified LevelDB or RocksDB, or maybe something else. I also want to support the ZooKeeper wire protocol. I have spoken with several people who would be very interested in something better for use at their companies. Early hackery (not my prototype) is at http://www.github.com/alphaHeavy/hinu
Several generalizations need to be made to kontiki:

- Config needs to be updatable.
- runTransitionT needs to return an updated config.
- MonadLog should support snapshotting and log replay.
- We need a new non-voting follower state.
- The node configuration will be stored at a special reserved key in the transaction log.
I envision a node joining looking like the following:
The NonVotingFollower continues to apply any received transactions, without voting, until it receives the AppendEntries message that marks it as part of the config. This means that the NonVotingFollower needs to inspect the content of the messages for the reserved config key and verify that it is the node being added. Once it sees this, it will begin voting.
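A minimal Haskell sketch of that promotion rule, assuming a hypothetical `Role` type and an `onConfigEntry` hook (neither exists in kontiki today): the node stays non-voting until a configuration entry it receives names it as a member.

```haskell
import Data.ByteString (ByteString)
import qualified Data.Set as Set

type NodeId = ByteString

data Role = NonVotingFollower | Follower
    deriving (Eq, Show)

-- Called for every configuration entry observed in AppendEntries.
onConfigEntry :: NodeId          -- our own identifier
              -> Set.Set NodeId  -- members named by the configuration entry
              -> Role            -- current role
              -> Role
onConfigEntry self members NonVotingFollower
    | self `Set.member` members = Follower   -- we are in the config: start voting
onConfigEntry _ _ role = role                -- otherwise keep the current role
```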
A few more notes: We may want the network state to be parameterized so applications can supply their own types for addressing, e.g. IPv4 vs IPv6. The join process needs a timeout so that if a joining node dies, the leader is not stuck in a joining state. The log copy should probably be a function the user passes in somewhere. I don't think we want the network state to be too tightly coupled to the core implementation. If the leader fails during the join process, the new node will need to start over from the beginning. It's not worth trying to be too fancy here.
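One possible way to surface those knobs, purely as a sketch (the `JoinSettings` record and its fields are made-up names, not kontiki API): the application supplies the addressing type, the join timeout and the bulk log-copy action.

```haskell
import Data.Time.Clock (NominalDiffTime)

-- All names here are assumptions for illustration only.
data JoinSettings addr m = JoinSettings
    { joinTimeout :: NominalDiffTime  -- give up if the joining node dies mid-join
    , copyLog     :: addr -> m ()     -- user-supplied bulk log transfer to `addr`
    }
```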
I have thought a lot less about the removal, but here are a few thoughts:

- There needs to be a way of marking a node as down. This should probably be a function the user passes in. In the UDP example it might be if the queue gets too big.
- We may want to have a maintenance mode so we can continue to buffer entries for a node for much longer to accommodate things like OS upgrades. For now it's probably fine to just say that is supported by removing a node, performing maintenance and then rejoining it.
After a quick read, that sounds at least a little bit buggy. For example, suppose we have a leader of a 1-server cluster:

- node1 is leader
- node1 catches up new node2
- node1 adds configuration including node2 to its log (at this point, it needs node2 to form a quorum)
- node1 sends AppendEntries to node2, but this message never makes it
- node1 restarts
- node1 needs node2's vote to become leader, but node2 won't vote because it's not in its current configuration

So no server becomes leader ever again. Sad cluster.
I don't mean to take the fun out of this, but it might be a good idea to use the membership change algorithm described in my thesis as a baseline, then discuss well-motivated changes and extensions to better suit your needs.
@ongardie Is the algorithm specified anywhere? Proof language is fine. I started by simply trying to reverse engineer the text. I don't want to do any novel work if I can avoid it.
@sseveran it's a bit further back in the thread: https://ramcloud.stanford.edu/~ongaro/thesis.pdf
So I missed a step: for the leader to commit the quorum change into the log, it needs to involve the new quorum, not the old quorum. So both the Followers and our NonVotingFollower would need to vote to commit it. The NonVotingFollower logic would be the same, except it would begin voting in the round where it sees the cluster configuration being modified.
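In other words, once the configuration entry is in the leader's log, acknowledgements are counted against the new member set. A rough sketch of that counting rule, with assumed names (`quorumReached` is not an existing kontiki function):

```haskell
import Data.ByteString (ByteString)
import qualified Data.Set as Set

type NodeId = ByteString

-- The configuration entry commits only once a majority of the *new*
-- member set (which includes the NonVotingFollower being added) has
-- acknowledged it.
quorumReached :: Set.Set NodeId  -- members named by the pending config entry
              -> Set.Set NodeId  -- nodes that acknowledged the entry
              -> Bool
quorumReached newMembers acks =
    2 * Set.size (acks `Set.intersection` newMembers) > Set.size newMembers
```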
A couple more thoughts on the boxes on page 54.
For number 2 in the SetConfiguration RPC, I am not sure a specified number of rounds is right, since our database may be arbitrarily large. The number of rounds could also depend on server load. My thought was to have an adjustable timeout for this. For instance, we may wish to join a replica from a remote data center, which may have very different throughput for replicating the existing state.
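A sketch of what a timeout-driven catch-up loop could look like (`catchUp`, `replicateOnce` and the lag measure are placeholders, not part of kontiki or of the thesis algorithm):

```haskell
import Data.Time.Clock (NominalDiffTime, addUTCTime, getCurrentTime)

-- Hypothetical catch-up driver: instead of a fixed number of rounds,
-- keep shipping log entries to the joining node until it is close
-- enough, or until an adjustable deadline expires.
catchUp :: NominalDiffTime  -- adjustable timeout; a remote data center may need a larger one
        -> Int              -- acceptable remaining lag, in log entries
        -> IO Int           -- replicate one batch, returning the remaining lag
        -> IO Bool          -- True when the node may be promoted to voting member
catchUp timeout maxLag replicateOnce = do
    deadline <- addUTCTime timeout <$> getCurrentTime
    let loop = do
            lag <- replicateOnce
            now <- getCurrentTime
            if lag <= maxLag
                then pure True
                else if now >= deadline
                    then pure False
                    else loop
    loop
```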
My larger question about both those boxes is who calls them. I was anticipating that the node would call the leader when it starts up. I don't think I have missed anything that would make this impossible, but I may have.
@ongardie Also, don't worry about taking the fun out of this. I take correctness here very seriously, so if you have any more comments or thoughts, please share them.
@sseveran Cool. You can definitely have servers automatically call in to add themselves to the cluster, it just depends on how you want the system to work. It's a bit more troubling if the system could automatically reduce its size when servers fail, though, since it would also be reducing its fault tolerance.
Some thoughts:
Re 'Config needs to be updatable': I think this whole cluster membership change can be implemented on top of the existing API. The existing API is polymorphic over the type of values for which consensus is reached, so the whole cluster management machinery could be implemented using a type like
data Entry a = ConfigurationChange Config | UserEntry a
Re 'runTransitionT needs to return an updated config': when using the system as depicted above, there's no need for this, and some library code which offers configuration update mechanisms can do this by interpreting the current output of `runTransitionT`. This 'library code' can most certainly be a module in kontiki, of course.
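A rough sketch of what that library layer could look like, assuming the application's value type is wrapped as suggested (the `Config` type and `latestConfig` helper are invented for illustration, not existing kontiki code):

```haskell
import Data.ByteString (ByteString)
import qualified Data.Set as Set

type NodeId = ByteString

-- Invented for illustration: the cluster configuration is just a set of
-- node identifiers.
newtype Config = Config { configNodes :: Set.Set NodeId }

-- The entry type kontiki reaches consensus on stays polymorphic; the
-- membership layer wraps the application's own entry type.
data Entry a
    = ConfigurationChange Config  -- handled by the membership layer
    | UserEntry a                 -- handed through to the application

-- Library code can track the current configuration by folding over the
-- entries it sees being logged/committed, e.g. while interpreting the
-- commands returned by runTransitionT.
latestConfig :: Config -> [Entry a] -> Config
latestConfig = foldl step
  where
    step _   (ConfigurationChange c) = c
    step acc (UserEntry _)           = acc
```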
Re 'MonadLog should support snapshotting and log replay': I'd rather keep Kontiki abstract over any actual log implementation, hence I don't see why snapshotting and replay should be part of it.
Re 'We need a new non-voting follower state': yeah, that's a great idea, similar to Paxos' learner nodes. We added a similar feature to Arakoon in order to increase reliability without incurring more latency impact.
Re 'The node configuration will be stored at a special reserved key in the transaction log': If the polymorphism of what goes in the log is used, this could be in the cluster-membership-as-a-library code as well, without any 'special key'. Actually, a log doesn't have any keys at all, does it?
Re 'We may want the network state to be parameterized so applications can supply their own types for addressing, e.g. IPv4 vs IPv6': I think any network state should be kept and managed by the application, not the library -> there are tons of different use-cases and setups which one can't foresee. You mention the IP version; another example (also from Arakoon) is multiple addresses per cluster node (where only one, whichever is available, is used for message transport when using TCP), ... IMHO a cluster node should be identified only by its unique identifier. How this is mapped to some network address is up to the application, and could (in case of a key-value application) be kept using some specific key.
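For example (entirely application-side; the names below are illustrative and nothing here would live in kontiki): keep a map from the opaque node identifier to whatever addressing the deployment needs.

```haskell
import Data.ByteString (ByteString)
import qualified Data.Map.Strict as Map

-- The library only ever sees opaque NodeIds; how a NodeId maps to one or
-- more transport addresses (IPv4, IPv6, several per node, ...) is the
-- application's business.
type NodeId = ByteString

data Address
    = TcpV4 String Int   -- host, port
    | TcpV6 String Int
    deriving (Show)

type AddressBook = Map.Map NodeId [Address]

-- Pick whichever address is usable; here we simply take the first one.
resolve :: AddressBook -> NodeId -> Maybe Address
resolve book node = Map.lookup node book >>= safeHead
  where
    safeHead []      = Nothing
    safeHead (x : _) = Just x
```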
Re 'There needs to be a way of marking a node as down. This should probably be a function the user passes in. In the UDP example it might be if the queue gets too big.': I'm afraid I don't understand the rationale. Why would this be required? IMHO removing a node should be done on the application level, or (more likely) at some higher-level management layer.
Re 'We may want to have a maintenance mode so we can continue to buffer entries for a node for much longer to accommodate things like OS upgrades. For now it's probably fine to just say that is supported by removing a node, performing maintenance and then rejoining it.': I'm not sure I follow here either... The system as-is should handle node outages/network splits/... gracefully, and an OS upgrade is as much a node outage as any other.
Basically, through experience with Arakoon, I think it's important Kontiki retains the following principles:
- Side-effects are expressed as `Command`s which upon interpretation influence the log (e.g. `CTruncateLog`, `CLogEntries`, ...). The same can be used for snapshotting etc. (if applicable, of which I'm not convinced yet, see above).
- Node addressing stays opaque (`NodeId` being `ByteString` and nothing more).

Note I didn't read @ongardie's thesis (sorry Diego, should do that sometime soon!), so I'm not familiar with the config change system being discussed yet. I'm familiar with the original 2-stage, 2-quorum mechanism, and designed Kontiki in order for that to be possible, given the type of `_configNodes` is changed to `[NodeSet]`, `isMajority` is updated accordingly and a `CUpdateNodeSet [NodeSet]` `Command` is added.
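A minimal sketch of that `isMajority` change, assuming `NodeSet` is simply a set of `NodeId`s: with a single-element list this degenerates to the current single-configuration behaviour, and with two elements it is the joint-consensus rule from the original Raft paper.

```haskell
import Data.ByteString (ByteString)
import qualified Data.Set as Set

type NodeId  = ByteString
type NodeSet = Set.Set NodeId

-- A vote/acknowledgement set only counts as a majority when it is a
-- majority in *every* active node set. A CUpdateNodeSet [NodeSet]
-- Command would then swap the active list when a config entry is applied.
isMajority :: [NodeSet] -> Set.Set NodeId -> Bool
isMajority nodeSets votes = all majorityIn nodeSets
  where
    majorityIn s = 2 * Set.size (votes `Set.intersection` s) > Set.size s
```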
It'd be useful to implement the 'correct' protocol to change cluster membership/topologies, as discussed (IIRC) in the Raft paper.