Closed: tbg closed this issue 8 years ago
This seems risky when translated from Paxos to Raft. If a leader broadcasts MsgApp and then dies, but the MsgAppResps are sent to the other followers, then those followers may consider the entry committed even though it was not committed by the leader of that term. This may turn out to be safe (if a quorum have the new log entry then no node without it can be elected), but it's very subtle.
I agree that it's subtle. Yet, as you say, if an entry is acknowledged by a majority of the nodes (within the same term - otherwise see raft.pdf, Figure 8), then, no matter what the leader does, that entry is guaranteed to be committed as nobody can be elected leader without it. So it should be sufficient to do the following:

- add a `CC` (Carbon Copy) field to the `MsgApp` message
- when the leader broadcasts an entry that was proposed through a follower, it sets `CC` to that follower for the corresponding `MsgApp` messages
- when a follower receives a `MsgApp` with `CC` set, it sends the `MsgAppResp` to the `CC` replica as well
- the CC'ed replica counts the `MsgAppResp`s with identical term for that entry (in practice, just give up when there is more than one term; also you probably don't want to prepare to actually count, but just start counting when the first `MsgAppResp` shows up)
- when the CC'ed replica sends its own `MsgAppResp` for that entry, and the term matches, increase the counter by two (to account for the fact that the co-located replica and the leader clearly have this entry)
- once the counter equals or exceeds the quorum size, commit the entry locally

Does that sound sound?
It should be "once the counter is equal to the quorum size, commit the entry locally". Unless I'm missing something.
Almost. Changed to "equals or exceeds". Since we're incrementing by two in one of the cases, we may not exactly hit the quorum.
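The counting scheme above could be sketched roughly as follows. This is a hypothetical illustration only; none of these names (`ccTracker`, `entryID`, the handler methods) exist in etcd/raft:

```go
package main

import "fmt"

// entryID identifies a log entry by index and term; identical terms are
// required for the count to be meaningful.
type entryID struct {
	Index, Term uint64
}

// ccTracker lives on a CC'ed replica and counts acknowledgements for
// entries whose MsgAppResps are being carbon-copied to it.
type ccTracker struct {
	quorum int             // quorum size of the raft group
	acks   map[entryID]int // per-entry ack counter
}

// onCCResp handles a carbon-copied MsgAppResp for (index, term) from some
// other follower. It reports whether the entry may now be committed
// locally, i.e. the counter equals or exceeds the quorum.
func (t *ccTracker) onCCResp(index, term uint64) bool {
	id := entryID{index, term}
	t.acks[id]++
	return t.acks[id] >= t.quorum
}

// onOwnResp handles the case where this replica itself acknowledges the
// entry: the counter jumps by two, accounting for the co-located replica
// and the leader.
func (t *ccTracker) onOwnResp(index, term uint64) bool {
	id := entryID{index, term}
	t.acks[id] += 2
	return t.acks[id] >= t.quorum
}

func main() {
	// Five-node group: quorum is 3.
	t := &ccTracker{quorum: 3, acks: map[entryID]int{}}
	fmt.Println(t.onOwnResp(7, 2)) // 2 acks so far: prints false
	fmt.Println(t.onCCResp(7, 2))  // 3 acks, equals quorum: prints true
}
```

The increment-by-two is why the check has to be "equals or exceeds" rather than strict equality: the counter can skip over the exact quorum value.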
This seems sound, although figuring out which entries are covered by another node's `MsgAppResp` is non-trivial.
I hadn't thought about that. The sending node is free to put whatever it wants in that message, so in the worst case it could hash the data of the log entry (or whatever other unique identifier we can come up with) and send that along as an `Entry` when CC'ing.
Also, maybe the `CC` field should sit on the `Entry` instead of the `MsgApp`, because a `MsgApp` may contain multiple entries with different `CC`s, and that should trigger a CC message to each CC'ed host with `Entries` (containing unique identifiers) for each CC'ed entry.
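To illustrate the per-entry placement, a sketch of the two message shapes (these struct definitions are made up for illustration; etcd/raft's actual types are protobuf-generated and have no `CC` field):

```go
package main

import "fmt"

// Entry carries a hypothetical per-entry CC field: the node ID of the
// follower that forwarded this proposal, or 0 for no carbon copy.
type Entry struct {
	Term  uint64
	Index uint64
	Data  []byte
	CC    uint64
}

// MsgApp batches several entries, so a per-message CC field could not
// express distinct CC targets for the entries it carries.
type MsgApp struct {
	From    uint64
	To      uint64
	Entries []Entry
}

func main() {
	// A single MsgApp carrying entries CC'ed to two different hosts.
	m := MsgApp{
		From: 1, To: 2,
		Entries: []Entry{
			{Term: 2, Index: 7, CC: 3}, // proposed via follower 3
			{Term: 2, Index: 8, CC: 4}, // proposed via follower 4
		},
	}
	fmt.Println(m.Entries[0].CC, m.Entries[1].CC) // prints: 3 4
}
```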
Furthermore, the CC'ed `MsgAppResp` shouldn't look exactly like the original `MsgAppResp`, as there's a chance the recipient CC node will have turned into a leader in the meantime and doesn't know this isn't a real `MsgAppResp` to a `MsgApp` which the new leader may have sent (and while that shouldn't hinder correctness, it needs to be taken into account to avoid unwanted side effects). We could special-case `uint64(0)` to be used as a dummy logIndex, or `Reject` could be `true` with the index of the original `MsgApp` as `RejectHint`. But it is simpler, safer and more descriptive to invent a new type `MsgCC` for exactly that purpose, so my suggestion is:
- add a new message type `MsgCC` (or `MsgAppRespCC`, whichever seems clearer)
- add a `CC` field to the `Entry` type
- when a node sends a `MsgAppResp`, go through the entries in it and compile a `MsgCC` message for each node in the set of all `Entry.CC` fields in the entries
- `MsgCC.Entries` should contain, for each original `Entry` that is being CC'ed to that member, an `Entry` that uniquely describes the original one and allows the receiving node to commit that entry locally when it has collected a quorum of identical descriptors (and also has a gapless history of previous indexes so that the log completeness property holds): its `Index` and `Term`.
I think those two should be enough (if a majority of servers accept a log entry with the same index and same term, then that entry is committed and hence equal). Otherwise, we can always add a hash of the original entry's `Data`, or something along those lines, in the `Data` field.
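Putting the compilation step together, here is one way a node might build the `MsgCC` messages from the entries it just acknowledged. All names are hypothetical (etcd/raft has no `MsgCC` type); the descriptors carry only `Index` and `Term`, per the discussion above:

```go
package main

import "fmt"

// Entry is a hypothetical entry with a per-entry CC target (0 = none).
type Entry struct {
	Term  uint64
	Index uint64
	CC    uint64
}

// MsgCC carries descriptor entries (Index and Term only) to one CC target.
type MsgCC struct {
	To      uint64
	Entries []Entry
}

// compileCCs groups the acknowledged entries by their CC target, stripping
// each entry down to its (Index, Term) descriptor.
func compileCCs(acked []Entry) []MsgCC {
	byTarget := map[uint64][]Entry{}
	var order []uint64 // remember first-seen order for deterministic output
	for _, e := range acked {
		if e.CC == 0 {
			continue // entry was not CC'ed to anyone
		}
		if _, ok := byTarget[e.CC]; !ok {
			order = append(order, e.CC)
		}
		byTarget[e.CC] = append(byTarget[e.CC], Entry{Term: e.Term, Index: e.Index})
	}
	msgs := make([]MsgCC, 0, len(order))
	for _, to := range order {
		msgs = append(msgs, MsgCC{To: to, Entries: byTarget[to]})
	}
	return msgs
}

func main() {
	msgs := compileCCs([]Entry{
		{Term: 2, Index: 7, CC: 3},
		{Term: 2, Index: 8, CC: 4},
		{Term: 2, Index: 9, CC: 3},
		{Term: 2, Index: 10}, // not CC'ed
	})
	// Two targets; node 3 receives two descriptors.
	fmt.Println(len(msgs), len(msgs[0].Entries)) // prints: 2 2
}
```

The receiving node would then count identical descriptors per (Index, Term) and commit locally once a quorum is reached and its log has no gaps below that index.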
If the CC'd node has become a leader, then the term number will have increased, so the old MsgAppResp will be rejected on that basis. We don't need to make the CC messages look distinct for safety purposes, although doing so to pass along extra information might help.
This looks like a lot of subtle work for a small gain, though. It only applies in cases where the client is talking to a follower which is proxying to the leader, and per discussion in #326 we want to minimize that behavior (if we do it at all).
Yes, if we don't want that, let's not waste time with it. It does look pretty straightforward now though. CC @xiang90 in case someone wants it for upstream raft.
@tschottdorf I will take a close look tonight.
@tschottdorf Quick reply: I think it is doable and we could try to make this happen in raft. I am not sure the gain is worth it, though, unless you really care about commit latency when proxying.
@bdarnell One observation from my previous raft/paxos/epaxos performance tuning experience: proxying is actually good in some cases, since it helps with handling client requests and encoding them to a `[]byte`; it reduces CPU load on one instance a lot. But for multi-raft, probably not: a node will be both a follower and a leader, and the roles will balance this out.
@xiang90 don't sweat it, I just wanted you to be aware of the discussion we had about it in case you wanted something like that in etcd/raft.
Doesn't seem that this is anywhere on the roadmap, and if it were, it would probably have to be in etcd/raft anyway. Plus, our use case for this isn't really there, given that we "always" propose at the leader (with few exceptions).
@spencerkimball mentioned this in #230:
The paper is Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems, 20(4):398–461, November 2002.
This is certainly something that would be useful.