storage: send MsgAppResp to replica co-located with client

tbg commented 9 years ago

@spencerkimball mentioned this in #230:

Because a major focus for all leasing strategies is to reduce latency in the wide area, we implemented as part of our baseline a latency optimization described by Castro [5]. This optimization reduces the commit latency perceived by clients that are co-located with a replica other than the Multi-Paxos stable leader, by having other replicas transmit their AcceptReplies to both the stable leader and to the replica near the client. Thus, in the common case, the client does not need to wait the additional time for a message to come back from the stable leader, which reduces commit time from four one-way delays to three.

The paper is Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems, 20(4):398–461, November 2002.

This is certainly something that would be useful.

bdarnell commented 9 years ago

This seems risky when translated from Paxos to Raft. If a leader broadcasts MsgApp and then dies, but the MsgAppResps are sent to the other followers, then those followers may consider the entry committed even though it was not committed by the leader of that term. This may turn out to be safe (if a quorum have the new log entry then no node without it can be elected), but it's very subtle.

tbg commented 9 years ago

I agree that it's subtle. Yet, as you say, if an entry is acknowledged by a majority of the nodes (within the same term - otherwise see raft.pdf, Figure 8), then, no matter what the leader does, that entry is guaranteed to be committed as nobody can be elected leader without it. So it should be sufficient to do the following:

add an optional CC (Carbon Copy) field in the MsgApp message.
as a leader, upon receiving a request from follower, note that follower in CC for the corresponding MsgApp messages.
upon replying positively to a MsgApp with CC set, send the MsgAppResp to the CC replica as well.
at the colocated replica:
- upon redirecting a command to the leader, get ready to receive and count MsgAppResps with identical term for that entry (in practice, just give up when there is more than one term; also you probably don't want to prepare to actually count, but just start counting when the first MsgAppResp shows up).
- check the replica's outgoing messages: If it is sending a MsgAppResp for that entry, and the term matches, increase the counter by two (to account for the fact that the co-located replica and the leader clearly have this entry).
- once the counter equals or exceeds the quorum size, commit the entry locally.
- once the entry is committed locally or overwritten, stop counting.

Does that sound sound?

spencerkimball commented 9 years ago

It should be "once the counter is equal to the quorum size, commit the entry locally". Unless I'm missing something.

tbg commented 9 years ago

Almost. Changed to "equals or exceeds". Since we're incrementing by two in one of the cases, we may not exactly hit the quorum.

bdarnell commented 9 years ago

This seems sound, although figuring out which entries are covered by another node's MsgAppResp is non-trivial.

tbg commented 9 years ago

I hadn't thought about that. The sending node is at the discretion to put whatever they want in that message, so in the worst case they could hash the data of the log entry (or whatever other unique identifier we can come up with) and send that along as an Entry when CC'ing.

Also maybe the CC field should sit on the Entry instead of the MsgApp because MsgApp may contain multiple entries with different CCs and that should trigger a CC message to each CC'ed host with Entries (containing unique identifiers) for each CCed entry.

Furthermore, the CC'ed MsgAppResp shouldn't look exactly like the original MsgAppResp as there's a chance the recipient CC node will have turned into a leader in the meantime and doesn't know this isn't a real MsgAppResp to a MsgApp which the new leader may have sent (and while that shouldn't hinder correctness, it needs to be taken into account to avoid unwanted side effects). We could special-case uint64(0) to be used as a dummy logIndex, or Reject should be true, with the index of the original MsgApp as RejectHint. But it is simpler, safer and more descriptive to invent a new type MsgCC for exactly that purpose, so my suggestion is

adding a message type MsgCC (or MsgAppRespCC, whichever seems clearer)
moving the CC field to the Entry type
after sending a MsgAppResp, go through the entries in it and compile a MsgCC message for each node in the set of all Entry.CC fields in the entries
MsgCC.Entries should contain, for each original Entry that is being CC'ed to that member, an Entry that unique describes the original one and allows the receiving node to commit that entry locally when it has collected a quorum of identical descriptors (and also has a gapless history of previous indexes so that the log completeness property holds):
- Term
- Index
I think those two should be enough (if a majority of servers accept a log entry with same index and same term, then that is committed and hence equal). Otherwise, can always add a hash of the original entry's Data or something along those lines in the Data field.

bdarnell commented 9 years ago

If the CC'd node has become a leader, then the term number will have increased, so the old MsgAppResp will be rejected on that basis. We don't need to make the CC messages look distinct for safety purposes, although doing so to pass along extra information might help.

This looks like a lot of subtle work for a small gain, though. It only applies in cases where the client is talking to a follower which is proxying to the leader, and per discussion in #326 we want to minimize that behavior (if we do it at all).

tbg commented 9 years ago

Yes, if we don't want that, let's not waste time with it. It does look pretty straightforward now though. CC @xiang90 in case someone wants it for upstream raft.

xiang90 commented 9 years ago

@tschottdorf I will take a close look tonight.

xiang90 commented 9 years ago

@tschottdorf Quick reply: I think it is doable and we could try to make this happen in raft. I am not sure if the gain worth it though. Unless you really care about the committing latency when proxying.

@bdarnell One observation from my pervious raft/paxos/epaxos performance tuning experience: proxy is actually good in some cases, since it help with handling client request and encoding it to a []byte. It reduce CPU load on one instance a lot. But for multi-raft, probably not. A node will be both follower and leader. The roles will balance this.

tbg commented 9 years ago

@xiang90 don't sweat it, I just wanted you to be aware of the discussion we had about it in case you wanted something like that in etcd/raft.

tbg commented 8 years ago

Doesn't seem that this is anywhere on the roadmap, and if so, it would probably have to be in etcd/raft anyway. Plus, our use case for this isn't really there given that we "always" propose at the leader (with few exceptions).

cockroachdb / cockroach

storage: send MsgAppResp to replica co-located with client #325