cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.97k stars 3.79k forks source link

storage: send MsgAppResp to replica co-located with client #325

Closed tbg closed 8 years ago

tbg commented 9 years ago

@spencerkimball mentioned this in #230:

Because a major focus for all leasing strategies is to reduce latency in the wide area, we implemented as part of our baseline a latency optimization described by Castro [5]. This optimization reduces the commit latency perceived by clients that are co-located with a replica other than the Multi-Paxos stable leader, by having other replicas transmit their AcceptReplies to both the stable leader and to the replica near the client. Thus, in the common case, the client does not need to wait the additional time for a message to come back from the stable leader, which reduces commit time from four one-way delays to three.

The paper is Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems, 20(4):398–461, November 2002.

This is certainly something that would be useful.

bdarnell commented 9 years ago

This seems risky when translated from Paxos to Raft. If a leader broadcasts MsgApp and then dies, but the MsgAppResps are sent to the other followers, then those followers may consider the entry committed even though it was not committed by the leader of that term. This may turn out to be safe (if a quorum have the new log entry then no node without it can be elected), but it's very subtle.

tbg commented 9 years ago

I agree that it's subtle. Yet, as you say, if an entry is acknowledged by a majority of the nodes (within the same term - otherwise see raft.pdf, Figure 8), then, no matter what the leader does, that entry is guaranteed to be committed as nobody can be elected leader without it. So it should be sufficient to do the following:

Does that sound sound?

spencerkimball commented 9 years ago

It should be "once the counter is equal to the quorum size, commit the entry locally". Unless I'm missing something.

tbg commented 9 years ago

Almost. Changed to "equals or exceeds". Since we're incrementing by two in one of the cases, we may not exactly hit the quorum.

bdarnell commented 9 years ago

This seems sound, although figuring out which entries are covered by another node's MsgAppResp is non-trivial.

tbg commented 9 years ago

I hadn't thought about that. The sending node is at the discretion to put whatever they want in that message, so in the worst case they could hash the data of the log entry (or whatever other unique identifier we can come up with) and send that along as an Entry when CC'ing.

Also maybe the CC field should sit on the Entry instead of the MsgApp because MsgApp may contain multiple entries with different CCs and that should trigger a CC message to each CC'ed host with Entries (containing unique identifiers) for each CCed entry.

Furthermore, the CC'ed MsgAppResp shouldn't look exactly like the original MsgAppResp as there's a chance the recipient CC node will have turned into a leader in the meantime and doesn't know this isn't a real MsgAppResp to a MsgApp which the new leader may have sent (and while that shouldn't hinder correctness, it needs to be taken into account to avoid unwanted side effects). We could special-case uint64(0) to be used as a dummy logIndex, or Reject should be true, with the index of the original MsgApp as RejectHint. But it is simpler, safer and more descriptive to invent a new type MsgCC for exactly that purpose, so my suggestion is

bdarnell commented 9 years ago

If the CC'd node has become a leader, then the term number will have increased, so the old MsgAppResp will be rejected on that basis. We don't need to make the CC messages look distinct for safety purposes, although doing so to pass along extra information might help.

This looks like a lot of subtle work for a small gain, though. It only applies in cases where the client is talking to a follower which is proxying to the leader, and per discussion in #326 we want to minimize that behavior (if we do it at all).

tbg commented 9 years ago

Yes, if we don't want that, let's not waste time with it. It does look pretty straightforward now though. CC @xiang90 in case someone wants it for upstream raft.

xiang90 commented 9 years ago

@tschottdorf I will take a close look tonight.

xiang90 commented 9 years ago

@tschottdorf Quick reply: I think it is doable and we could try to make this happen in raft. I am not sure if the gain worth it though. Unless you really care about the committing latency when proxying.

@bdarnell One observation from my pervious raft/paxos/epaxos performance tuning experience: proxy is actually good in some cases, since it help with handling client request and encoding it to a []byte. It reduce CPU load on one instance a lot. But for multi-raft, probably not. A node will be both follower and leader. The roles will balance this.

tbg commented 9 years ago

@xiang90 don't sweat it, I just wanted you to be aware of the discussion we had about it in case you wanted something like that in etcd/raft.

tbg commented 8 years ago

Doesn't seem that this is anywhere on the roadmap, and if so, it would probably have to be in etcd/raft anyway. Plus, our use case for this isn't really there given that we "always" propose at the leader (with few exceptions).