Have you looked at http://www.pdl.cmu.edu/PDL-FTP/associated/CMU-PDL-14-105.pdf?
Wow, this is excellent. I've been lying awake at night worrying about the shortcomings of leader leases as I'd imagined they had to work.
@xiang90 I hadn't seen that; thanks for the reference. It looks like a great fit for our needs.
(A reminder to whoever replied by email that your name doesn't come through on github when you do that, so you may want to come to github to post directly and be identified)
Sorry, that was me.
Are there any plans to implement the performance optimization on writes mentioned in the paper?
Because a major focus for all leasing strategies is to reduce latency in the wide area, we implemented as part of our baseline a latency optimization described by Castro [5]. This optimization reduces the commit latency perceived by clients that are co-located with a replica other than the Multi-Paxos stable leader, by having other replicas transmit their AcceptReplies to both the stable leader and to the replica near the client. Thus, in the common case, the client does not need to wait the additional time for a message to come back from the stable leader, which reduces commit time from four one-way delays to three.
Citation is: [5] Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems, 20(4):398–461, November 2002.
@bdarnell @spencerkimball We do not have the plan to implement the read lease in etcd itself right now. But if you want to design/implement it, I would be happy to help.
I'm taking a look at this now. How far should we take what is outlined in the paper?
It looks like we want leader leases only, so we would not need the full flexibility in granting leases to multiple replicas (and don't need to keep as much of a 'lease state' as they do). The one aspect of the paper that I don't see in Ben's initial post is the 'guard' part of the lease granting message (which avoids losing time/deadlocking when a node bails while the lease is communicated). If we have a relatively low lease duration and refresh the leases automatically on heartbeats (plus any other Raft-traffic that acknowledges the leader), the overhead might not be worth it. We'll definitely need to implement the guarded promises for the initial grant.
Some more questions/remarks:
We will eventually want quorum leases so we can move read load to the followers, although we can start with just leader leases. I think this eventually belongs in upstream raft, although in the short term we can probably move faster by doing the work in multiraft or in our fork of etcd to start.
Any attempt to bias leader selection towards the nodes that have the most traffic should definitely be deferred to a separate issue/PR.
I think this eventually belongs in upstream raft
One difficulty of quorum leases is that the original algorithm needs to be object-aware.
For the leader lease, we can simply buffer the proposals for longer than max(electionTimeout), so that the old leader can step down.
I think we might put the step-down logic into raft.
@bdarnell Also, if we can relax the latency requirement (xx ms vs x ms), it would not be hard for the followers to serve reads (improving throughput) while still ensuring linearizability, IMO.
@xiang90 which latency requirement do you mean? From what I can see, leader leases can piggyback on existing traffic to some extent, but once you want a follower lease, that follower will need its own lease granted by a majority of the cluster, so you're looking at something like elections for each follower who gets a lease.
Object awareness is something that we likely don't need here (given that ranges are supposed to represent locality of the data already), but it could also be implemented in etcd/raft by providing an interface that maps a read message to, say, a uint64 which is then interpreted as an object type.
@tschottdorf Let us step back a little bit. Why do we want the lease? What kind of guarantees do we want to have?
We just want what's described in the paper you posted: There should be a set of nodes which can read locally without having to propose those reads through Raft. The natural choice for the number N of nodes to which that read lease should be given is the size of a quorum (so 3 for a 5-replica Raft, 2 for 3-replica) - more is possible but slows down Raft. Leader leases correspond to the case N=1. Each non-leader needs to talk to a majority of the nodes and obtain a promise from them.
I was asking because I just couldn't figure out what you meant in your last message, but it seemed to suggest that non-leader leases are fairly trivial, which I think isn't true - they're a bit trickier than leader-only leases, albeit not that much. But, again, I might've completely missed your point.
@tschottdorf
There are several ways to provide what you (based on my reading on the cockroach design docs) want. Leasing is just one of them.
To make it clear: 1) we want a client A that sends a read request on Object 1 at T1 to get the most up-to-date result at T2.
T1 is the realtime at which the client sent out the request. T2 is the realtime at which the client received the result.
Most up-to-date means the result is at least as new as the state of the object at T1.
Together with the sequential consistency that Raft (or any consistent replicated log system) already provides, we get linearizability (no reordering of invocations and responses).
Leasing is one way to provide 1).
For the leader lease, it is obvious: the leader holding the lease can reply to read requests without waiting for any network I/O. For quorum leases, any server in the quorum that holds the lease can reply to read requests without waiting for any network I/O. But when you want to update an object, you might have to revoke the lease; a server whose lease has been revoked needs to wait for a new lease.
There is another, simpler way to solve it if we can tolerate extra latency on non-leader servers. Put simply: if a server has heard from the leader about an event that happened after T1, it can reply to all read requests at or before T1. This also ensures 1).
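To make that rule concrete, here is a minimal Go sketch of the check a follower could perform, assuming hypothetical names (`followerReadGate`, `canServeRead`) and real-time timestamps; it is not etcd's or Cockroach's actual API, and the caveat about writes at past timestamps discussed further down still applies.

```go
package lease

import "time"

// followerReadGate tracks the newest leader-originated event a follower has
// learned about (e.g. the timestamp carried by the latest committed entry it
// has applied).
type followerReadGate struct {
	lastLeaderEvent time.Time
}

// onLeaderEvent records that the follower has heard from the leader about an
// event with the given real-time timestamp.
func (g *followerReadGate) onLeaderEvent(t time.Time) {
	if t.After(g.lastLeaderEvent) {
		g.lastLeaderEvent = t
	}
}

// canServeRead reports whether a read at readTS can be answered locally while
// still returning a result at least as new as the state at readTS; if not,
// the follower waits (extra latency) or forwards the read to the leader.
func (g *followerReadGate) canServeRead(readTS time.Time) bool {
	return !readTS.After(g.lastLeaderEvent)
}
```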
Ah, now I get where you're coming from. We have different kinds of reads. What I was thinking about mostly here are a) reads carried out within a recently started transaction and b) reads without an explicit read timestamp (in which case they read the "current" value).
The simplest case is when you just want to read a bunch of values at a fixed, past timestamp (something like Spanner's snapshot read). If the timestamp is far enough in the past, any node should just be able to pass out those values (as long as it has committed log entries which are well past the desired read timestamp). So that part doesn't need leader leases, but it's also not what we want #230 for (at least that's how I understood it) - it's rather something else that we could consider implementing, as it's easy to do and could significantly boost performance for some deployments. If you want stale reads, just read slightly in the past.
Generally though, I think that the majority of reads will probably be a) without explicit timestamp and b) in transactions which don't run long and should be fast (a) and b) are kind of the same thing). Neither of those would want to sit around waiting and then try to profit from snapshot reads, and (I think) we need consensus-level support for that.
Additionally there's the complication of the read-timestamp cache which we haven't thought about in this context yet. Basically a transactional write will return the timestamp of when the key was last read, which is hard to maintain if multiple followers can hand out values without consulting with the leader, no matter the underlying concept (the read-timestamp cache is kept on the leader). That sounds like an additional obstacle to follower leases which I could imagine might kill the idea altogether (?).
It's good that you're bringing this up, as clearly there's a need for clarification and debate.
@tschottdorf I believe leader leases will work fine for the case of your third paragraph (reads at current time or in txns). In practice anyway; adversarial scenarios could easily be invented where a clock could be changed to invalidate the leader lease guarantee. Pushing these reads into consensus would be pretty awful from a latency perspective.
@tschottdorf yes, the timestamp cache complicates matters. But maybe it doesn't kill the idea of using quorum leases. One facet of quorum leases is that they require consensus on writes to be reached on the same set of replicas as reads. This means that you'd send the write to any replica which serviced a [potentially] conflicting read, and that the replica with the conflicting read could refuse consensus. There's still the problem of communicating the conflict back to the reader so that it could update its read timestamp cache and retry. Yuck. But still, I think the idea isn't dead.
@xiang90 Unfortunately, we can't quite say that a read is legal at time T in the event that we've heard from the leader at time > T. The problem is that transactional writes may arrive and modify keys at timestamps in the past (@ transaction's commit timestamp). The leader replica for a range prevents writes from "rewriting" history by keeping track of recent timestamps at which keys were read to push a conflicting transaction's write forward in time.
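As an illustration of that mechanism, here's a rough Go sketch of a read-timestamp cache that pushes a conflicting write forward; the names (`tsCache`, `pushWriteTimestamp`) are illustrative, not Cockroach's actual implementation.

```go
package lease

import "time"

// tsCache records, on the range leader, the latest timestamp at which each
// key was read.
type tsCache struct {
	lastRead map[string]time.Time
}

func newTSCache() *tsCache {
	return &tsCache{lastRead: map[string]time.Time{}}
}

func (c *tsCache) recordRead(key string, ts time.Time) {
	if ts.After(c.lastRead[key]) {
		c.lastRead[key] = ts
	}
}

// pushWriteTimestamp prevents a transactional write from "rewriting history":
// if the key was read at or after the write's proposed commit timestamp, the
// write is pushed to just past the latest read.
func (c *tsCache) pushWriteTimestamp(key string, proposed time.Time) time.Time {
	if last, ok := c.lastRead[key]; ok && !proposed.After(last) {
		return last.Add(time.Nanosecond)
	}
	return proposed
}
```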
@spencerkimball Right. I was not aware of the potential "history rewrite". This actually complicates the problem a little bit, since the internal command ordering can differ from the user-facing ordering.
The leasing approach should still work, since it does not depend on the internal command ordering.
@spencerkimball: yes, all up my alley, my point in that third paragraph was that you require leases for those reads (as opposed to serving stale/past-and-thus-safe values that don't know about Raft).
@spencerkimball: Hmm, you're right, that would work, even though it would slow down writes by at least a factor of two when their keys are busy or worse - if a follower sees a lot of reads the write may just never actually succeed. So we need an additional mechanism that gives writes precedence over reads - possibly just pushing further reads for a key that caused a restart through Raft until the corresponding write succeeds is an okay choice or even not reading locally any more until the current lease expires (which is a pretty short time but makes sure that write bursts don't trickle through with a retry each). Or you could get fancy and keep read/write ratio stats and decide on the fly. Almost no writes? Just serve reads whenever you can. Many reads and writes? Probably want to make sure those writes go through. Luckily most of that logic will be outside of Raft with the appropriate hooks.
The implementation in upstream raft would have to be done very carefully, at the very least we'd need a hook that goes into stepFollower and allows refusal of a proposition (with a custom message) that will be sent to the leader, plus a corresponding hook in stepLeader that takes action when it receives a refusal (before retrying the proposal).
Regarding stale/snapshot reads ("read whatever you find"/"read consistently at the given snapshot"), is that something we want? If so, let's put it in an extra issue.
Here's a stab at lightweight leader leases that piggyback completely on existing messages. When designing this I was mostly concerned about the messaging overhead when there are many Raft groups and I hope that the proposal below strikes a good balance -- see it as a working draft right now; I just wrote it down. Basically the idea is that leases are granted and refreshed implicitly via the existing `MsgApp`/`MsgAppResp` traffic, so no extra messages are needed.
Here's the suggested implementation, please scrutinize thoroughly.
- The leader maintains a quorum-sized set of nodes, the `LeaseQuorum`. It is not to be confused with the set of nodes which hold a read lease as in follower leases (which here is always exactly one node, the leader). As nodes join or leave the consensus group, the set must be updated immediately so that it remains at quorum size (technically it can be larger, but that has only downsides).
- `MsgApp` is extended to contain `LeaseQuorum`; the current `LeaseQuorum` is sent along for each outgoing `MsgApp` (one could argue that it is enough to attach it for members of `LeaseQuorum` only, but I'm going to ignore that for now). In terms of the quorum lease paper, that message serves roughly as a GuardAck, but with additional information.
- `MsgApp` could also contain `leaseDuration` if we want that to be dynamic (which could make sense because it depends on the leader-to-quorum latencies, which in turn vary by leader; there would need to be a config var `maxLeaseDuration` because booting nodes may need to wait out a fixed interval) - that would lead to straightforward changes below. But for now I'll just assume it's a fixed config var.
- When receiving a `MsgApp` from a leader considered legitimate, a follower will take action if it is included in the `LeaseQuorum`:
  - If it holds a lease from a different leader, it could reply with a `MsgAppResp` indicating the problem, but likely not -- see the handling of `EventLeaderElection` below.
  - If it already holds a lease and is in the `LeaseQuorum` for that leader, update the lease expiry to `recvMsgApp+leaseDuration`.
  - If it holds a lease but is no longer in the `LeaseQuorum` for that leader (note that an empty or nonexistent `LeaseQuorum` qualifies), drop that lease and then treat the message as having been received without a lease; see below.
  - If it holds no lease and is in the `LeaseQuorum`, instantiate a lease for the leading node. That includes an expiry of `recvMsgApp+leaseDuration` and a promise not to vote for nodes outside of `LeaseQuorum` (voting for members of `LeaseQuorum` is ok because they know when they are allowed to campaign).
- An `EventLeaderElection` (for a future term) instantly invalidates all grants. No other event will change the follower's leader until the grant expires.
- A (re)booting node has to wait out `leaseDuration` before considering itself leader. That may seem counter-intuitive because granters will not vote for non-granters, but the booting node may actually have issued a grant before it got restarted. To avoid having to deal with this during testing and playing around with Cockroach, it's probably most appropriate to just `time.Sleep(leaseDuration)` when the node turns into a candidate without having sent out a `MsgAppReq` previously (which would have reestablished any missing leases) and `storage.LastIndex() > 0`.
- On the leader: for each `MsgApp` sent to a member of `LeaseQuorum`, remember the timestamp `sendMsgApp+leaseDuration` for that recipient. Upon receiving the corresponding `MsgAppResp` from that member, remember that during `[recvMsgAppResp, sendMsgApp+leaseDuration]` that node is aware of our desire to read free from the worries of consensus.
- To serve a read locally, the leader checks that the current time is covered by the recorded intervals of all members of `LeaseQuorum` (with the empty interval for nodes which haven't granted a lease). If that is true, read locally. Otherwise, send it through Raft (hence trying to acquire a lease automatically). A rough sketch of this leader-side bookkeeping follows below.

Note that nothing here requires including the `LeaseQuorum` in a Raft quorum. That requirement comes with follower leases only. I'm hoping there's nothing completely wrong in the above outline as that would make for a very performant implementation with no extra traffic.
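For concreteness, a minimal Go sketch of that leader-side interval tracking, assuming hypothetical names (`leaseTracker`, `onSendMsgApp`, `canReadLocally`) and a fixed `leaseDuration`; this is not an actual etcd/raft or MultiRaft API.

```go
package lease

import "time"

// interval is the time span during which a follower has promised not to help
// elect another leader.
type interval struct {
	grantedAt, expiresAt time.Time
}

// leaseTracker holds the leader-side bookkeeping for the LeaseQuorum.
type leaseTracker struct {
	leaseDuration time.Duration
	leaseQuorum   map[uint64]bool      // the quorum-sized set of grantors
	sent          map[uint64]time.Time // send time of the last MsgApp per member
	granted       map[uint64]interval  // lease interval per member
}

func newLeaseTracker(leaseDuration time.Duration, quorum []uint64) *leaseTracker {
	t := &leaseTracker{
		leaseDuration: leaseDuration,
		leaseQuorum:   map[uint64]bool{},
		sent:          map[uint64]time.Time{},
		granted:       map[uint64]interval{},
	}
	for _, id := range quorum {
		t.leaseQuorum[id] = true
	}
	return t
}

// onSendMsgApp records the send time; the eventual grant extends at most to
// sendMsgApp+leaseDuration.
func (t *leaseTracker) onSendMsgApp(to uint64, now time.Time) {
	if t.leaseQuorum[to] {
		t.sent[to] = now
	}
}

// onRecvMsgAppResp turns a successful append response into a lease interval
// [recvMsgAppResp, sendMsgApp+leaseDuration] for that member.
func (t *leaseTracker) onRecvMsgAppResp(from uint64, now time.Time) {
	if sentAt, ok := t.sent[from]; ok && t.leaseQuorum[from] {
		t.granted[from] = interval{grantedAt: now, expiresAt: sentAt.Add(t.leaseDuration)}
	}
}

// canReadLocally reports whether now lies within the lease interval of every
// LeaseQuorum member; members that never granted contribute the empty
// interval and thus force the read through Raft.
func (t *leaseTracker) canReadLocally(now time.Time) bool {
	for id := range t.leaseQuorum {
		iv, ok := t.granted[id]
		if !ok || now.Before(iv.grantedAt) || now.After(iv.expiresAt) {
			return false
		}
	}
	return true
}
```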
As for follower leases, my main pet peeve is that there would be a lot of regular traffic between a majority and each lease follower node. For example, in the case of a Raft group of size 5, there would be three read leases minus one leader, leaving two nodes which regularly have to talk to two (three minus the leader) other nodes on top of the regular Raft chatter. That's like holding elections approximately every heartbeat. That times millions of groups is not something we want to do and the only way out, I think, is TrueTime. I've mentioned this a couple of times but nobody has really commented on it.
Happy to be convinced otherwise, but in the meantime I would make the case that since we're moving data around with high granularity on a per Range basis, each logical node will hold many groups and only lead a third of them. That should spread read load fairly evenly except for isolated hotspots, in which case a) the range should split there based on load and/or b) we should offer stale reads.
Of course it would be great to be able to have clients far away from the leader read from their local replica, but unless there's a sleek way I'm not seeing, this isn't something we can do efficiently for large multi-datacenter deployments with a lot of data (and hence Raft groups), which I suppose are the ones that would want that feature most.
That, plus fairly involved changes in Raft, plus the fact that Cockroach and Raft would have to be very tightly coupled (followers would have to pause Raft, go through all acknowledged-but-uncommitted log entries, inspect them and make sure that there are no updates to the key that is being read on a lease; plus the read-timestamp cache issue above) - I suspect we'll be better off efficiently implementing leader leases for now (note that the design above could be extended to quorum leases later).
These comments apply to your comments two posts back...
@tschottdorf yes, we want a separate issue for inconsistent writes. We actually want to do those as a matter of course when looking up meta[12] indexing entries.
@tschottdorf: having the follower exit the quorum lease in the event of a write / read-timestamp-cache conflict would be too coarse-grained. I would favor something simpler: push the transaction forward on a leader write to current time + a reasonable delta. The "reasonable delta" would be enough to cover the likely latency of pushing the newly-advanced write to the consensus group.
@spencerkimball you mean inconsistent reads, right?
@spencerkimball can you elaborate? I'm assuming you'll still have the feedback from the node, so you try to append something and get an error which tells you that the follower's timestamp cache has a newer entry. Now what? You retry the write with that higher timestamp plus reasonable delta? That would mean writing into the future deliberately, hence causing restarts for the reads that were blocking the write in the first place. If that's not the intention, then what's the delta for?
I think I may have misunderstood your proposal. We might need to do a hangout to sort it out if so.
I had a different concept for leader leases in mind. Leadership itself is the result of a quorum. The interval for which a leader holds its lease can always be measured from the time at which leadership was proposed until a quorum of followers reaches heartbeat expiration.
The leader holds the read lease for the range so long as its wall time falls within the lease expiration, and can serve reads locally when this is true, without needing to resort to a Raft consensus.
My current opinion is that for V1, reading from leader will work just fine. More efficient quorum leases are something we'll definitely need to figure out for next version though.
My proposal is exactly the same concept, but it takes into account the fact that you must never rely on coalesced heartbeats for anything. The leader's original heartbeats are simply dropped, and heartbeats received by followers mean nothing except that the last known leader had been alive at some point in the past. So unless you want to introduce new heartbeats which cannot be coalesced and hence invalidate a basic motivation for multiraft, I think you will always end up close to my suggestion, which is to use MsgApp and its response to set the lease intervals.
It all depends on the traffic issue I've mentioned in almost every posting this week. That's the enemy and needs to be discussed more than anything else.
I'm in the car for the rest of the night, but let's HO tomorrow or Monday.
Implementations of quorum leases in which certain nodes are allowed to serve reads locally based on time intervals are tricky since they require a lot of communication (at least once you move away from pure leader leases). I think we can eliminate the reference to clock time completely from the core logic and replace it by constraints on the leadership transition.
What we get is pretty close to the holy grail:
_Disclaimer: text below is the result of not getting much sleep last night._
In a nutshell, the basic idea is the following:
- Add a `ReadLeaseConfig` entry type that the leader may use in `MsgApp`. It contains a list of nodes which are granted a read lease. The leader may send out such an entry whenever it desires, changing the configuration as it deems best. The `ReadLeaseConfig` will usually include the leader who proposed it, but that is not a requirement. There is also no need for the set to form a quorum - it can be anything from empty (disabling read leases) to everybody (not smart, but possible).
- The active `ReadLeaseConfig` of a node is its last committed entry in the current term. Any state transition (follower, leader, candidate) should reset the active config to the empty one (which is also the initial state).
- The leases are in effect on every node that has itself in its active `ReadLeaseConfig` and is in either leader or follower state, unless explicitly suspended. When a lease is in effect, the node may serve reads locally, indefinitely, for as long as its active `ReadLeaseConfig` supports it.
- A node may only step up as leader and attempt to commit a command if it has the hosts from its active `ReadLeaseConfig` voting for it. But all nodes campaign as usual. Note the particular wording: a leader may step up without all the required votes, but it must not commit anything until either the votes missing from `ReadLeaseConfig` arrive or the `ReadLeaseConfig` gets changed by mechanisms in the next section (see the sketch below).
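To pin the idea down, here's a minimal Go sketch of the `ReadLeaseConfig` entry and the commit criterion for a new leader, with hypothetical names (`leaderState`, `canCommit`); it is a sketch of the proposal above, not an actual etcd/raft change.

```go
package lease

// ReadLeaseConfig is a log entry the leader appends; the listed nodes may
// serve reads locally once their copy of the entry is committed (and applied).
type ReadLeaseConfig struct {
	Nodes []uint64 // node IDs holding a read lease; may be empty
}

// leaderState tracks, on a freshly elected leader, which members of the
// previously active ReadLeaseConfig have voted for it or have been forcibly
// removed (see the revocation mechanism discussed below).
type leaderState struct {
	activeConfig    ReadLeaseConfig
	votedForUs      map[uint64]bool
	forciblyRemoved map[uint64]bool
}

// canCommit reports whether the new leader may advance the commit index:
// every lease holder from the active config must either have voted for us or
// have been forcibly removed.
func (s *leaderState) canCommit() bool {
	for _, id := range s.activeConfig.Nodes {
		if !s.votedForUs[id] && !s.forciblyRemoved[id] {
			return false
		}
	}
	return true
}
```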
That sounds almost too good to be true, and of course it is. As long as the nodes from the active `ReadLeaseConfig` are responsive, everything is fine. But what if one of those nodes dies? Without a further ingredient, that will simply deadlock Raft since no more progress can be made and no elections can succeed.
For that case, we need forced lease revocation.
This section is where timing makes a comeback, albeit in a weaker form. Basically the idea is to exclude the offending (i.e. non-responding) node from receiving heartbeats, which within some maximal amount of time must have the offender turn into a candidate.
We know that we need to take action when a lease-holding node stops responding to the leader's heartbeats, or when it stops making progress on its log. Usually both will be true simultaneously (but there are certain contrived scenarios in which it's different, hence the fine distinction).
If either of the prerequisites for this section holds, we want the leader to exclude the offending follower from receiving any more heartbeats for a limited number of Raft ticks.
This is incompatible with our current implementation of coalesced heartbeats, but I have a change in mind that will make this possible - I'll skip those details here for brevity and write the presentation for single-node Raft.
- When a node that considers itself leader does not receive a `MsgHeartbeatResp` or (successful) `MsgAppResp` from a majority of its (supposed) followers in due time, it immediately suspends the local read lease (if any) until the majority is back.
- When a node that considers itself leader does not receive a `MsgHeartbeatResp` or `MsgAppResp` from a node within a few consecutive `MsgHeartbeat`s it sends to that node, mark that node for forced removal. That means that the node will skip a fixed number of heartbeats, large enough so that the offender will have to know something is wrong, and that it will be removed from `ReadLeaseConfig` when the next heartbeat is sent out to that node. Once that has happened, the leader may proceed to send that new `ReadLeaseConfig` out into the cluster (injected before all other pending commands). The effect of this is to invalidate the offending node's lease by making sure it isn't heartbeated (if it thinks it's a leader, then it'll suspend its lease by means of not having had regular contact with a majority; see above).
- Similarly to the last point, mark the node for forced removal if it doesn't make progress (with a suitable criterion). Likely a per-peer tick counter on the leader, which ticks along as long as the node is behind the leader and resets for each `MsgAppResp`, is appropriate. Having something like that would be nice anyway, to have a notion of a follower being active. In practice, this should be exceedingly rare: the node would have to reply to heartbeats, but not make progress. It's possibly a little less rare with MultiRaft though, for example if one of the Raft groups deadlocks and coalesced heartbeats continue. (A sketch of this per-peer bookkeeping follows below.)
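A minimal Go sketch of that per-peer bookkeeping, assuming hypothetical names (`revocationTracker`, `maxMissedHeartbeats`) and single-node (non-coalesced) heartbeats; not the actual etcd/raft or MultiRaft code.

```go
package lease

// peerActivity tracks how responsive a single follower has been.
type peerActivity struct {
	missedHeartbeats int // heartbeats sent without a MsgHeartbeatResp since the last response
	stalledTicks     int // ticks during which the peer lagged the leader's log
}

type revocationTracker struct {
	peers               map[uint64]*peerActivity
	maxMissedHeartbeats int
	maxStalledTicks     int
}

func newRevocationTracker(maxMissedHeartbeats, maxStalledTicks int) *revocationTracker {
	return &revocationTracker{
		peers:               map[uint64]*peerActivity{},
		maxMissedHeartbeats: maxMissedHeartbeats,
		maxStalledTicks:     maxStalledTicks,
	}
}

func (r *revocationTracker) peer(id uint64) *peerActivity {
	p, ok := r.peers[id]
	if !ok {
		p = &peerActivity{}
		r.peers[id] = p
	}
	return p
}

// onHeartbeatSent is called whenever a heartbeat goes out to a peer; once the
// peer has missed too many responses it gets marked for forced removal from
// the ReadLeaseConfig (and is then skipped for a number of heartbeats).
func (r *revocationTracker) onHeartbeatSent(id uint64) (forceRemove bool) {
	p := r.peer(id)
	p.missedHeartbeats++
	return p.missedHeartbeats > r.maxMissedHeartbeats
}

// onHeartbeatResp resets the miss counter for a responsive peer.
func (r *revocationTracker) onHeartbeatResp(id uint64) {
	r.peer(id).missedHeartbeats = 0
}

// onTick advances the stall counter for a peer that is behind the leader's
// log; a successful MsgAppResp resets it via onAppendResp.
func (r *revocationTracker) onTick(id uint64, behind bool) (forceRemove bool) {
	p := r.peer(id)
	if behind {
		p.stalledTicks++
	}
	return p.stalledTicks > r.maxStalledTicks
}

func (r *revocationTracker) onAppendResp(id uint64) {
	r.peer(id).stalledTicks = 0
}
```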
If a new leader steps up (this is likely in the scenario of the old leader dying), it may not commit a command until it has received votes from the `ReadLeaseConfig` active before it stepped up (but it will heartbeat clients and be, from the view of others, a normal legitimate leader). It may, however, send append requests, but without advancing the `Commit` index. The mechanism for forcible removal will be in force, so once the old leader misses a couple of heartbeats or does not respond to `MsgApp`, it will be forcibly removed. The new leader may then, as its first course of action, append a new `ReadLeaseConfig` to its followers.
This is fairly elegant because as we time-out the dead nodes with leases, Raft "almost" makes progress (except that it does not apply to the state machine). That is way better than having everything go into lockdown.
- Implementation could and should be done very cleanly in `etcd/raft`. But we will have to figure out everything related to coalesced heartbeats very carefully in advance and provide the right configurability.
- Skipping the heartbeats (as part of forced revocations) can't really be communicated to MultiRaft unless we put things into Raft that look weird, unfortunately. Likely we'll have some code in MultiRaft emulating this plus the code in `etcd/raft` doing the same thing (unless we find a good way to get that information out of `etcd/raft`).
- Coalesced heartbeats should be sent more often than the underlying real heartbeats. This is to counteract timing issues which can arise in the above algorithms due to the fact that group-sent heartbeats and responses are discarded, and hence some time may pass between when the Raft group thinks its heartbeat went out and when it really happened. `etcd/raft` is lax on this point because it didn't matter, but it becomes important here to avoid a stale leader interfering with a forced lease revocation.
- Membership changes need some more thinking, but there should be ways to handle them (trivially, we could enforce a complete revocation of all leases before adding a node and only put leases back when the transition is over, but there's probably a much better way).
For us, this comes free with coalesced heartbeats. But upstream needs to be more aware of it than it is now.
Using the latest committed (really, latest applied) ReadLeaseConfig is problematic for the same reasons as for config changes (discussed on page 36 of thesis.pdf): there is no acknowledgement to the leader when a node considers an entry committed or applied. I think to be conservative we should process ReadLeaseConfig at both append and apply time: leases are revoked as soon as the entry is appended (so the MsgAppResp serves as confirmation that the lease has been revoked) and new leases are added in a pending state when the entry is appended and become final when it is applied. Pending leases are used for the leader safety criterion (a new leader cannot commit until it has heard from or revoked all nodes with pending leases), but a node cannot serve reads until the entry that granted it the lease is committed and applied.
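A rough Go sketch of that two-phase handling, using hypothetical hooks (`onAppend`, `onApply`) and per-node lease states; this is an illustration of the scheme described above, not an actual implementation.

```go
package lease

type leaseState int

const (
	leaseNone    leaseState = iota
	leasePending            // ReadLeaseConfig entry appended, not yet applied
	leaseActive             // entry committed and applied; local reads allowed
)

// nodeLeases tracks the lease state of each node as ReadLeaseConfig entries
// move through the log.
type nodeLeases struct {
	states map[uint64]leaseState
}

func newNodeLeases() *nodeLeases {
	return &nodeLeases{states: map[uint64]leaseState{}}
}

// onAppend runs when a ReadLeaseConfig entry listing the granted nodes is
// appended to the local log: nodes dropped from the config lose their lease
// immediately (so the MsgAppResp doubles as the revocation confirmation),
// while newly added nodes only become pending.
func (n *nodeLeases) onAppend(granted []uint64) {
	inConfig := map[uint64]bool{}
	for _, id := range granted {
		inConfig[id] = true
	}
	for id := range n.states {
		if !inConfig[id] {
			n.states[id] = leaseNone // revoked at append time
		}
	}
	for id := range inConfig {
		if n.states[id] == leaseNone {
			n.states[id] = leasePending
		}
	}
}

// onApply runs once the same entry is committed and applied; pending grants
// become usable for local reads.
func (n *nodeLeases) onApply(granted []uint64) {
	for _, id := range granted {
		n.states[id] = leaseActive
	}
}

// mustWaitFor is the leader safety criterion: a new leader cannot commit
// until it has heard from (or forcibly revoked) every node whose lease is
// pending or active.
func (n *nodeLeases) mustWaitFor(id uint64) bool {
	return n.states[id] != leaseNone
}
```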
Instead of stopping delivery of MsgHeartbeat, we could add a RevokeLease flag to MsgHeartbeat (and the response). This might be easier to route through multiraft, since you'd have an explicit indicator of what's going on, instead of the absence of a message.
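For illustration only, the flag could look roughly like this; these structs are hypothetical stand-ins, not the real raftpb message definitions.

```go
package lease

// heartbeat mirrors just the fields relevant to the idea above.
type heartbeat struct {
	From, To, Term uint64
	RevokeLease    bool // leader asks the recipient to stop serving local reads
}

// heartbeatResp acknowledges the revocation explicitly, which is easier to
// route through MultiRaft than the absence of a heartbeat.
type heartbeatResp struct {
	From, To, Term uint64
	LeaseRevoked   bool
}
```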
Leader leases have been implemented as of #604
We need to introduce a concept of leader leases so a leader can take read-only actions without writing them to the raft log. This means