Have you looked at http://www.pdl.cmu.edu/PDL-FTP/associated/CMU-PDL-14-105.pdf?
Wow, this is excellent. I've been lying awake at night worrying about the shortcomings of leader leases as I'd imagined they had to work.
@xiang90 I hadn't seen that; thanks for the reference. It looks like a great fit for our needs.
(A reminder to whoever replied by email that your name doesn't come through on github when you do that, so you may want to come to github to post directly and be identified)
Sorry, that was me.
Are there any plans to implement the performance optimization on writes mentioned in the paper?
Because a major focus for all leasing strategies is to reduce latency in the wide area, we implemented as part of our baseline a latency optimization described by Castro [5]. This optimization reduces the commit latency perceived by clients that are co-located with a replica other than the Multi-Paxos stable leader, by having other replicas transmit their AcceptReplies to both the stable leader and to the replica near the client. Thus, in the common case, the client does not need to wait the additional time for a message to come back from the stable leader, which reduces commit time from four one-way delays to three.
Citation is: [5] Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems, 20(4):398–461, November 2002.
@bdarnell @spencerkimball We do not have the plan to implement the read lease in etcd itself right now. But if you want to design/implement it, I would be happy to help.
I'm taking a look at this now. How far should we take what is outlined in the paper?
It looks like we want leader leases only, so we would not need the full flexibility in granting leases to multiple replicas (and don't need to keep as much of a 'lease state' as they do). The one aspect of the paper that I don't see in Ben's initial post is the 'guard' part of the lease granting message (which avoids losing time/deadlocking when a node bails while the lease is communicated). If we have a relatively low lease duration and refresh the leases automatically on heartbeats (plus any other Raft-traffic that acknowledges the leader), the overhead might not be worth it. We'll definitely need to implement the guarded promises for the initial grant.
Some more questions/remarks:
We will eventually want quorum leases so we can move read load to the followers, although we can start with just leader leases. I think this eventually belongs in upstream raft, although in the short term we can probably move faster by doing the work in multiraft or in our fork of etcd to start.
Any attempt to bias leader selection towards the nodes that have the most traffic should definitely be deferred to a separate issue/PR.
I think this eventually belongs in upstream raft
One difficulty of quorum leases is that the original algorithm needs to be object-aware.
For the leader lease, we can simply buffer the proposals for longer than max(electionTimeout), so that the old leader can step down.
I think we might put the step-down logic into raft.
@bdarnell Also, if we can relax the latency requirement (xx ms vs x ms), it would not be hard for the followers to serve reads (improving throughput) while still ensuring linearizability, IMO.
@xiang90 which latency requirement do you mean? From what I can see, leader leases can piggyback on existing traffic to some extent, but once you want a follower lease, that follower will need its own lease granted by a majority of the cluster, so you're looking at something like elections for each follower who gets a lease.
Object awareness is something that we likely don't need here (given that ranges are supposed to represent locality of the data already), but it could also be implemented in etcd/raft by providing an interface that maps a read message to, say, a uint64 which is then interpreted as an object type.
@tschottdorf Let us step back a little bit. Why do we want the lease? What kind of guarantees do we want to have?
We just want what's described in the paper you posted: There should be a set of nodes which can read locally without having to propose those reads through Raft. The natural choice for the number N of nodes to which that read lease should be given is the size of a quorum (so 3 for a 5-replica Raft, 2 for 3-replica) - more is possible but slows down Raft. Leader leases correspond to the case N=1. Each non-leader needs to talk to a majority of the nodes and obtain a promise from them.
I was asking because I just couldn't figure out what you meant in your last message, but it seemed to suggest that non-leader leases are fairly trivial, which I think isn't true - they're a bit trickier than leader-only leases, albeit not that much. But, again, I might've completely missed your point.
@tschottdorf
There are several ways to provide what you (based on my reading on the cockroach design docs) want. Leasing is just one of them.
To make it clear: 1) we want a client A that sends a read request on Object 1 at T1 to get the most up-to-date result at T2.
T1 is the realtime at which the client sent out the request. T2 is the realtime at which the client received the result.
Most up-to-date means the result is at least as new as the state of the object at T1.
Together with the sequential consistency that Raft (or any consistent replicated log system) already provides, we get linearizability (no reordering of invocations and responses).
Leasing is one way to provide 1).
For the leader lease, it is obvious: the leader holding the lease can reply to read requests without waiting for any network I/O. For quorum leases, any server in the quorum that holds the lease can reply to read requests without waiting for any network I/O. But when you want to update an object, you might have to revoke the lease; a server whose lease has been revoked needs to wait for a new lease.
There is another, simpler way to solve it if we can tolerate extra latency on non-leader servers. Put simply: if a server has heard from the leader about an event that happened after T1, it can reply to all read requests at or before T1. This also ensures 1).
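To make that rule concrete, here is a minimal Go sketch of the check a follower could perform, assuming hypothetical names (`followerReadGate`, `canServeRead`) and real-time timestamps; it is not etcd's or Cockroach's actual API, and the caveat about writes at past timestamps discussed further down still applies.

```go
package lease

import "time"

// followerReadGate tracks the newest leader-originated event a follower has
// learned about (e.g. the timestamp carried by the latest committed entry it
// has applied).
type followerReadGate struct {
	lastLeaderEvent time.Time
}

// onLeaderEvent records that the follower has heard from the leader about an
// event with the given real-time timestamp.
func (g *followerReadGate) onLeaderEvent(t time.Time) {
	if t.After(g.lastLeaderEvent) {
		g.lastLeaderEvent = t
	}
}

// canServeRead reports whether a read at readTS can be answered locally while
// still returning a result at least as new as the state at readTS; if not,
// the follower waits (extra latency) or forwards the read to the leader.
func (g *followerReadGate) canServeRead(readTS time.Time) bool {
	return !readTS.After(g.lastLeaderEvent)
}
```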
Ah, now I get where you're coming from. We have different kinds of reads. What I was thinking about mostly here are a) reads carried out within a recently started transaction and b) reads without an explicit read timestamp (in which case they read the "current" value).
The simplest case is when you just want to read a bunch of values at a fixed, past timestamp (something like Spanner's snapshot read). If the timestamp is far enough in the past, any node should just be able to pass out those values (as long as it has committed log entries which are well past the desired read timestamp). So that part doesn't need leader leases, but it's also not what we want #230 for (at least that's how I understood it) - it's rather something else that we could consider implementing, as it's easy to do and could significantly boost performance for some deployments. If you want stale reads, just read slightly in the past.
Generally though, I think that the majority of reads will probably be a) without explicit timestamp and b) in transactions which don't run long and should be fast (a) and b) are kind of the same thing). Neither of those would want to sit around waiting and then try to profit from snapshot reads, and (I think) we need consensus-level support for that.
Additionally there's the complication of the read-timestamp cache which we haven't thought about in this context yet. Basically a transactional write will return the timestamp of when the key was last read, which is hard to maintain if multiple followers can hand out values without consulting with the leader, no matter the underlying concept (the read-timestamp cache is kept on the leader). That sounds like an additional obstacle to follower leases which I could imagine might kill the idea altogether (?).
It's good that you're bringing this up, as clearly there's a need for clarification and debate.
@tschottdorf I believe leader leases will work fine for the case of your third paragraph (reads at current time or in txns). In practice anyway; adversarial scenarios could easily be invented where a clock could be changed to invalidate the leader lease guarantee. Pushing these reads into consensus would be pretty awful from a latency perspective.
@tschottdorf yes, the timestamp cache complicates matters. But maybe it doesn't kill the idea of using quorum leases. One facet of quorum leases is that they require consensus on writes to be reached on the same set of replicas as reads. This means that you'd send the write to any replica which serviced a [potentially] conflicting read, and that the replica with the conflicting read could refuse consensus. There's still the problem of communicating the conflict back to the reader so that it could update its read timestamp cache and retry. Yuck. But still, I think the idea isn't dead.
@xiang90 Unfortunately, we can't quite say that a read is legal at time T in the event that we've heard from the leader at time > T. The problem is that transactional writes may arrive and modify keys at timestamps in the past (@ transaction's commit timestamp). The leader replica for a range prevents writes from "rewriting" history by keeping track of recent timestamps at which keys were read to push a conflicting transaction's write forward in time.
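As an illustration of that mechanism, here's a rough Go sketch of a read-timestamp cache that pushes a conflicting write forward; the names (`tsCache`, `pushWriteTimestamp`) are illustrative, not Cockroach's actual implementation.

```go
package lease

import "time"

// tsCache records, on the range leader, the latest timestamp at which each
// key was read.
type tsCache struct {
	lastRead map[string]time.Time
}

func newTSCache() *tsCache {
	return &tsCache{lastRead: map[string]time.Time{}}
}

func (c *tsCache) recordRead(key string, ts time.Time) {
	if ts.After(c.lastRead[key]) {
		c.lastRead[key] = ts
	}
}

// pushWriteTimestamp prevents a transactional write from "rewriting history":
// if the key was read at or after the write's proposed commit timestamp, the
// write is pushed to just past the latest read.
func (c *tsCache) pushWriteTimestamp(key string, proposed time.Time) time.Time {
	if last, ok := c.lastRead[key]; ok && !proposed.After(last) {
		return last.Add(time.Nanosecond)
	}
	return proposed
}
```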
@spencerkimball Right. I was not aware of the potential "history rewrite". This actually complicates the problem a little bit, since the internal command ordering can differ from the user-facing ordering.
The leasing approach should still work, since it does not depend on the internal command ordering.
@spencerkimball: yes, all up my alley, my point in that third paragraph was that you require leases for those reads (as opposed to serving stale/past-and-thus-safe values that don't know about Raft).
@spencerkimball: Hmm, you're right, that would work, even though it would slow down writes by at least a factor of two when their keys are busy or worse - if a follower sees a lot of reads the write may just never actually succeed. So we need an additional mechanism that gives writes precedence over reads - possibly just pushing further reads for a key that caused a restart through Raft until the corresponding write succeeds is an okay choice or even not reading locally any more until the current lease expires (which is a pretty short time but makes sure that write bursts don't trickle through with a retry each). Or you could get fancy and keep read/write ratio stats and decide on the fly. Almost no writes? Just serve reads whenever you can. Many reads and writes? Probably want to make sure those writes go through. Luckily most of that logic will be outside of Raft with the appropriate hooks.
The implementation in upstream raft would have to be done very carefully, at the very least we'd need a hook that goes into stepFollower and allows refusal of a proposition (with a custom message) that will be sent to the leader, plus a corresponding hook in stepLeader that takes action when it receives a refusal (before retrying the proposal).
Regarding stale/snapshot reads ("read whatever you find"/"read consistently at the given snapshot"), is that something we want? If so, let's put it in an extra issue.
Here's a stab at lightweight leader leases that piggyback completely on existing messages. When designing this I was mostly concerned about the messaging overhead when there are many Raft groups and I hope that the proposal below strikes a good balance -- see it as a working draft right now; I just wrote it down. Basically the idea is that leases are granted and refreshed implicitly via the existing `MsgApp`/`MsgAppResp` traffic, so no extra messages are needed.
Here's the suggested implementation, please scrutinize thoroughly.
- The leader maintains a quorum-sized set of nodes, the `LeaseQuorum`. It is not to be confused with the set of nodes which hold a read lease as in follower leases (which here is always exactly one node, the leader). As nodes join or leave the consensus group, the set must be updated immediately so that it remains at quorum size (technically it can be larger, but that has only downsides).
- `MsgApp` is extended to contain `LeaseQuorum`; the current `LeaseQuorum` is sent along for each outgoing `MsgApp` (one could argue that it is enough to attach it for members of `LeaseQuorum` only, but I'm going to ignore that for now). In terms of the quorum lease paper, that message serves roughly as a GuardAck, but with additional information.
- `MsgApp` could also contain `leaseDuration` if we want that to be dynamic (which could make sense because it depends on the leader-to-quorum latencies, which in turn vary by leader; there would need to be a config var `maxLeaseDuration` because booting nodes may need to wait out a fixed interval) - that would lead to straightforward changes below. But for now I'll just assume it's a fixed config var.
- When receiving a `MsgApp` from a leader considered legitimate, a follower will take action if it is included in the `LeaseQuorum`:
  - If it holds a lease from a different leader, it could reply with a `MsgAppResp` indicating the problem, but likely not -- see the handling of `EventLeaderElection` below.
  - If it already holds a lease and is in the `LeaseQuorum` for that leader, update the lease expiry to `recvMsgApp+leaseDuration`.
  - If it holds a lease but is no longer in the `LeaseQuorum` for that leader (note that an empty or nonexistent `LeaseQuorum` qualifies), drop that lease and then treat the message as having been received without a lease; see below.
  - If it holds no lease and is in the `LeaseQuorum`, instantiate a lease for the leading node. That includes an expiry of `recvMsgApp+leaseDuration` and a promise not to vote for nodes outside of `LeaseQuorum` (voting for members of `LeaseQuorum` is ok because they know when they are allowed to campaign).
- An `EventLeaderElection` (for a future term) instantly invalidates all grants. No other event will change the follower's leader until the grant expires.
- A (re)booting node has to wait out `leaseDuration` before considering itself leader. That may seem counter-intuitive because granters will not vote for non-granters, but the booting node may actually have issued a grant before it got restarted. To avoid having to deal with this during testing and playing around with Cockroach, it's probably most appropriate to just `time.Sleep(leaseDuration)` when the node turns into a candidate without having sent out a `MsgAppReq` previously (which would have reestablished any missing leases) and `storage.LastIndex() > 0`.
- On the leader: for each `MsgApp` sent to a member of `LeaseQuorum`, remember the timestamp `sendMsgApp+leaseDuration` for that recipient. Upon receiving the corresponding `MsgAppResp` from that member, remember that during `[recvMsgAppResp, sendMsgApp+leaseDuration]` that node is aware of our desire to read free from the worries of consensus.
- To serve a read locally, the leader checks that the current time is covered by the recorded intervals of all members of `LeaseQuorum` (with the empty interval for nodes which haven't granted a lease). If that is true, read locally. Otherwise, send it through Raft (hence trying to acquire a lease automatically). A rough sketch of this leader-side bookkeeping follows below.

Note that nothing here requires including the `LeaseQuorum` in a Raft quorum. That requirement comes with follower leases only. I'm hoping there's nothing completely wrong in the above outline as that would make for a very performant implementation with no extra traffic.
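For concreteness, a minimal Go sketch of that leader-side interval tracking, assuming hypothetical names (`leaseTracker`, `onSendMsgApp`, `canReadLocally`) and a fixed `leaseDuration`; this is not an actual etcd/raft or MultiRaft API.

```go
package lease

import "time"

// interval is the time span during which a follower has promised not to help
// elect another leader.
type interval struct {
	grantedAt, expiresAt time.Time
}

// leaseTracker holds the leader-side bookkeeping for the LeaseQuorum.
type leaseTracker struct {
	leaseDuration time.Duration
	leaseQuorum   map[uint64]bool      // the quorum-sized set of grantors
	sent          map[uint64]time.Time // send time of the last MsgApp per member
	granted       map[uint64]interval  // lease interval per member
}

func newLeaseTracker(leaseDuration time.Duration, quorum []uint64) *leaseTracker {
	t := &leaseTracker{
		leaseDuration: leaseDuration,
		leaseQuorum:   map[uint64]bool{},
		sent:          map[uint64]time.Time{},
		granted:       map[uint64]interval{},
	}
	for _, id := range quorum {
		t.leaseQuorum[id] = true
	}
	return t
}

// onSendMsgApp records the send time; the eventual grant extends at most to
// sendMsgApp+leaseDuration.
func (t *leaseTracker) onSendMsgApp(to uint64, now time.Time) {
	if t.leaseQuorum[to] {
		t.sent[to] = now
	}
}

// onRecvMsgAppResp turns a successful append response into a lease interval
// [recvMsgAppResp, sendMsgApp+leaseDuration] for that member.
func (t *leaseTracker) onRecvMsgAppResp(from uint64, now time.Time) {
	if sentAt, ok := t.sent[from]; ok && t.leaseQuorum[from] {
		t.granted[from] = interval{grantedAt: now, expiresAt: sentAt.Add(t.leaseDuration)}
	}
}

// canReadLocally reports whether now lies within the lease interval of every
// LeaseQuorum member; members that never granted contribute the empty
// interval and thus force the read through Raft.
func (t *leaseTracker) canReadLocally(now time.Time) bool {
	for id := range t.leaseQuorum {
		iv, ok := t.granted[id]
		if !ok || now.Before(iv.grantedAt) || now.After(iv.expiresAt) {
			return false
		}
	}
	return true
}
```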
As for follower leases, my main pet peeve is that there would be a lot of regular traffic between a majority and each lease follower node. For example, in the case of a Raft group of size 5, there would be three read leases minus one leader, leaving two nodes which regularly have to talk to two (three minus the leader) other nodes on top of the regular Raft chatter. That's like holding elections approximately every heartbeat. That times millions of groups is not something we want to do and the only way out, I think, is TrueTime. I've mentioned this a couple of times but nobody has really commented on it.
Happy to be convinced otherwise, but in the meantime I would make the case that since we're moving data around with high granularity on a per Range basis, each logical node will hold many groups and only lead a third of them. That should spread read load fairly evenly except for isolated hotspots, in which case a) the range should split there based on load and/or b) we should offer stale reads.
Of course it would be great to be able to have clients far away from the leader read from their local replica, but unless there's a sleek way I'm not seeing, this isn't something we can do efficiently for large multi-datacenter deployments with a lot of data (and hence Raft groups), which I suppose are the ones that would want that feature most.
That, plus fairly involved changes in Raft, plus the fact that Cockroach and Raft would have to be very tightly coupled (followers would have to pause Raft, go through all acknowledged-but-uncommitted log entries, inspect them and make sure that there are no updates to the key that is being read on a lease; plus the read-timestamp cache issue above) - I suspect we'll be better off efficiently implementing leader leases for now (note that the design above could be extended to quorum leases later).
These comments apply to your comments two posts back...
@tschottdorf yes, we want a separate issue for inconsistent writes. We actually want to do those as a matter of course when looking up meta[12] indexing entries.
@tschottdorf: having the follower exit the quorum lease in the event of a write / read-timestamp-cache conflict would be too coarse-grained. I would favor something simpler: push the transaction forward on a leader write to current time + a reasonable delta. The "reasonable delta" would be enough to cover the likely latency of pushing the newly-advanced write to the consensus group.
@spencerkimball you mean inconsistent reads, right?
@spencerkimball can you elaborate? I'm assuming you'll still have the feedback from the node, so you try to append something and get an error which tells you that the follower's timestamp cache has a newer entry. Now what? You retry the write with that higher timestamp plus reasonable delta? That would mean writing into the future deliberately, hence causing restarts for the reads that were blocking the write in the first place. If that's not the intention, then what's the delta for?
I think I may have misunderstood your proposal. We might need to do a hangout to sort it out if so.
I had a different concept for leader leases in mind. Leadership itself is the result of a quorum. The interval for which a leader holds its lease can always be measured from the time at which leadership was proposed until a quorum of followers reaches heartbeat expiration.
The leader holds the read lease for the range so long as its wall time falls within the lease expiration, and can serve reads locally when this is true, without needing to resort to a Raft consensus.
My current opinion is that for V1, reading from leader will work just fine. More efficient quorum leases are something we'll definitely need to figure out for next version though.
My proposal is exactly the same concept, but it takes into account the fact that you must never rely on coalesced heartbeats for anything. The leader's original heartbeats are simply dropped, and heartbeats received by followers mean nothing except that the last known leader had been alive at some point in the past. So unless you want to introduce new heartbeats which cannot be coalesced and hence invalidate a basic motivation for multiraft, I think you will always end up close to my suggestion, which is to use MsgApp and its response to set the lease intervals.
It all depends on the traffic issue I've mentioned in almost every posting this week. That's the enemy and needs to be discussed more than anything else.
I'm in the car for the rest of the night, but let's HO tomorrow or Monday.
Implementations of quorum leases in which certain nodes are allowed to serve reads locally based on time intervals are tricky since they require a lot of communication (at least once you move away from pure leader leases). I think we can eliminate the reference to clock time completely from the core logic and replace it by constraints on the leadership transition.
What we get is pretty close to the holy grail:
_Disclaimer: text below is the result of not getting much sleep last night._
In a nutshell, the basic idea is the following:
- Add a `ReadLeaseConfig` entry type that the leader may use in `MsgApp`. It contains a list of nodes which are granted a read lease. The leader may send out such an entry whenever it desires, changing the configuration as it deems best. The `ReadLeaseConfig` will usually include the leader who proposed it, but that is not a requirement. There is also no need for the set to form a quorum - it can be anything from empty (disabling read leases) to everybody (not smart, but possible).
- The active `ReadLeaseConfig` of a node is its last committed entry in the current term. Any state transition (follower, leader, candidate) should reset the active config to the empty one (which is also the initial state).
- The leases are in effect on every node that has itself in its active `ReadLeaseConfig` and is in either leader or follower state, unless explicitly suspended. When a lease is in effect, the node may serve reads locally, indefinitely, for as long as its active `ReadLeaseConfig` supports it.
- A node may only step up as leader and attempt to commit a command if it has the hosts from its active `ReadLeaseConfig` voting for it. But all nodes campaign as usual. Note the particular wording: a leader may step up without all the required votes, but it must not commit anything until either the votes missing from `ReadLeaseConfig` arrive or the `ReadLeaseConfig` gets changed by mechanisms in the next section (see the sketch below).
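To pin the idea down, here's a minimal Go sketch of the `ReadLeaseConfig` entry and the commit criterion for a new leader, with hypothetical names (`leaderState`, `canCommit`); it is a sketch of the proposal above, not an actual etcd/raft change.

```go
package lease

// ReadLeaseConfig is a log entry the leader appends; the listed nodes may
// serve reads locally once their copy of the entry is committed (and applied).
type ReadLeaseConfig struct {
	Nodes []uint64 // node IDs holding a read lease; may be empty
}

// leaderState tracks, on a freshly elected leader, which members of the
// previously active ReadLeaseConfig have voted for it or have been forcibly
// removed (see the revocation mechanism discussed below).
type leaderState struct {
	activeConfig    ReadLeaseConfig
	votedForUs      map[uint64]bool
	forciblyRemoved map[uint64]bool
}

// canCommit reports whether the new leader may advance the commit index:
// every lease holder from the active config must either have voted for us or
// have been forcibly removed.
func (s *leaderState) canCommit() bool {
	for _, id := range s.activeConfig.Nodes {
		if !s.votedForUs[id] && !s.forciblyRemoved[id] {
			return false
		}
	}
	return true
}
```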
That sounds almost too good to be true, and of course it is. As long as the nodes from the active `ReadLeaseConfig` are responsive, everything is fine. But what if one of those nodes dies? Without a further ingredient, that will simply deadlock Raft since no more progress can be made and no elections can succeed.
For that case, we need forced lease revocation.
This section is where timing makes a comeback, albeit in a weaker form. Basically the idea is to exclude the offending (i.e. non-responding) node from receiving heartbeats, which within some maximal amount of time must have the offender turn into a candidate.
We know that we need to take action when a lease-holding node stops responding to the leader's heartbeats, or when it stops making progress on its log. Usually both will be true simultaneously (but there are certain contrived scenarios in which it's different, hence the fine distinction).
If either of the prerequisites for this section holds, we want the leader to exclude the offending follower from receiving any more heartbeats for a limited number of Raft ticks.
This is incompatible with our current implementation of coalesced heartbeats, but I have a change in mind that will make this possible - I'll skip those details here for brevity and write the presentation for single-node Raft.
- When a node that considers itself leader does not receive a `MsgHeartbeatResp` or (successful) `MsgAppResp` from a majority of its (supposed) followers in due time, it immediately suspends the local read lease (if any) until the majority is back.
- When a node that considers itself leader does not receive a `MsgHeartbeatResp` or `MsgAppResp` from a node within a few consecutive `MsgHeartbeat`s it sends to that node, mark that node for forced removal. That means that the node will skip a fixed number of heartbeats, large enough so that the offender will have to know something is wrong, and that it will be removed from `ReadLeaseConfig` when the next heartbeat is sent out to that node. Once that has happened, the leader may proceed to send that new `ReadLeaseConfig` out into the cluster (injected before all other pending commands). The effect of this is to invalidate the offending node's lease by making sure it isn't heartbeated (if it thinks it's a leader, then it'll suspend its lease by means of not having had regular contact with a majority; see above).
- Similarly to the last point, mark the node for forced removal if it doesn't make progress (with a suitable criterion). Likely a per-peer tick counter on the leader, which ticks along as long as the node is behind the leader and resets for each `MsgAppResp`, is appropriate. Having something like that would be nice anyway, to have a notion of a follower being active. In practice, this should be exceedingly rare: the node would have to reply to heartbeats, but not make progress. It's possibly a little less rare with MultiRaft though, for example if one of the Raft groups deadlocks and coalesced heartbeats continue. (A sketch of this per-peer bookkeeping follows below.)
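A minimal Go sketch of that per-peer bookkeeping, assuming hypothetical names (`revocationTracker`, `maxMissedHeartbeats`) and single-node (non-coalesced) heartbeats; not the actual etcd/raft or MultiRaft code.

```go
package lease

// peerActivity tracks how responsive a single follower has been.
type peerActivity struct {
	missedHeartbeats int // heartbeats sent without a MsgHeartbeatResp since the last response
	stalledTicks     int // ticks during which the peer lagged the leader's log
}

type revocationTracker struct {
	peers               map[uint64]*peerActivity
	maxMissedHeartbeats int
	maxStalledTicks     int
}

func newRevocationTracker(maxMissedHeartbeats, maxStalledTicks int) *revocationTracker {
	return &revocationTracker{
		peers:               map[uint64]*peerActivity{},
		maxMissedHeartbeats: maxMissedHeartbeats,
		maxStalledTicks:     maxStalledTicks,
	}
}

func (r *revocationTracker) peer(id uint64) *peerActivity {
	p, ok := r.peers[id]
	if !ok {
		p = &peerActivity{}
		r.peers[id] = p
	}
	return p
}

// onHeartbeatSent is called whenever a heartbeat goes out to a peer; once the
// peer has missed too many responses it gets marked for forced removal from
// the ReadLeaseConfig (and is then skipped for a number of heartbeats).
func (r *revocationTracker) onHeartbeatSent(id uint64) (forceRemove bool) {
	p := r.peer(id)
	p.missedHeartbeats++
	return p.missedHeartbeats > r.maxMissedHeartbeats
}

// onHeartbeatResp resets the miss counter for a responsive peer.
func (r *revocationTracker) onHeartbeatResp(id uint64) {
	r.peer(id).missedHeartbeats = 0
}

// onTick advances the stall counter for a peer that is behind the leader's
// log; a successful MsgAppResp resets it via onAppendResp.
func (r *revocationTracker) onTick(id uint64, behind bool) (forceRemove bool) {
	p := r.peer(id)
	if behind {
		p.stalledTicks++
	}
	return p.stalledTicks > r.maxStalledTicks
}

func (r *revocationTracker) onAppendResp(id uint64) {
	r.peer(id).stalledTicks = 0
}
```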
If a new leader steps up (this is likely in the scenario of the old leader dying), it may not commit a command until it has received votes from the `ReadLeaseConfig` active before it stepped up (but it will heartbeat clients and be, from the view of others, a normal legitimate leader). It may, however, send append requests, but without advancing the `Commit` index. The mechanism for forcible removal will be in force, so once the old leader misses a couple of heartbeats or does not respond to `MsgApp`, it will be forcibly removed. The new leader may then, as its first course of action, append a new `ReadLeaseConfig` to its followers.
This is fairly elegant because as we time-out the dead nodes with leases, Raft "almost" makes progress (except that it does not apply to the state machine). That is way better than having everything go into lockdown.
- Implementation could and should be done very cleanly in `etcd/raft`. But we will have to figure out everything related to coalesced heartbeats very carefully in advance and provide the right configurability.
- Skipping the heartbeats (as part of forced revocations) can't really be communicated to MultiRaft unless we put things into Raft that look weird, unfortunately. Likely we'll have some code in MultiRaft emulating this plus the code in `etcd/raft` doing the same thing (unless we find a good way to get that information out of `etcd/raft`).
- Coalesced heartbeats should be sent more often than the underlying real heartbeats. This is to counteract timing issues which can arise in the above algorithms due to the fact that group-sent heartbeats and responses are discarded, and hence some time may pass between when the Raft group thinks its heartbeat went out and when it really happened. `etcd/raft` is lax on this point because it didn't matter, but it becomes important here to avoid a stale leader interfering with a forced lease revocation.
- Membership changes need some more thinking, but there should be ways to handle them (trivially, we could enforce a complete revocation of all leases before adding a node and only put leases back when the transition is over, but there's probably a much better way).
For us, this comes free with coalesced heartbeats. But upstream needs to be more aware of it than it is now.
Using the latest committed (really, latest applied) ReadLeaseConfig is problematic for the same reasons as for config changes (discussed on page 36 of thesis.pdf): there is no acknowledgement to the leader when a node considers an entry committed or applied. I think to be conservative we should process ReadLeaseConfig at both append and apply time: leases are revoked as soon as the entry is appended (so the MsgAppResp serves as confirmation that the lease has been revoked) and new leases are added in a pending state when the entry is appended and become final when it is applied. Pending leases are used for the leader safety criterion (a new leader cannot commit until it has heard from or revoked all nodes with pending leases), but a node cannot serve reads until the entry that granted it the lease is committed and applied.
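A rough Go sketch of that two-phase handling, using hypothetical hooks (`onAppend`, `onApply`) and per-node lease states; this is an illustration of the scheme described above, not an actual implementation.

```go
package lease

type leaseState int

const (
	leaseNone    leaseState = iota
	leasePending            // ReadLeaseConfig entry appended, not yet applied
	leaseActive             // entry committed and applied; local reads allowed
)

// nodeLeases tracks the lease state of each node as ReadLeaseConfig entries
// move through the log.
type nodeLeases struct {
	states map[uint64]leaseState
}

func newNodeLeases() *nodeLeases {
	return &nodeLeases{states: map[uint64]leaseState{}}
}

// onAppend runs when a ReadLeaseConfig entry listing the granted nodes is
// appended to the local log: nodes dropped from the config lose their lease
// immediately (so the MsgAppResp doubles as the revocation confirmation),
// while newly added nodes only become pending.
func (n *nodeLeases) onAppend(granted []uint64) {
	inConfig := map[uint64]bool{}
	for _, id := range granted {
		inConfig[id] = true
	}
	for id := range n.states {
		if !inConfig[id] {
			n.states[id] = leaseNone // revoked at append time
		}
	}
	for id := range inConfig {
		if n.states[id] == leaseNone {
			n.states[id] = leasePending
		}
	}
}

// onApply runs once the same entry is committed and applied; pending grants
// become usable for local reads.
func (n *nodeLeases) onApply(granted []uint64) {
	for _, id := range granted {
		n.states[id] = leaseActive
	}
}

// mustWaitFor is the leader safety criterion: a new leader cannot commit
// until it has heard from (or forcibly revoked) every node whose lease is
// pending or active.
func (n *nodeLeases) mustWaitFor(id uint64) bool {
	return n.states[id] != leaseNone
}
```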
Instead of stopping delivery of MsgHeartbeat, we could add a RevokeLease flag to MsgHeartbeat (and the response). This might be easier to route through multiraft, since you'd have an explicit indicator of what's going on, instead of the absence of a message.
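For illustration only, the flag could look roughly like this; these structs are hypothetical stand-ins, not the real raftpb message definitions.

```go
package lease

// heartbeat mirrors just the fields relevant to the idea above.
type heartbeat struct {
	From, To, Term uint64
	RevokeLease    bool // leader asks the recipient to stop serving local reads
}

// heartbeatResp acknowledges the revocation explicitly, which is easier to
// route through MultiRaft than the absence of a heartbeat.
type heartbeatResp struct {
	From, To, Term uint64
	LeaseRevoked   bool
}
```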
Leader leases have been implemented as of #604
We need to introduce a concept of leader leases so a leader can take read-only actions without writing them to the raft log. This means