cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Multiraft: leader leases #230

Closed · bdarnell closed this 9 years ago

bdarnell commented 9 years ago

We need to introduce a concept of leader leases so a leader can take read-only actions without writing them to the raft log. This means

xiang90 commented 9 years ago

Have you looked at http://www.pdl.cmu.edu/PDL-FTP/associated/CMU-PDL-14-105.pdf?

cockroach-team commented 9 years ago

Wow, this is excellent. I've been lying awake at night worrying about the shortcomings of leader leases as I'd imagined they had to work.

On Wed, Dec 17, 2014 at 7:28 PM, Xiang Li notifications@github.com wrote:

Have you looked at http://www.pdl.cmu.edu/PDL-FTP/associated/CMU-PDL-14-105.pdf?


bdarnell commented 9 years ago

@xiang90 I hadn't seen that; thanks for the reference. It looks like a great fit for our needs.

(A reminder for whoever replied by email: your name doesn't come through on GitHub when you do that, so you may want to post directly on GitHub to be identified.)

spencerkimball commented 9 years ago

Sorry, that was me.

Are there any plans to implement the performance optimization on writes mentioned in the paper?

Because a major focus for all leasing strategies is to reduce latency in the wide area, we implemented as
part of our baseline a latency optimization described by Castro [5]. This optimization reduces the commit
latency perceived by clients that are co-located with a replica other than the Multi-Paxos stable leader, by
having other replicas transmit their AcceptReplies to both the stable leader and to the replica near the
client. Thus, in the common case, the client does not need to wait the additional time for a message to
come back from the stable leader, which reduces commit time from four one-way delays to three.

Citation is: [5] Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems, 20(4):398–461, November 2002.

xiang90 commented 9 years ago

@bdarnell @spencerkimball We do not plan to implement the read lease in etcd itself right now. But if you want to design/implement it, I would be happy to help.

tbg commented 9 years ago

I'm taking a look at this now. How far should we take what is outlined in the paper?

It looks like we want leader leases only, so we would not need the full flexibility of granting leases to multiple replicas (and don't need to keep as much 'lease state' as they do). The one aspect of the paper that I don't see in Ben's initial post is the 'guard' part of the lease-granting message (which avoids losing time or deadlocking when a node bails while the lease is being communicated). If we have a relatively low lease duration and refresh the leases automatically on heartbeats (plus any other Raft traffic that acknowledges the leader), the guard overhead might not be worth it for renewals, but we'll definitely need to implement the guarded promises for the initial grant.

Some more questions/remarks:

bdarnell commented 9 years ago

We will eventually want quorum leases so we can move read load to the followers, although we can start with just leader leases. I think this eventually belongs in upstream raft, although in the short term we can probably move faster by doing the work in multiraft or in our fork of etcd to start.

Any attempt to bias leader selection towards the nodes that have the most traffic should definitely be deferred to a separate issue/PR.

xiang90 commented 9 years ago

I think this eventually belongs in upstream raft

One difficulty of the quorum lease is that the original algorithm needs to be object-aware.

For the leader lease, we can simply buffer the proposals for longer than max(electionTimeout), so that the old leader can step down.

I think we might put the step-down logic into raft itself.
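
A minimal sketch (with made-up names like `newLeader` and `maxElectionTimeout`) of what buffering proposals past the maximum election timeout could look like; this is only an illustration of the idea, not proposed raft code:

```go
// Sketch: a freshly elected leader holds incoming proposals for longer than
// the maximum election timeout, giving a deposed leader time to step down and
// stop serving lease-based reads. Names here are illustrative.
package main

import (
	"fmt"
	"time"
)

type proposal struct{ data []byte }

type newLeader struct {
	electedAt          time.Time
	maxElectionTimeout time.Duration
	buffered           []proposal
}

// propose buffers proposals until the old leader's lease must have lapsed,
// then (in a real implementation) would forward them to raft.
func (l *newLeader) propose(now time.Time, p proposal) (flushed []proposal) {
	l.buffered = append(l.buffered, p)
	if now.Sub(l.electedAt) > l.maxElectionTimeout {
		flushed, l.buffered = l.buffered, nil
	}
	return flushed
}

func main() {
	l := &newLeader{electedAt: time.Now(), maxElectionTimeout: time.Second}
	fmt.Println(l.propose(time.Now(), proposal{[]byte("held")}))                  // still buffered
	fmt.Println(l.propose(time.Now().Add(2*time.Second), proposal{[]byte("ok")})) // flushed
}
```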

xiang90 commented 9 years ago

@bdarnell Also, if we can relax the latency requirement (xx ms vs x ms), it would not be hard for the followers to serve reads (improving throughput) while still ensuring linearizability, IMO.

tbg commented 9 years ago

@xiang90 which latency requirement do you mean? From what I can see, leader leases can piggyback on existing traffic to some extent, but once you want a follower lease, that follower will need its own lease granted by a majority of the cluster, so you're looking at something like elections for each follower that gets a lease.

Object awareness is something that we likely don't need here (given that ranges are supposed to represent locality of the data already) but it could also be implemented in etcd/raft by providing an interface that maps a read message to, say, uint64 which is then interpreted as object type.

xiang90 commented 9 years ago

@tschottdorf Let us step back a little bit. Why do we want the lease? What kind of guarantees do we want to have?

tbg commented 9 years ago

We just want what's described in the paper you posted: There should be a set of nodes which can read locally without having to propose those reads through Raft. The natural choice for the number N of nodes to which that read lease should be given is the size of a quorum (so 3 for a 5-replica Raft, 2 for 3-replica) - more is possible but slows down Raft. Leader leases correspond to the case N=1. Each non-leader needs to talk to a majority of the nodes and obtain a promise from them.
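
As a tiny illustration of that quorum arithmetic (purely for reference):

```go
// quorum returns floor(N/2)+1: 2 leases for a 3-replica group, 3 for a
// 5-replica group; the leader-lease case corresponds to a single lease.
package main

import "fmt"

func quorum(replicas int) int { return replicas/2 + 1 }

func main() {
	fmt.Println(quorum(3), quorum(5)) // 2 3
}
```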

I was asking because I just couldn't figure out what you meant in your last message, but it seemed to suggest that non-leader leases are fairly trivial, which I think isn't true - they're a bit trickier than leader-only leases, albeit not by that much. But, again, I might've completely missed your point.

xiang90 commented 9 years ago

@tschottdorf

There are several ways to provide what you (based on my reading of the cockroach design docs) want. Leasing is just one of them.

To make it clear: 1) we want Client A, which sends a read request on Object 1 at T1, to get the most up-to-date result at T2.

T1 is the realtime at which the client sent out the request. T2 is the realtime at which the client received the result.

Most up-to-date means the result is at least as new as the state of the object at T1.

Together with the sequential consistency that raft (or any consistent replicated log system) already provides, we can get linearizability (no reordering of invocations and responses).

Leasing is one way to provide 1).

For the leader lease, it is obvious: the leader holding the lease can reply to read requests without waiting for any network I/O. For the quorum lease, any server in the quorum holding the lease can reply to read requests without waiting for any network I/O. But when you want to update an object, you might have to revoke the lease, and a server whose lease has been revoked needs to wait for a new lease.

There is another, simpler way to solve this if we can tolerate extra latency on non-leader servers. Put simply: if a server has heard from the leader about an event that happened after T1, it can reply to all read requests at or before T1. This also ensures 1).
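
A rough sketch of that alternative, assuming a hypothetical `appliedLeaderTS` watermark that a follower maintains from leader-originated entries; not actual cockroach/etcd code:

```go
// Sketch: a follower records the timestamp of the newest leader event it has
// applied, and serves a read at timestamp t locally only once that watermark
// has passed t. Otherwise the read must wait or be forwarded to the leader.
package main

import "fmt"

type follower struct {
	appliedLeaderTS int64 // newest leader event this follower has applied
	store           map[string]string
}

// read returns the value for key if the follower is provably caught up past readTS.
func (f *follower) read(key string, readTS int64) (string, bool) {
	if f.appliedLeaderTS <= readTS {
		return "", false // not yet safe to answer locally
	}
	return f.store[key], true
}

func main() {
	f := &follower{appliedLeaderTS: 100, store: map[string]string{"a": "1"}}
	fmt.Println(f.read("a", 50))  // safe: a leader event at 100 > 50 has been seen
	fmt.Println(f.read("a", 150)) // not safe yet
}
```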

tbg commented 9 years ago

Ah, now I get where you're coming from. We have different kinds of reads. What I was thinking about mostly here are a) reads carried out within a recently started transaction and b) reads without an explicit read timestamp (in which case they read the "current" value).

The simplest case is when you just want to read a bunch of values at a fixed, past timestamp (something like Spanner's snapshot read). If the timestamp is far enough in the past, any node should just be able to pass out those values (as long as it has committed log entries well past the desired read timestamp). So that part doesn't need leader leases, but it's also not for the operations that we want #230 for (at least that's how I understood it) - it's rather something else that we could consider implementing, as it's easy to do and could significantly boost performance for some deployments. If you want stale reads, just read slightly in the past.

Generally though, I think that the majority of reads will probably be a) without an explicit timestamp and b) in transactions which don't run long and should be fast (a) and b) are kind of the same thing). Neither of those would want to sit around waiting in order to profit from snapshot reads, and (I think) we need consensus-level support for them.

Additionally there's the complication of the read-timestamp cache which we haven't thought about in this context yet. Basically a transactional write will return the timestamp of when the key was last read, which is hard to maintain if multiple followers can hand out values without consulting with the leader, no matter the underlying concept (the read-timestamp cache is kept on the leader). That sounds like an additional obstacle to follower leases which I could imagine might kill the idea altogether (?).

It's good that you're bringing this up, as clearly there's a need for clarification and debate.

spencerkimball commented 9 years ago

@tschottdorf I believe leader leases will work fine for the case of your third paragraph (reads at current time or in txns). In practice anyway; adversarial scenarios could easily be invented where a clock could be changed to invalidate the leader lease guarantee. Pushing these reads into consensus would be pretty awful from a latency perspective.

@tschottdorf yes, the timestamp cache complicates matters, but maybe it doesn't kill the idea of using quorum leases. One facet of quorum leases is that they require consensus on writes to be reached on the same set of replicas as reads. This means that you'd send the write to any replica which serviced a [potentially] conflicting read, so a replica with a conflicting read could refuse consensus. There's still the problem of communicating the conflict back to the reader so that it could update its read timestamp cache and retry. Yuck. But still, I think the idea isn't dead.

@xiang90 Unfortunately, we can't quite say that a read is legal at time T in the event that we've heard from the leader at time > T. The problem is that transactional writes may arrive and modify keys at timestamps in the past (@ transaction's commit timestamp). The leader replica for a range prevents writes from "rewriting" history by keeping track of recent timestamps at which keys were read to push a conflicting transaction's write forward in time.
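
For readers following along, here is a rough sketch of the read-timestamp-cache behavior described here, with made-up names (`tsCache`, `pushWrite`); it is only illustrative, not the actual cockroach implementation:

```go
// Sketch: the leader remembers the latest timestamp at which each key was
// read, and a write whose timestamp is at or below that watermark is pushed
// forward so it cannot rewrite observed history.
package main

import "fmt"

type tsCache map[string]int64

// noteRead records that key was read at ts.
func (c tsCache) noteRead(key string, ts int64) {
	if ts > c[key] {
		c[key] = ts
	}
}

// pushWrite returns the timestamp at which a write to key may proceed:
// strictly after the latest read of that key.
func (c tsCache) pushWrite(key string, ts int64) int64 {
	if last, ok := c[key]; ok && ts <= last {
		return last + 1
	}
	return ts
}

func main() {
	c := tsCache{}
	c.noteRead("a", 100)
	fmt.Println(c.pushWrite("a", 90))  // pushed to 101
	fmt.Println(c.pushWrite("a", 200)) // unaffected
}
```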

xiang90 commented 9 years ago

@spencerkimball Right. I was not aware of the potential "history rewrite". This actually complicates the problem a little bit, since the actual command ordering can differ from the user-facing ordering.

The leasing approach should still work, since it does not depend on the internal command ordering.

tbg commented 9 years ago

@spencerkimball: yes, that's all up my alley; my point in that third paragraph was that you require leases for those reads (as opposed to serving stale, past-and-thus-safe values, which doesn't need any Raft involvement).

@spencerkimball: Hmm, you're right, that would work, even though it would slow down writes by at least a factor of two when their keys are busy - or worse: if a follower sees a lot of reads, the write may just never succeed. So we need an additional mechanism that gives writes precedence over reads. One okay choice is to push further reads for a key that caused a restart through Raft until the corresponding write succeeds; another is to stop reading locally until the current lease expires (a pretty short time, but it makes sure that write bursts don't trickle through with a retry each). Or you could get fancy and keep read/write-ratio stats and decide on the fly: almost no writes? Just serve reads whenever you can. Many reads and writes? Probably want to make sure those writes go through. Luckily most of that logic will be outside of Raft, with the appropriate hooks.

The implementation in upstream raft would have to be done very carefully, at the very least we'd need a hook that goes into stepFollower and allows refusal of a proposition (with a custom message) that will be sent to the leader, plus a corresponding hook in stepLeader that takes action when it receives a refusal (before retrying the proposal).
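
To make the shape of those hooks concrete, here is a hypothetical sketch; neither `FollowerHooks` nor `LeaderHooks` exists in etcd/raft, and the refusal criterion shown is just a placeholder:

```go
// Sketch: a follower-side callback that may refuse an appended proposal, and
// a leader-side callback invoked on refusal before the proposal is retried.
package main

import "fmt"

type entry struct {
	Index uint64
	Data  []byte
}

// FollowerHooks would be consulted from stepFollower before acknowledging an
// append; returning (false, reason) sends a refusal back to the leader.
type FollowerHooks interface {
	AcceptAppend(e entry) (ok bool, reason []byte)
}

// LeaderHooks would be consulted from stepLeader when such a refusal arrives,
// before the proposal is re-proposed (possibly with an adjusted timestamp).
type LeaderHooks interface {
	OnRefusal(e entry, reason []byte)
}

type tsCacheFollower struct{ latestReadTS int64 }

func (f *tsCacheFollower) AcceptAppend(e entry) (bool, []byte) {
	// Placeholder criterion: refuse writes that conflict with a newer local read.
	return false, []byte(fmt.Sprintf("read at %d is newer", f.latestReadTS))
}

func main() {
	var h FollowerHooks = &tsCacheFollower{latestReadTS: 42}
	ok, reason := h.AcceptAppend(entry{Index: 7})
	fmt.Println(ok, string(reason))
}
```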

Regarding stale/snapshot reads ("read whatever you find"/"read consistently at the given snapshot"), is that something we want? If so, let's put it in an extra issue.

tbg commented 9 years ago

Here's a stab at lightweight leader leases that piggyback completely on existing messages. When designing this I was mostly concerned about the messaging overhead when there are many Raft groups and I hope that the proposal below strikes a good balance -- see it as a working draft right now; I just wrote it down. Basically the idea is that

Here's the suggested implementation, please scrutinize thoroughly.

Note that

I'm hoping there's nothing completely wrong in the above outline as that would make for a very performant implementation with no extra traffic.

As for follower leases, my main pet peeve is that there would be a lot of regular traffic between a majority and each lease follower node. For example, in the case of a Raft group of size 5, there would be three read leases minus one leader, leaving two nodes which regularly have to talk to two (three minus the leader) other nodes on top of the regular Raft chatter. That's like holding elections approximately every heartbeat. That times millions of groups is not something we want to do and the only way out, I think, is TrueTime. I've mentioned this a couple of times but nobody has really commented on it.

Happy to be convinced otherwise, but in the meantime I would make the case that since we're moving data around with high granularity on a per Range basis, each logical node will hold many groups and only lead a third of them. That should spread read load fairly evenly except for isolated hotspots, in which case a) the range should split there based on load and/or b) we should offer stale reads.

Of course it would be great to be able to have clients far away from the leader read from their local replica, but unless there's a sleek way I'm not seeing this isn't something we can do efficiently for large multi-datacenter deployments with a lot of data (and hence Raft groups), which I suppose are the ones that would want that feature most.

That, plus fairly involved changes in Raft, plus the fact that Cockroach and Raft would have to be very tightly coupled (followers would have to pause Raft, go through all acknowledged-but-uncommitted log entries, inspect them and make sure that there are no updates to the key that is being read on a lease; plus the read-timestamp cache issue above) - I suspect we'll be better off efficiently implementing leader leases for now (note that the design above could be extended to quorum leases later).

spencerkimball commented 9 years ago

These comments apply to your comments two posts back...

@tschottdorf yes, we want a separate issue for inconsistent writes. We actually want to do those as a matter of course when looking up meta[12] indexing entries.

@tschottdorf: having the follower exit the quorum lease in the event of a write / read-timestamp-cache conflict would be too coarse-grained. I would favor something simpler: on a leader write, push the transaction forward to current time + a reasonable delta. The "reasonable delta" would be enough to cover the likely latency of pushing the newly-advanced write to the consensus group.

tbg commented 9 years ago

@spencerkimball you mean inconsistent reads, right?

@spencerkimball can you elaborate? I'm assuming you'll still have the feedback from the node, so you try to append something and get an error which tells you that the follower's timestamp cache has a newer entry. Now what? You retry the write with that higher timestamp plus reasonable delta? That would mean writing into the future deliberately, hence causing restarts for the reads that were blocking the write in the first place. If that's not the intention, then what's the delta for?

spencerkimball commented 9 years ago

I think I may have misunderstood your proposal. We might need to do a hangout to sort it out if so.

I had a different concept for leader leases in mind. Leadership itself is the result of a quorum. The interval for which a leader holds its lease can always be measured from the time at which leadership was proposed until a quorum of followers reaches heartbeat expiration.

The leader holds the read lease for the range so long as its wall time falls within the lease expiration, and can serve reads locally when this is true, without needing to resort to a Raft consensus.
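
A minimal sketch of that lease check, assuming the lease window simply runs from the leadership proposal time for one heartbeat-expiration interval (names like `rangeLease` are made up):

```go
// Sketch: the leader may serve local reads while its wall clock is still
// inside the lease window, i.e. before a quorum of followers could have
// reached heartbeat expiration and elected someone else.
package main

import (
	"fmt"
	"time"
)

type rangeLease struct {
	proposedAt       time.Time     // when leadership was proposed
	heartbeatTimeout time.Duration // follower heartbeat-expiration interval
}

// canServeLocalRead reports whether the leader's wall time still falls within
// the lease expiration.
func (l rangeLease) canServeLocalRead(now time.Time) bool {
	return now.Before(l.proposedAt.Add(l.heartbeatTimeout))
}

func main() {
	l := rangeLease{proposedAt: time.Now(), heartbeatTimeout: time.Second}
	fmt.Println(l.canServeLocalRead(time.Now()))
	fmt.Println(l.canServeLocalRead(time.Now().Add(2 * time.Second)))
}
```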

My current opinion is that for V1, reading from leader will work just fine. More efficient quorum leases are something we'll definitely need to figure out for next version though.

tbg commented 9 years ago

My proposal is exactly the same concept, but taking into account the fact that you must never rely on coalesced heartbeats for anything. The leader's original heartbeats are simply dropped, and heartbeats received by followers mean nothing except that the last known leader had been alive at some point in the past. So unless you want to introduce new heartbeats which cannot be coalesced, and hence invalidate a basic motivation for multiraft, I think you will always end up close to my suggestion, which is to use MsgApp and its response to set the lease intervals.
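
One way this could look, as a rough sketch (names such as `appLease` and `contribution` are made up, and the exact bookkeeping is an assumption): the leader stamps each MsgApp with its send time, and a follower's MsgAppResp extends that follower's contribution to sendTime + electionTimeout, since the follower reset its election timer no earlier than the send time.

```go
// Sketch: the lease holds as long as a quorum of replicas (counting the
// leader itself) is still known not to start an election.
package main

import (
	"fmt"
	"sort"
	"time"
)

type appLease struct {
	electionTimeout time.Duration
	sentAt          map[uint64]time.Time // MsgApp send times per follower
	contribution    map[uint64]time.Time // per-follower "won't elect before" bound
}

func (l *appLease) onSendMsgApp(to uint64, now time.Time) { l.sentAt[to] = now }

func (l *appLease) onMsgAppResp(from uint64) {
	if sent, ok := l.sentAt[from]; ok {
		l.contribution[from] = sent.Add(l.electionTimeout)
	}
}

// expiration returns the time up to which the leader may serve local reads.
func (l *appLease) expiration(now time.Time, total int) time.Time {
	bounds := []time.Time{now.Add(l.electionTimeout)} // the leader itself (conservative)
	for _, t := range l.contribution {
		bounds = append(bounds, t)
	}
	sort.Slice(bounds, func(i, j int) bool { return bounds[i].After(bounds[j]) })
	quorum := total/2 + 1
	if len(bounds) < quorum {
		return time.Time{}
	}
	return bounds[quorum-1] // quorum-th most recent bound limits the lease
}

func main() {
	now := time.Now()
	l := &appLease{
		electionTimeout: time.Second,
		sentAt:          map[uint64]time.Time{},
		contribution:    map[uint64]time.Time{},
	}
	l.onSendMsgApp(2, now)
	l.onMsgAppResp(2)
	fmt.Println(l.expiration(now, 3).After(now)) // true: quorum {leader, 2} is bound
}
```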

It all depends on the traffic issue I've mentioned in almost every posting this week. That's the enemy and needs to be discussed more than anything else.

I'm in the car for the rest of the night, but let's HO tomorrow or Monday.

tbg commented 9 years ago

Timeless read leases

Implementations of quorum leases in which certain nodes are allowed to serve reads locally based on time intervals are tricky since they require a lot of communication (at least once you move away from pure leader leases). I think we can eliminate the reference to clock time completely from the core logic and replace it by constraints on the leadership transition.

What we get is pretty close to the holy grail:

_Disclaimer: text below is the result of not getting much sleep last night._

In a nutshell, the basic idea is the following:

Granting leases

That sounds almost too good to be true, and of course it is. As long as the nodes from the active ReadLeaseConfig are responsive, everything is fine. But what if one of those nodes dies?

Without a further ingredient, that will simply deadlock Raft since no more progress can be made and no elections can succeed.

For that case, we need

Forced revocation of leases

This section is where timing makes a comeback, albeit in a weaker form. Basically, the idea is to exclude the offending (i.e. non-responding) node from receiving heartbeats, which, within some maximal amount of time, must turn the offender into a candidate.

We know that we need to take action when...

A follower stops making progress or stops responding to heartbeats

Usually both will be true simultaneously (but there are certain contrived scenarios in which it's different, hence the fine distinction).

If either of the prerequisites for this section hold, we want the leader to exclude the offending follower from receiving any more heartbeats for a limited amount of Raft ticks.

This is incompatible with our current implementation of coalesced heartbeats, but I have a change in mind that will make this possible - I'll skip those details here for brevity and write the presentation for single-node Raft.
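
A rough sketch of the suppression bookkeeping on the leader side (the names and tick accounting are made up; the coalesced-heartbeat routing is omitted as noted above):

```go
// Sketch: the leader withholds heartbeats from a non-responsive lease holder
// for a bounded number of ticks, after which that node must have become a
// candidate and thereby lost its lease.
package main

import "fmt"

type leader struct {
	suppressedUntilTick map[uint64]int // follower ID -> tick until which heartbeats are withheld
	tick                int
}

// revoke stops heartbeats to id for the given number of ticks.
func (l *leader) revoke(id uint64, ticks int) {
	l.suppressedUntilTick[id] = l.tick + ticks
}

// shouldHeartbeat would be consulted when sending (or coalescing) heartbeats.
func (l *leader) shouldHeartbeat(id uint64) bool {
	return l.tick >= l.suppressedUntilTick[id]
}

func main() {
	l := &leader{suppressedUntilTick: map[uint64]int{}}
	l.revoke(2, 10)                   // follower 2 stopped responding
	fmt.Println(l.shouldHeartbeat(2)) // false while suppressed
	l.tick = 10
	fmt.Println(l.shouldHeartbeat(2)) // true again
}
```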

Caveats and "out in the weeds"-type considerations

bdarnell commented 9 years ago

Using the latest committed (really, latest applied) ReadLeaseConfig is problematic for the same reasons as for config changes (discussed on page 36 of thesis.pdf): there is no acknowledgement to the leader when a node considers an entry committed or applied. I think to be conservative we should process ReadLeaseConfig at both append and apply time: leases are revoked as soon as the entry is appended (so the MsgAppResp serves as confirmation that the lease has been revoked) and new leases are added in a pending state when the entry is appended and become final when it is applied. Pending leases are used for the leader safety criterion (a new leader cannot commit until it has heard from or revoked all nodes with pending leases), but a node cannot serve reads until the entry that granted it the lease is committed and applied.
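
A rough sketch of that append/apply split, with illustrative types (`leaseTracker`, `leaseState`); it only shows the state transitions, not the raft plumbing:

```go
// Sketch: a lease named in a ReadLeaseConfig entry is revoked as soon as the
// entry is appended, becomes pending at append time, and only turns active
// (able to serve reads) once the entry is committed and applied.
package main

import "fmt"

type leaseState int

const (
	leaseNone    leaseState = iota
	leasePending            // counted for leader safety, cannot serve reads
	leaseActive             // may serve local reads
)

type leaseTracker struct {
	states map[uint64]leaseState // node ID -> lease state
}

// onAppend processes a ReadLeaseConfig entry when it is appended to the log.
func (t *leaseTracker) onAppend(granted, revoked []uint64) {
	for _, id := range revoked {
		t.states[id] = leaseNone // the MsgAppResp acknowledges the revocation
	}
	for _, id := range granted {
		t.states[id] = leasePending
	}
}

// onApply promotes pending leases once the same entry is committed and applied.
func (t *leaseTracker) onApply(granted []uint64) {
	for _, id := range granted {
		if t.states[id] == leasePending {
			t.states[id] = leaseActive
		}
	}
}

func main() {
	t := &leaseTracker{states: map[uint64]leaseState{}}
	t.onAppend([]uint64{2}, []uint64{3})
	fmt.Println(t.states[2] == leasePending, t.states[3] == leaseNone)
	t.onApply([]uint64{2})
	fmt.Println(t.states[2] == leaseActive)
}
```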

Instead of stopping delivery of MsgHeartbeat, we could add a RevokeLease flag to MsgHeartbeat (and the response). This might be easier to route through multiraft, since you'd have an explicit indicator of what's going on, instead of the absence of a message.
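
A sketch of that alternative, assuming hypothetical `RevokeLease` / `LeaseRevoked` fields that do not exist in etcd/raft today:

```go
// Sketch: carry revocation explicitly on the heartbeat and its response so
// multiraft can route it, instead of inferring it from a missing message.
package main

import "fmt"

type msgHeartbeat struct {
	From, To    uint64
	RevokeLease bool
}

type msgHeartbeatResp struct {
	From, To     uint64
	LeaseRevoked bool
}

// handleHeartbeat is the follower side: drop the local lease when asked and
// confirm the revocation in the response.
func handleHeartbeat(m msgHeartbeat, holdingLease *bool) msgHeartbeatResp {
	if m.RevokeLease {
		*holdingLease = false
	}
	return msgHeartbeatResp{From: m.To, To: m.From, LeaseRevoked: m.RevokeLease}
}

func main() {
	holding := true
	resp := handleHeartbeat(msgHeartbeat{From: 1, To: 2, RevokeLease: true}, &holding)
	fmt.Println(holding, resp.LeaseRevoked) // false true
}
```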

spencerkimball commented 9 years ago

Leader leases have been implemented as of #604