Leader leases - GitHub issues

spencerkimball commented 9 years ago

Rename Range.IsLeader() to Range.HaveLeaderLease(). This method returns whether or not the replica has the leader lease. Only replicas which have the leader lease may respond to "CONSISTENT" read requests or satisfy read-write requests.
Any read-write request (e.g. Put, ConditionalPut, Delete, DeleteRange) must go through Range.addReadWriteCmd(). This method properly uses the read timestamp cache to adjust timestamps or restart transactions as necessary.
Raft forwarding needs to be disabled or else forwarded entries must be pushed through Range.addReadWriteCmd() at the replica for which HaveLeaderLease() is true.
All entries in terms before the current leader's new term must be applied before the HaveLeaderLease() method returns true.
Propose that the blank entry which comes first in a new leader's term be changed to a lease entry which serves the same purpose, but additionally provides a lease interval, during which any accepting replica must not vote in a new election. This will require a change similar to the one used to support config changes. It will also require a hook inside raft which allows multiraft to decide whether or not to vote in an election according to any leases it has agreed to.

spencerkimball commented 9 years ago

@tschottdorf

spencerkimball commented 9 years ago

Tobias, care to consolidate the other lease-related issues here?

tbg commented 9 years ago

Yep, will do.

tbg commented 9 years ago

First steps for the work to be done in the dist_sender:

give it access to the node's descriptor. Seems like it would be easiest if it got that from Gossip (we certainly want nodes to gossip their attributes at some point). Should I change the 'node-xxx' key to hold a storage.NodeDescriptor instead of a net.Addr or do we generally want to keep Gossip information bits human readable (which means an extra key, probably renaming node-* to node-id* and adding node-attributes*)?
figure out attribute semantics: When looking for a "close" replica (for inconsistent reads), you'd want to pick one from the same datacenter. Since we don't have any fixed attributes, and order is determined by string sorting, the only option you have for picking a replica that looks most like you is to take the one with the maximum number of overlapping attributes. That doesn't have to be the one you want, at least if attributes are used freely.
introduce leader cache: for consistent/consensus reads, try to send the RPC to the leader. This means picking a random node (or the closest one, but we should think about infinite loops) when no leader is cached, and expiring/updating the cache from NotLeaderErrors (if old==new, just expire).
figure out if there's value in persisting that cache to make a node restart less awkward (for starters probably not, but maybe something to keep in mind).
possibly we want to remember latencies as we execute commands? Do we already have plans to put this somewhere?
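
The leader cache item above could be sketched roughly as follows. This is an illustration only: leaderCache, its methods, and the use of plain integer IDs are assumptions, and the hint passed to onNotLeader stands in for whatever a NotLeaderError would carry.

```go
package main

import "sync"

// leaderCache is a hypothetical in-memory map from range ID to the node
// believed to be the leader. Names and ID types are assumptions.
type leaderCache struct {
	mu      sync.Mutex
	leaders map[int64]int32
}

func newLeaderCache() *leaderCache {
	return &leaderCache{leaders: make(map[int64]int32)}
}

// lookup returns the cached leader, if any. On a miss the caller falls back
// to a random (or closest) replica.
func (c *leaderCache) lookup(rangeID int64) (int32, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.leaders[rangeID]
	return n, ok
}

// onNotLeader applies the hint from a NotLeaderError-style reply. If the
// hint equals the cached entry (old == new) the entry is just expired;
// otherwise the cache is updated to the hinted node.
func (c *leaderCache) onNotLeader(rangeID int64, hinted int32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if old, ok := c.leaders[rangeID]; ok && old == hinted {
		delete(c.leaders, rangeID)
		return
	}
	c.leaders[rangeID] = hinted
}
```

Expiring on old==new avoids bouncing forever between the cache and a node that keeps pointing back at itself, which is one way to address the infinite-loop concern.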

spencerkimball commented 9 years ago

Yes, let's start gossiping a NodeDescriptor.

The attributes are only sorted for the purposes of gossiping. The ones which are specified when starting a node can have an arbitrary order, and I think we should add a comment to that command line flag usage mentioning that attributes should be specified in affinity order. I don't believe you want to pick the replica with the maximum overlap. Better to pick the replica which matches the first node attribute. If none match, choose at random. If multiple match, apply the same algorithm with the second attribute, and so on. The issue with maximum overlap is that not all node attributes necessarily have anything to do with suitability when picking a replica for lowest latency. For example, "gpu" might be a node attribute.
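
A rough sketch of that selection rule, assuming attributes are plain strings and the node's own attributes are already in affinity order. The replica type and the function names are hypothetical.

```go
package main

import "math/rand"

// replica is a hypothetical stand-in for a replica descriptor.
type replica struct {
	nodeID int
	attrs  []string
}

func hasAttr(r replica, attr string) bool {
	for _, a := range r.attrs {
		if a == attr {
			return true
		}
	}
	return false
}

// narrowByAffinity walks the node's attributes in affinity order and keeps
// only the replicas matching each one. It stops as soon as an attribute
// matches nothing or a single candidate remains.
func narrowByAffinity(nodeAttrs []string, replicas []replica) []replica {
	candidates := replicas
	for _, attr := range nodeAttrs {
		if len(candidates) <= 1 {
			break
		}
		var matched []replica
		for _, r := range candidates {
			if hasAttr(r, attr) {
				matched = append(matched, r)
			}
		}
		if len(matched) == 0 {
			break // none match: fall back to a random pick among what is left
		}
		candidates = matched
	}
	return candidates
}

// pickReplica chooses at random among the narrowed candidates.
func pickReplica(nodeAttrs []string, replicas []replica) (replica, bool) {
	c := narrowByAffinity(nodeAttrs, replicas)
	if len(c) == 0 {
		return replica{}, false
	}
	return c[rand.Intn(len(c))], true
}
```

Because later attributes only break ties among replicas that already matched the earlier ones, an attribute such as "gpu" listed last cannot outweigh a datacenter attribute listed first.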

You just mention reads for the leader cache. All writes will have to go to the leader as well.

We probably will want to persist this cache (and gossip as well). Both items should be added to the TODO.md file.

Latencies would be a good addition. Maybe make a note in the code somewhere appropriate. We'll do that when / if necessary.

bdarnell commented 9 years ago

Notes from today's call:

When Store gets an EventLeaderElected, it sends a leader lease request
When MultiRaft gets an MsgApp that includes a leader lease request, it sets a flag to disable vote responses for the duration of the lease
Followers drop all outgoing MsgVoteResps when they have granted an active lease
When a leader applies its own leader lease request, it has the leader lease and does all the stuff in the TODO comment
To avoid clock offset problems, each node only uses its own clock. Leaders start the clock when sending the leader lease request; followers start it when they acknowledge it. This ensures that leaders lose their lease before followers are able to vote again.

tbg commented 9 years ago

How are we dealing with keeping the lease alive? I remember that we thought about doing it at the level of multiraft or even its transport, but with the store sending the initial LeaderLeaseRequest, it seems awkward to scatter the logic around, so in MultiRaft I would keep it to the minimum by simply making sure nodes don't vote while there's a running lease.

@spencerkimball regarding our discussion of what information to send in a Lease: since it has to be inspected by MultiRaft, we need to hang on to the MultiRaft NodeID and GroupID at all times. We should just embed that information instead of a Replica.

A node with a leader lease should try to renew it before it expires, as long as there's traffic. That can be done by the range itself, triggering the same procedure as with the initial lease some time before the lease expires.
If the lease expires (or never gets established in the first place), the range can't be sure it's still the leader. So the timing should be fairly generous to avoid errors which then lead to awkward reelections.

Does that sound reasonable?

spencerkimball commented 9 years ago

Yes, this is what I had in mind. No timers or goroutines...just renew the lease if there's read pressure at the range within a generous offset of the lease expiration.

I'm fine with moving to RaftID (use this, not "GroupID", as this data structure is shared outside of multiraft) and RaftNodeID instead of replica.

tbg commented 9 years ago

PTAL at a first stab at doing the work inside of MultiRaft (setting the deadlines for not voting). Search for "horrible" in the diff (func processLease): It's ugly that we have to break the abstraction between MultiRaft and the outside world and unmarshal everything once. Is there a natural way to improve this? We definitely have to synchronously update the deadline, or we might send out votes we shouldn't have sent out. We also don't want to introduce new Raft message types for leases as that would mess with Raft.

I figure we could use special client command IDs... but that only makes it more efficient by not unmarshalling for fun, not any more fun to look at.

spencerkimball commented 9 years ago

Closed by #604

cockroachdb / cockroach

Leader leases #543