cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Leader leases #543

Closed: spencerkimball closed this issue 9 years ago

spencerkimball commented 9 years ago

@tschottdorf

spencerkimball commented 9 years ago

Tobias, care to consolidate the other lease-related issues here?

tbg commented 9 years ago

Yep, will do.

tbg commented 9 years ago

First steps for the work to be done in the dist_sender:

spencerkimball commented 9 years ago

Yes, let's start gossiping a NodeDescriptor.

The attributes are only sorted for the purposes of gossiping. The ones specified when starting a node can be in arbitrary order, and I think we should add a comment to that command-line flag's usage text saying that attributes should be specified in affinity order. I don't believe you want to pick the replica with the maximum overlap. Better to pick the replica which matches the first node attribute. If none match, choose at random. If multiple match, apply the same algorithm with the second attribute, and so on. The issue with maximum overlap is that not all node attributes necessarily have anything to do with suitability when picking a replica for lowest latency. For example, "gpu" might be a node attribute.
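A minimal sketch of the selection rule described above, assuming each replica carries the attribute set of the node it lives on and the local node's attributes arrive in affinity order. The `Replica` type, `pick_replica`, and the attribute names in the usage lines are hypothetical illustrations, not CockroachDB code:

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    node_id: int
    attrs: frozenset  # hypothetical: attributes of the node holding this replica


def pick_replica(node_attrs, replicas, rng=random):
    """Narrow candidates one attribute at a time, in affinity order.

    If no remaining candidate matches the current attribute, or the
    attributes run out with several candidates left, pick at random
    among the remaining candidates.
    """
    if not replicas:
        raise ValueError("no replicas to choose from")
    candidates = list(replicas)
    for attr in node_attrs:
        matching = [r for r in candidates if attr in r.attrs]
        if not matching:
            break  # nobody matches this attribute: fall back to random
        candidates = matching
        if len(candidates) == 1:
            return candidates[0]
    return rng.choice(candidates)
```

With local attributes `["us-east", "ssd"]`, a replica on a `{"us-east", "ssd"}` node wins over one on `{"us-east", "hdd"}`, while an unrelated attribute such as `"gpu"` only ever breaks ties among replicas that already matched the earlier, more important attributes. That is exactly why this beats maximum overlap.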

You only mention reads for the leader cache. All writes will have to go to the leader as well.
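A rough sketch of a leader cache that routes reads and writes alike, assuming the cache maps a range ID to the node ID of the last known leader and that a non-leader reply updates or clears the entry. `LeaderCache` and `route` are hypothetical names:

```python
class LeaderCache:
    """Hypothetical range ID -> last known leader node ID."""

    def __init__(self):
        self._leaders = {}

    def lookup(self, range_id):
        return self._leaders.get(range_id)

    def update(self, range_id, leader_node_id):
        # Called when a replica reports it is not the leader; None evicts.
        if leader_node_id is None:
            self._leaders.pop(range_id, None)
        else:
            self._leaders[range_id] = leader_node_id


def route(cache, range_id, replica_node_ids):
    """Pick the target node for any request, read or write.

    Use the cached leader if it is still one of the range's replicas;
    otherwise fall back to the first replica in preference order.
    """
    leader = cache.lookup(range_id)
    if leader is not None and leader in replica_node_ids:
        return leader
    return replica_node_ids[0]
```

Because `route` takes no read/write flag, writes cannot accidentally bypass the leader; only the fallback path depends on replica preference order.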

We probably will want to persist this cache (and gossip as well). Both items should be added to the TODO.md file.

Latencies would be a good addition. Maybe make a note in the code somewhere appropriate. We'll do that when / if necessary.

bdarnell commented 9 years ago

Notes from today's call:

tbg commented 9 years ago

How are we keeping the lease alive? I remember we considered doing it at the level of MultiRaft or even its transport, but since the store sends the initial LeaderLeaseRequest, scattering the logic around seems awkward. So I would keep MultiRaft's part to a minimum: simply make sure nodes don't vote while a lease is active.
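A minimal sketch of the "don't vote while a lease is running" rule, assuming MultiRaft learns each lease's expiration when the lease command is applied and consults a per-group deadline before granting a vote. `VoteGuard`, `observe_lease`, and `may_vote` are hypothetical; the clock is injectable so the rule can be exercised without real time:

```python
import time


class VoteGuard:
    """Hypothetical per-group state: refuse votes until the lease expires."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._no_vote_until = 0.0  # lease expiration, in the clock's units

    def observe_lease(self, expiration):
        # Called synchronously when a lease is applied. Assumption: the
        # deadline only moves forward, so a stale lease cannot shorten it.
        self._no_vote_until = max(self._no_vote_until, expiration)

    def may_vote(self):
        return self._clock() >= self._no_vote_until
```

This keeps the lease logic out of the transport: MultiRaft only needs one timestamp per group and a comparison on the vote path.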

@spencerkimball, regarding our discussion of what information to send in a Lease: since it has to be inspected by MultiRaft, we need to hang on to the MultiRaft NodeID and GroupID at all times. We should just embed that information instead of a Replica.

Does that sound reasonable?

spencerkimball commented 9 years ago

Yes, this is what I had in mind. No timers or goroutines; just renew the lease if there's read pressure on the range within a generous offset of the lease expiration.
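A sketch of timer-free renewal, assuming the check runs on the read path and a "generous offset" is modeled as a fixed `renewal_window` before expiration. `Range`, `needs_renewal`, `lease_duration`, and `renewal_window` are hypothetical, and `renew` stands in for proposing a LeaderLeaseRequest through Raft (here it simply moves the expiration forward):

```python
def needs_renewal(now, expiration, renewal_window):
    """True if a read at `now` falls within the window before expiration."""
    return expiration - renewal_window <= now < expiration


class Range:
    def __init__(self, lease_duration, renewal_window):
        self.lease_duration = lease_duration
        self.renewal_window = renewal_window
        self.expiration = 0.0
        self.renewals = 0

    def read(self, now):
        # No background goroutine: only read traffic triggers a renewal,
        # so an idle range simply lets its lease lapse.
        if now >= self.expiration:
            raise RuntimeError("no valid lease; must acquire one first")
        if needs_renewal(now, self.expiration, self.renewal_window):
            self.renew(now)

    def renew(self, now):
        # Stand-in for proposing a LeaderLeaseRequest; also used to acquire.
        self.expiration = now + self.lease_duration
        self.renewals += 1
```

A read early in the lease does nothing; a read inside the window extends it; a range that sees no reads stops renewing without any timer firing.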

I'm fine with moving to RaftID (use this, not "GroupID", as this data structure is shared outside of MultiRaft) and RaftNodeID instead of Replica.
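A sketch of a lease record that MultiRaft can inspect without resolving a Replica, using the RaftID and RaftNodeID naming agreed on here. The `start`/`expiration` fields, their float type, and the helper methods are assumptions for illustration, not the actual proto:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    raft_id: int       # RaftID of the group (not "GroupID")
    raft_node_id: int  # RaftNodeID of the lease holder
    start: float       # hypothetical: lease start time
    expiration: float  # hypothetical: lease end time (exclusive)

    def covers(self, t):
        return self.start <= t < self.expiration

    def held_by(self, raft_node_id):
        return self.raft_node_id == raft_node_id
```

Everything MultiRaft needs (which group, which node, until when) is in plain IDs and timestamps, so it never has to map a Replica back to its own identifiers.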

tbg commented 9 years ago

PTAL at a first stab at doing the work inside MultiRaft (setting the deadlines for not voting). Search for "horrible" in the diff (func processLease): it's ugly that we have to break the abstraction between MultiRaft and the outside world and unmarshal everything once. Is there a natural way to improve this? We definitely have to update the deadline synchronously, or we might send out votes we shouldn't have. We also don't want to introduce new Raft message types for leases, as that would mess with Raft.
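A sketch of the synchronous update, assuming committed entries are JSON-encoded commands with a `type` field. The encoding, `VoteDeadline`, and `process_committed` are hypothetical. Note the cost: every entry is decoded just to spot the lease ones, which is the abstraction leak being complained about:

```python
import json


class VoteDeadline:
    """Hypothetical: this node must not vote before `until`."""

    def __init__(self):
        self.until = 0.0


def process_committed(entries, deadline, apply):
    for data in entries:
        cmd = json.loads(data)  # decoded for every entry, lease or not
        if cmd.get("type") == "lease":
            # Updated inside the loop, before the entry is handed to apply,
            # so the deadline is never behind the applied log.
            deadline.until = max(deadline.until, cmd["expiration"])
        apply(cmd)
```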

I figure we could use special client command IDs... but that only makes it more efficient by not unmarshalling needlessly; it's no more fun to look at.
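A sketch of the special-command-ID idea, assuming lease commands are proposed with IDs carrying a reserved prefix so MultiRaft decodes only those entries. `LEASE_ID_PREFIX`, `lease_expiration`, and the JSON payload are hypothetical:

```python
import json

LEASE_ID_PREFIX = "lease-"  # hypothetical reserved command-ID prefix


def lease_expiration(cmd_id, data):
    """Return the lease expiration for lease commands, else None.

    Ordinary commands are never decoded; they pass through as opaque bytes.
    """
    if not cmd_id.startswith(LEASE_ID_PREFIX):
        return None
    return json.loads(data)["expiration"]
```

As noted in the thread, this saves the wasted decoding but MultiRaft still has to know the lease payload format, so the abstraction is no cleaner.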

spencerkimball commented 9 years ago

Closed by #604