cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.2k stars 3.82k forks source link

kvserver: future-time reads can be invalidated by new leases #66562

Open andreimatei opened 3 years ago

andreimatei commented 3 years ago

Consider the following scenario:

  1. n1 has a lease starting at 0. Present time is 5. The expiration of n1's liveness epoch will expire @ 10 (if not pinged until then). It serves a read at future time 9 for key k.
  2. n1 transfers the lease to n2 @ 6. n2's lease starts at 6. n2's liveness epoch will expire @ 8 (if not pinged until then). The lease transfer includes a summary of the timestamp cache representing the read @9.
  3. n2's liveness epoch expires @ 8 (perhaps the node died).
  4. n3 takes the lease @ 8. It initializes the timestamp cache with 8 across all the keys. The read @9 is not represented.
  5. n3 successfully evaluates a write of k @8. This write is invalid, as it's in the past of the read in step 1.

What went wrong here is that n3 naively assumed that, since it got a "brand new lease" (as opposed to a lease transfer), it can serve reads at timestamps above the lease start time. It made this reasoning because its lease is non-overlapping in time with the previous lease (that's the condition for it acquiring the lease), and so the previous leaseholder could not have served reads at timestamp above the new lease's start (since reads are not served above a lease's expiration). However, unfortunately, this reasoning cannot be applied recursively. Just because n3's lease cannot overlap the previous lease does not mean that it cannot overlap any previous lease. In this example, it overlap n1's lease.

This lease overlap happens because (cooperative) lease transfers can (surprisingly?) cause regressions in the lease's expiration time: n2's lease expires before n1's because n2's liveness epoch expires sooner.

Another way to see this issue, besides seeing it as a failure of a lease disjointness invariant, is to see it as a failure of the timestamp summary propagation mechanism. On (cooperative) lease transfers, the timestamp cache summary is appropriately moved from the old leaseholder's memory to the Raft log. The expectation of the log is that data is put in it will be "monotonic". But in this case it's not monotonic, since n3's ts cache regresses. This can be blamed on n3 not applying the respective log entry ; only the lease transferee applies the summary. We thought that only the new leaseholder needs to be concerned with the summary, but I think this issue shows that we were wrong.

I believe that this bug only manifests itself in the context of future-time reads (i.e. non-blocking transactions). With present-time reads, I think the fact that the read summaries are lost is inconsequential because n3's lease will start after n1's reads.


Interestingly, I believe we had a similar problem with closed timestamp (#66512). Brand new leases (as opposed to transfers) used to carry their lease start time as a closed timestamp, on the argument that it must be higher than any timestamp that might have been closed under a previous lease. Turns out that argument is flawed. We've fixed this by accident in #65624, when we've decided that carrying the lease start as a closed timestamp is no good (for a different reason).

cc @nvanbenschoten

Jira issue: CRDB-8096

andreimatei commented 3 years ago

n1 transfers the lease to n2 @ 6. n2's lease starts at 6. The expiration of n1's liveness epoch will expire @ 8 (if not pinged until then). The lease transfer includes a summary of the timestamp cache representing the read @ 9. n2's liveness epoch expires @ 8 (perhaps the node died).

Should this say "The expiration of n2's liveness epoch will expire @ 8"

Clock advances to 8, and the n2's liveness expires. I think I have it right...

Also, n1 has a lease starting at 10. is very confusing. How can a read at time 9 be served under a lease starting at 10? Should it be 1?

Yes, fixed, thanks.

aayushshah15 commented 3 years ago

I think @ajwerner was referring to point 2 where you're saying that n1's liveness epoch will expire at 8.

andreimatei commented 3 years ago

Ah, yes, sorry, that's n2. Fixed.

arulajmani commented 3 years ago

@nvanbenschoten and I (and Nathan and Andrei, separately) spoke about this offline.

We discussed using the read summary to bump the timestamp cache after non-cooperative lease transfers as well. We probably want something similar to what mergeTrigger does, to merge the existing read summary with the one created using the lease start time. Nathan mentioned that this approach allows us to support reading arbitrarily far (past the lease expiration time) in the future, as long as these reads go through Raft.

I have a few code pointers from our discussion and plan on working on this during breather week.


We also briefly discussed viewing this problem as a violation of the lease disjointness invariant, and possibly disallowing cooperative lease transfers if the outgoing leaseholder has served a future read. In some sense, the outgoing leaseholder revoking the remainder of its lease when it has already served a read in this interval is causing the issue as described. We rejected this approach because a workload with frequent future reads could prevent a range from ever being able to transfer its lease away.

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

andreimatei commented 1 year ago

This issue should stay open, I think. Besides the bug itself, the issue has good discussion.