Open andreimatei opened 3 years ago
n1 transfers the lease to n2 @ 6. n2's lease starts at 6. The expiration of n1's liveness epoch will expire @ 8 (if not pinged until then). The lease transfer includes a summary of the timestamp cache representing the read @ 9. n2's liveness epoch expires @ 8 (perhaps the node died).
Should this say "The expiration of n2's liveness epoch will expire @ 8"
Clock advances to 8, and the n2's liveness expires. I think I have it right...
Also, n1 has a lease starting at 10. is very confusing. How can a read at time 9 be served under a lease starting at 10? Should it be 1?
Yes, fixed, thanks.
I think @ajwerner was referring to point 2 where you're saying that n1's liveness epoch will expire at 8.
Ah, yes, sorry, that's n2
. Fixed.
@nvanbenschoten and I (and Nathan and Andrei, separately) spoke about this offline.
We discussed using the read summary to bump the timestamp cache after non-cooperative lease transfers as well. We probably want something similar to what mergeTrigger
does, to merge the existing read summary with the one created using the lease start time. Nathan mentioned that this approach allows us to support reading arbitrarily far (past the lease expiration time) in the future, as long as these reads go through Raft.
I have a few code pointers from our discussion and plan on working on this during breather week.
We also briefly discussed viewing this problem as a violation of the lease disjointness invariant, and possibly disallowing cooperative lease transfers if the outgoing leaseholder has served a future read. In some sense, the outgoing leaseholder revoking the remainder of its lease when it has already served a read in this interval is causing the issue as described. We rejected this approach because a workload with frequent future reads could prevent a range from ever being able to transfer its lease away.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
This issue should stay open, I think. Besides the bug itself, the issue has good discussion.
Consider the following scenario:
What went wrong here is that n3 naively assumed that, since it got a "brand new lease" (as opposed to a lease transfer), it can serve reads at timestamps above the lease start time. It made this reasoning because its lease is non-overlapping in time with the previous lease (that's the condition for it acquiring the lease), and so the previous leaseholder could not have served reads at timestamp above the new lease's start (since reads are not served above a lease's expiration). However, unfortunately, this reasoning cannot be applied recursively. Just because n3's lease cannot overlap the previous lease does not mean that it cannot overlap any previous lease. In this example, it overlap n1's lease.
This lease overlap happens because (cooperative) lease transfers can (surprisingly?) cause regressions in the lease's expiration time: n2's lease expires before n1's because n2's liveness epoch expires sooner.
Another way to see this issue, besides seeing it as a failure of a lease disjointness invariant, is to see it as a failure of the timestamp summary propagation mechanism. On (cooperative) lease transfers, the timestamp cache summary is appropriately moved from the old leaseholder's memory to the Raft log. The expectation of the log is that data is put in it will be "monotonic". But in this case it's not monotonic, since n3's ts cache regresses. This can be blamed on n3 not applying the respective log entry ; only the lease transferee applies the summary. We thought that only the new leaseholder needs to be concerned with the summary, but I think this issue shows that we were wrong.
I believe that this bug only manifests itself in the context of future-time reads (i.e. non-blocking transactions). With present-time reads, I think the fact that the read summaries are lost is inconsequential because n3's lease will start after n1's reads.
Interestingly, I believe we had a similar problem with closed timestamp (#66512). Brand new leases (as opposed to transfers) used to carry their lease start time as a closed timestamp, on the argument that it must be higher than any timestamp that might have been closed under a previous lease. Turns out that argument is flawed. We've fixed this by accident in #65624, when we've decided that carrying the lease start as a closed timestamp is no good (for a different reason).
cc @nvanbenschoten
Jira issue: CRDB-8096