kv: stale read because intent from our past gets committed in our certain future?

Together with @nvanbenschoten and @ajwerner we're speculating on the following hazard:

txn W is coordinated by node B. It lays down an intent on node A (key k) at ts 95.
txn W gets pushed to ts 105 (taken from B's clock). It refreshes successfully and commits at 105. Node A's clock is at, say, 100; this is within clock offset bounds.
after all this, txn R starts on node A. It gets assigned ts 100. The txn has no uncertainty for node A.
txn W's async intent resolution comes around and resolves the intent on node A, moving the value fwd from ts 95 to 105. Node A's clock I think is not pushed to 105 by this ResolveIntent, but anyway the value of the clock does not matter any more.
txn R reads key k and doesn't see anything. There's a value at 105, but we have no uncertainty. We think it's in our future and so we ignore it.

This is a stale read. Are we missing anything? It seems that the "observed timestamps" mechanism (and the special case of no uncertainty on the gateway) seems to not work with the fact that we started changing intents' timestamps (through the refresh mechanism). I guess even before refreshing, we used to change intents' ts in SNAPSHOT txns, back in the day.

@tbg @spencerkimball @bdarnell @ajwerner @nvanbenschoten @petermattis

Epic: CRDB-1514

Jira issue: CRDB-4498

I'm too distant from the code to know whether this is or is not a problem, but that list of steps looks almost like a test. Seems like this could be transformed into a test and you can verify whether there is a problem or if there is a mechanism we're forgetting about which prevents the problem.

I think you're right; I can't think of anything that would prevent this stale read.

Node A's clock I think is not pushed to 105 by this ResolveIntent

This may be a problem in its own right.

I thought the refresh on the original txn would push node A’s clock forward to ts 105, which should be the low water mark for any subsequent txns coordinated by that node. Any node RPC coming in should forward the recipient node’s clock.

On Tue, Apr 2, 2019 at 3:46 PM Ben Darnell notifications@github.com wrote:

I think you're right; I can't think of anything that would prevent this stale read.

Node A's clock I think is not pushed to 105 by this ResolveIntent

This may be a problem in its own right.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cockroachdb/cockroach/issues/36431#issuecomment-479168373, or mute the thread https://github.com/notifications/unsubscribe-auth/AF3MTXBMhmcbkx0qgzMZR8eOXdF4MlzZks5vc7OIgaJpZM4cY0cJ .

But we don't refresh blind writes, only reads. If we did send an RPC to each leaseholder that has at least one intent during the refresh to sync clocks then I think we would be ok.

Also, canForwardSerializableTimestamp is still going to be broken because that allows us to avoid the refresh entirely.

So the simplest fix I can think of for this is to add the needsRefresh flag to all Request types that have the isTxnWrite flag and to rip out canForwardSerializableTimestamp.

To formalize this a bit, I think the invariant we need to hold for these observed timestamps to be safe is that "a transaction can only commit if any node that serves as the leaseholder for one of its writes as an intent or as a committed value in the future has an HLC clock with an equal or higher timestamp than the commit timestamp".

We can simplify this a little bit. First, we assume that HLC clocks will never regress, so a clock reading before intent resolution will always be smaller than a clock reading after intent resolution. This allows us to ignore talking about committed values directly. Secondly, all forms of lease acquisitions ensure that the new leaseholder's clock is larger than the clock of the old leaseholder at the time that the old leaseholder stopped serving traffic. This allows us to ignore talking about future leaseholders directly.

So we can refine this invariant to "at the time of commit, any leaseholder for a Range that contains an intent for a transaction must have an HLC clock with an equal or higher timestamp than the transaction's commit timestamp."

Good find. I don't have much to add, Nathan's final invariant makes sense to me.

Node B's clock is at, say, 100;

You mean Node A here, right? The sentence earlier mentions that 105 was taken off Node B's clock.

You mean Node A here, right?

Yes, thanks for catching that. I updated the issue.

is this a release blocker? please add to https://github.com/cockroachdb/cockroach/issues/35554 if needed

It's not a release blocker because this behavior is not new. But it is bad and we have to do something. I personally still need to put thought into this, but others seem to be ahead of me.

+1 to what @andreimatei said. It's not a release blocker because we've had this issue since 2.0 and maybe always with snapshot transactions (back when we had those).

I've been trying to demonstrate that this can cause a single-key linearizability violation with https://github.com/cockroachdb/jepsen/pull/19. Once I'm confident that we have randomized testing that would catch this class of error, I'll go in and fix the actual issue.

@tbg how do you feel about reintroducing NoopRequest to solve this? The semantics we want are a read-only request that is addressed to a specific key but doesn't acquire any latches and is a no-op during evaluation. We need the request to evaluate on a leaseholder under a valid lease, but we absolutely don't want it grabbing a read-latch like a RefreshRequest would and blocking on pipelined writes.

SGTM. I would suggest to change the name but once the hlc updates are hoisted into the interceptors, it is going to be a true NoopRequest and the comment on the request should explain why it exists instead.

@nvanbenschoten , can we close this?

No, this hasn't been fixed.

Before reading, it is useful to refresh oneself on the meaning and correctness conditions of observed timestamps, as is detailed here.

Why we abandoned the proposed fix?

There are two reasons why the proposed fix in https://github.com/cockroachdb/cockroach/issues/36431#issuecomment-484247720 stalled even though it was all typed up back in August.

The first is that I was concerned about the impact it would have on performance. It would make transaction refreshing a more expensive operation because we would now need to send an RPC for every write a transaction has performed, in addition to every read. This means we'd also have to keep track of the spans that a transaction has written to, eating into the kv.transaction.max_refresh_spans_bytes budget. Of course, we already do this for intent tracking so maybe we could try merging the tracking of intent spans and refresh spans, but the structure of the txnPipeliner (which tracks intents) and the txnSpanRefresher (which tracks refresh spans) makes this tricky and required some work to get right.

The second, more important reason why the proposed fix stalled is that it was fundamentally incompatible with the Optimistic Global Transactions proposal (detailed first in the consistent read replica RFC and later in a non-public doc). The full version of this proposal relied upon the idea of a parallel commit with an "unbounded commit timestamp". In today's "bounded commit timestamp" form of parallel commits, a transaction stages its transaction record at a timestamp that it guarantees it has refreshed all reads up to. It then succeeds in committing if all of its parallel writes a) replicate successfully and b) lay down an intent at or below the staging txn record's timestamp. If any of these parallel writes get pushed above the staging txn record's timestamp due to the timestamp cache, a conflicting write, the closed timestamp, or something else, the commit attempt fails and must be retried. This is necessary because the transaction can only guarantee that its reads are valid up to the timestamp it wrote in its staging txn record.

This all works sufficiently well in today's expected cluster topologies and deployment modes. However, it runs into serious issues in the Optimistic Global Transactions proposal. This is because of a combination of three reasons.

First, the proposal assumes that communicating with the leaseholders of data that a txn is reading requires a WAN RTT - this is the case the proposal is trying to improve. If this isn't the case, the proposal reduces to exactly the current transaction model.
Second, the proposal relies upon dropping the closed timestamp duration down to O(10-100ms), on the order of a single WAN hop. This allows us to perform consistent reads from followers during transaction evaluation.
Third, the proposal assumes that all reads will be performed locally at first and necessarily validated/refreshed immediately before committing on their leaseholder.

Combined, these three constraints mean that it's simply not possible to refresh a transaction's reads to a timestamp pulled from a clock and then perform all of a transaction's writes at that timestamp without those write being bumped by the closed timestamp. The leaseholders for the reads might be remote and the leaseholders for the writes might be remote, so by the time the writes reach their leaseholders, they will necessarily be below the lowered closed timestamp. It simply doesn't work and forces us back to sequential WAN RTTs - something like write everything, refresh to the maximum write timestamp, then commit [1]. This undermines parallel commits and drives us far away from our goal of a transaction in a single WAN RTT.

To address this, the Optimistic Global Transactions proposal developed the idea of an "unbounded commit timestamp". The idea is that we build the determination of a commit timestamp into the parallel commit protocol itself, instead of pre-determining it after refreshing all reads before the parallel commit. Instead of defining a transaction's commit timestamp as the timestamp in its staging record, we would define it as max(staging record timestamp, intent timestamps...). This would allow writes performed after a WAN hop to be pushed by the closed timestamp without that preventing the commit from succeeding. There is a glaring issue though - how do we ensure that the reads which the txn has performed are still valid at this unbounded commit timestamp? In order to ensure this, we proposed than any "global txn" in this model would acquire pessimistic durable shared read locks when refreshing (see the recent docs) instead of performing reads optimistically (also discussed). The pessimistic shared locks ensure that reads remain "valid" until the locks are dropped. So the new parallel commit protocol with an "unbounded commit timestamp" becomes:

isCommitted := staging record written && all exclusive locks acquired (intents) && all shared locks acquired
commitTS    := max(staging record timestamp, lock timestamps...)

This allows us to broadcast the EndTxn(Staging), all writes, and all read validations in parallel, so we're back to a single WAN RTT! As it turns out, this ends up being quite similar to the commit condition in SLOG, though more flexible and also not requiring deterministic execution.

But there's one huge blocker here - observed timestamp. Remember from their correctness conditions that they require the leaseholder of all writes in a transaction to have an HLC clock whose value is larger than the commit timestamp of that transaction. Needing to uphold that guarantee requires expensive communication which pretty much destroys the "unbounded commit timestamp" optimistic global transaction proposal. In fact, it even destroys the less optimal "bounded commit timestamp" optimistic global transaction proposal [1]. Note that the guarantee doesn't actually need to be enforced at the time of txn commit, just before the client is informed of the success of the commit, but it's all the same. No matter what, the current definition of observed timestamps requires an amount of coordination which we can't even uphold today and which will be a severe drag on any attempts at improving our transaction protocol for global deployments.

A possible fix – observe transaction coordinator timestamps

The key issue with this bug is that we're making an incorrect assumption that a clock reading off a leaseholder replica conveys any information about the maximal MVCC timestamp that a value written on that range by a causal ancestor transaction could end up with. This incorrect assumption is why we don't uphold this invariant. It's pretty easy to demonstrate, as this issue does, that the combination of transactions committing at timestamps above those that they wrote some of their intents at and intent resolution occurring asynchronously, a clock reading from a leaseholder replica doesn't tell us anything about the values that do or do not exist in its keyspace.

In fact, given these constraints, the only node that is guaranteed to have a clock above the commit timestamp of a transaction by the time the txn is acknowledged to the client is the transaction's coordinator node (e.g. the gateway). So one possible fix is to re-work the usage of observed timestamps to stop considering the leaseholder serving the request when shrinking uncertainty windows during a scan and instead conditionally ignore committed values in a scan's uncertainty window based on whether the scanning txn has observed a timestamp from the value's transaction coordinator node. If the scanning transaction has observed a timestamp from the value's transaction coordinator node and the observed timestamp was below the value's timestamp, the value is certainly not in the txn's past.

This wouldn't work today though, because it's impossible to tell who the transaction coordination was for a given committed MVCC value. So for this to work, we would need to add this information to roachpb.Value, either as a separate field or as part of a txn ID (see the txn ID = hlc + nodeID proposal).

One major issue here is that this change assumes that a transaction is just as likely to have previously acquired observed timestamps from nodes that served as txn coordinators for values it later runs into as it is to have previously acquired observed timestamps from nodes that are leaseholders for values it later runs into. This is a fairly reasonable assumption in Cockroach today due to the symmetry between SQL gateways and KV nodes. It certainly is true of the single-node cluster case. However, this assumption completely breaks down when we start talking about a split between a compute (SQL/Txn Coordination) tier and a storage (KV) tier. In that world, transactions would never acquire useful observed timestamps because they would never visit the coordination nodes that wrote the values that they conflict with.

This seems like a large enough issue on its own to prevent this fix from being the right one. However, rethinking the transport mechanism for the dissemination of observed timestamps to something more akin to gossip might be able to save this proposal. Or maybe we propagate this information through KV nodes with some kind of vector clock. There would be a lot to decide here.

A second, less invasive fix – remember when intents were written

Assuming we don't want to abandon leaseholder-scoped observed timestamps for some of the reasons listed above, we need to figure out a way for the clock reading on a leaseholder to tell us something about causality again. We do know that at the time that an intent is written, the leaseholder's clock must be equal to or greater than the intent's timestamp. The trouble comes when the intent's txn changes timestamps after the intent is written, commits, moves the intent to a higher timestamp asynchronously during intent resolution, and clobbers the information the value was carrying about a lower-bound on the leaseholder's clock when the intent was written.

So maybe the issue here is that we are conflating the timestamp that an intent was written at (which passed through the leaseholder's hlc synchronously) with the commit timestamp of the value once it is resolved (which did not). As a strawman, we could imagine storing two timestamps on an MVCC key-value: Timestamp and WrittenTimestamp. Timestamp would continue to be stored in the LSM key and WrittenTimestamp would be stored in the associated roachpb.Value. In the common case, these two would be the same, so we could omit the later. But in cases where an intent is resolved to a higher timestamp (either when committed or when pushed), we would leave its WrittenTimestamp where it started while we move its Timestamp forward. If we then used the WrittenTimestamp when considering whether a key falls into a scan's uncertainty window or not instead of the commit timestamp, we'd fix this bug.

There are a lot of questions here, like how this would interact with time-bound iterators. If a committed value could have an arbitrarily low WrittenTimestamp then we might never be able to place a MaxTimestamp limit on the values exposed to scans, even if we could keep placing a MinTimestamp limit. This might be fine or it might influence us to store the WrittenTimestamp in the LSM key and the Timestamp in the value. That appears to have the reverse problem with TBIs though.

EDIT: this last paragraph is incorrect. Even with an arbitrarily low WrittenTimestamp, a scan could still ignore all values with Timestamps above the scan's global uncertainty limit. So the MaxTimestamp limit could be set to the maximum of the scan's ReadTimestamp and its GlobalUncertaintyLimit.

An easier alternative – burn them all

Get rid of observed timestamps altogether. They serve a single purpose - to reduce uncertainty retries. These retries were devastating back before we had the ability to refresh away a change in a txn's read timestamp, but we've come a long way since then. It's possible that if we were just a little better about refreshing away retry errors server-side by passing a "no refresh spans" flag on all requests instead of just on EndTxn requests, we might be able to simply delete observed timestamps in their entirety without it costing us very much.

So for this to work, we would need to add this information to roachpb.Value, either as a separate field or as part of a txn ID (see the txn ID = hlc + nodeID proposal). ... Timestamp would continue to be stored in the LSM key and WrittenTimestamp would be stored in the associated roachpb.Value.

Versioned keys (i.e. those with a timestamp suffix) do not point to a roachpb.Value. Rather, they point to roachpb.Value.RawBytes. This was done so we don't have to decode the roachpb.Value on every read to extract the RawBytes field, but simply pass through the value. Unfortunately, it complicates adding additional metadata to committed values.

Versioned keys (i.e. those with a timestamp suffix) do not point to a roachpb.Value. Rather, they point to roachpb.Value.RawBytes.

Good point. If we want to add something to the value of each kv, we'll need to change the roachpb.Value encoding, which unfortunately is not versioned. We could probably use the tag byte to introduce versioning if we needed to, which would give us the flexibility to change the encoding.

In my opinion, we should only "burn it all" if we also drastically reduce the default max offset (which, btw, I think we should do - except that for some users we're about to run with second-high offsets to support vmotion). When the max offset is high, it's really easy to cause broad scans that overlap with writes to keep restarting until the uncertainty interval passes. The only reason why we don't see this is because our testing is not wide enough; users have seen it since the very early days. In fact, while we're running with high max offsets, I still like proposals around this old thread for using observed timestamps more.

I also gutturally dislike observed timestamps and second all of the reasons mentioned above (dangerous complexity, gets in our way a lot). More so, I've currently convinced myself (and please scrutinize this) that the whole concept is inherently unsound, not just the haphazard way we've built all of it:

Even with the cleanest abstraction and interface (for example, keep the hlc out of it), ultimately we want to be able to go to a node and extract from it a promise that any write we'll later find above the observed timestamp happened after our first visit. For a concrete example:

txn1 starts at ts=90 max=200
txn1 visits n1 (r1234, doesn't matter), get observed timestamp t=100 (accurately reflects latest writes on n1)
txn1 visits n1 again, scan r1 ([a,b)) at ts=100, see a write with ts=110 -> conclude this write was not there when we first visited. All sound so far.

That's how it's supposed to work, but now things get problematic. Imagine step zero that happened before everything else:

causal predecessor txn0 of txn1 wrote on n2 at ts=105 and committed (note that txn1 must see all of txn0's writes)

and now we pick up the main timeline again:

n1 receives snapshot for r2 ([b,c)), gets lease transferred to it
txn1 reads (on n1) [b,c) with max=100, ignores txn0's write (which came in via the snapshot).

There's just no way to fix this - n2 is wholly unconnected to anything happening at n1 until the data moves. We can attach a span from the observed timestamp, but then they also become a lot less useful (and their usefulness unpredictable). Yes, snapshot vs txn durations make this not a likely problem in practice but that's not the bar we have here.

Nathan's proposal to collect observed timestamps from coordinators (I will just call that coordinator-based causality if he lets me) is structurally much more appealing - but it's not fundamentally a "fix" - it's just a better system that sidesteps the problems altogether (and I'd consider it as such). The WrittenTimestamp proposal on the other hand is a "fix", except it wouldn't actually fix the counter-example above.

The problem observed timestamps are addressing is real. I haven't gone through telemetry or user reports on this (what do we have there?) but I'm convinced we can't just rip out observed timestamps without significant fallout. The bar I would set for the replacement is

a) no regression in user-facing uncertainty restarts b) that regressions in internal restarts can be mitigated (either automatically or by the user).

and it looks like we have various options to get there.

Thanks for the detailed comment. I'll respond in full when I'm back at work on Wednesday.

There's just no way to fix this - n2 is wholly unconnected to anything happening at n1 until the data moves.

I believe you are describing the exact situation discussed in https://github.com/cockroachdb/cockroach/issues/23749 and fixed by https://github.com/cockroachdb/cockroach/commit/85d46276a8b0588a423888c90bececa5d5755b14 – forwarding observed timestamps to the lease start time of the range they are being applied to. Is that correct?

Is that correct?

Yep! Have to work on my definition of "unfixable" :-) I still think something like this might come back when we read from followers more, and there won't be a lease to save us. (We can fix it similarly, though, all we need is to communicate a timestamp with the snapshots). Ugh. I just hate all of this, though.

Coming back to this, because we're scheduled to address this in one way or another in the next release.

Having refreshed myself on the problem and the possible solutions, I think I'm ok ruling out the "observe transaction coordinator timestamps" fix. As is noted above: "this [solution] completely breaks down when we start talking about a split between a compute (SQL/Txn Coordination) tier and a storage (KV) tier." Over the last 6 months, we've bought more into this SQL/KV split, so I don't think we should explore options that are incompatible with it.

So this leaves us with two options - either "remember when intents were written" or get rid of observed timestamps entirely.

Not much about the first option has changed since we originally talked about it. It still seems sound, but still seems complex to implement and may come with performance implications. The only new information I have here is that at one point in the past few months, I measured the percent of transactions that committed at a higher timestamp than they started at while running TPC-C. The result was a little surprising – only 0.6% of transactions did so, so only about 0.6% of committed values would have a WrittenTimestamp under this new scheme. But this may change as we continue to drop the closed timestamp.

The second option still remains compelling. It removes a lot of complexity [*] and gets around this bug entirely. But it comes at a performance cost as well, that of additional uncertainty interval restarts. Still, I'm inclined to push on it for three reasons:

it aligns with our intention to improve clock synchronization. The better clock synchronization becomes, the less of an issue this is (as @andreimatei points out in https://github.com/cockroachdb/cockroach/issues/36431#issuecomment-586388079). And there might be some low-hanging fruit that we could address in this area with all the time we would save by going with this option. For instance, having familiarized myself more with Kronos, I'm now more comfortable saying that we don't need to do anything crazy sophisticated (e.g. estimating WAN RTT) to get clock uncertainty bounds down to right around the latency diameter of a cluster. Things only become more complex when we start talking about bounding clock uncertainty below communication latency.
observed timestamps aren't applicable to follower reads, and we want to start pushing on those more.
we know that observed timestamps are most effective in single-node clusters. In all other cases, their utility drops off significantly (and proportionally to the cluster size) because they're only effective after a txn gateway has already talked to a KV node at least once. If we're mainly worried about increased uncertainty restarts in single-node clusters, we have another option. We can just disable uncertainty intervals in such cases, as they are not relevant to single-node clusters.

No matter what, I think we need to get a better understanding of how important observed timestamps are to performance in various configurations that we care about. We should start off by running some benchmarks with and without them.

cc. @andreimatei since I think it's going to be the two of us working on this in the next release.

[*] I personally think observed timestamps may be the single most difficult part of the KV layer to understand fully. See https://github.com/cockroachdb/cockroach/pull/51555 for an example of how subtle and non-intuitive they are. Or https://github.com/cockroachdb/cockroach/issues/36431#issuecomment-590451617. Working with them often means constructing a tree of happened-before relations in order to answer subtle questions about causality.

I'm starting to take a look into how important observed timestamps are to various benchmarks (tpcc, each ycsb workload, etc.) and in different cluster setups (single node, multiple nodes, different max clock offsets, etc.). The idea is to determine where observed timestamps are providing a benefit to performance so we can determine whether we would need to offset their removal with other improvements if we decide we want to remove them to solve this bug.

So far, I've only done the most basic of testing, but on our most "representative" workload, they don't seem to make any difference. The "old warehouse" count is with observed timestamps enabled and the "new warehouse" count is with them disabled.

name                      old warehouses  new warehouses  delta
tpccbench/nodes=3/cpu=16      2.17k ± 7%      2.19k ± 6%   ~     (p=0.800 n=18+20)

Removing observed timestamps did pose one issue that was a little delicate to get around. It turns out that observed timestamps play an (accidental?) important role in allowing PushTxn requests to never push contending transactions into the future, which is not allowed for a few reasons. Observed timestamps are important here because, with them, a read that hits an intent only needs to push the contending transaction to the read's observed-timestamp-limited MaxTimestamp value, which ends up always being below present time. Without observed timestamps, the read needs to make sure that conflicting intents are fully out of its original uncertainty interval so that it doesn't run into the intent the next time it scans. Since its MaxTimestamp isn't limited by observed timestamps, it may be in the future.

The way I'm "fixing" this now is to upgrade any PUSH_TIMESTAMP push to a PUSH_ABORT push if its PushTo timestamp is in the future. This sounds incorrect, but it's actually not too much of an abuse, especially because pushes rarely ever succeed anymore. Instead, they're mostly used for distributed deadlock detection. I think we could probably do something a little more correct with the separate lock table. In its finished state, we will be able to remember which active intents (checked using a PUSH_TOUCH push) were above a read's timestamp but in its uncertainty interval and ignore their provisional values during the corresponding scan of the MVCC keyspace.

This explains the request timestamp .. less than PushTo timestamp ... error @tbg saw in https://github.com/cockroachdb/cockroach/issues/49360#issuecomment-646487436.

YCSB doesn't paint quite as pretty of a picture, likely because it has such a large amount of read-write contention.

name                   old ops/s    new ops/s    delta
ycsb/A/nodes=3          16.5k ± 3%   13.4k ±11%   -18.92%  (p=0.000 n=9+10)
ycsb/B/nodes=3          31.8k ± 2%   26.9k ± 5%   -15.24%  (p=0.000 n=9+8)
ycsb/C/nodes=3          44.6k ± 1%   40.1k ± 7%   -10.14%  (p=0.000 n=8+10)
ycsb/D/nodes=3          36.3k ± 2%   33.2k ± 6%    -8.49%  (p=0.000 n=10+10)
ycsb/E/nodes=3          1.29k ± 4%   1.21k ± 9%    -5.64%  (p=0.035 n=10+10)
ycsb/F/nodes=3          7.62k ± 2%   6.24k ± 4%   -18.06%  (p=0.000 n=9+10)
ycsb/A/nodes=3/cpu=32   27.0k ± 5%   23.3k ± 6%   -13.76%  (p=0.000 n=10+10)
ycsb/B/nodes=3/cpu=32   92.8k ± 1%   86.1k ± 3%    -7.23%  (p=0.000 n=8+9)
ycsb/C/nodes=3/cpu=32    116k ± 2%    118k ± 2%      ~     (p=0.063 n=10+10)
ycsb/D/nodes=3/cpu=32   93.7k ± 2%   96.1k ± 2%    +2.57%  (p=0.001 n=10+9)
ycsb/E/nodes=3/cpu=32   1.98k ± 7%   2.04k ± 1%      ~     (p=0.089 n=10+10)
ycsb/F/nodes=3/cpu=32   14.4k ± 9%   11.0k ±10%   -23.71%  (p=0.000 n=10+10)

name                   old avg(ms)  new avg(ms)  delta
ycsb/A/nodes=3           5.84 ± 1%    7.20 ±10%   +23.34%  (p=0.000 n=8+10)
ycsb/B/nodes=3           4.54 ± 4%    5.34 ± 4%   +17.57%  (p=0.000 n=10+8)
ycsb/C/nodes=3           3.20 ± 0%    3.59 ± 9%   +12.19%  (p=0.000 n=7+10)
ycsb/D/nodes=3           2.63 ± 3%    2.87 ± 5%    +9.00%  (p=0.000 n=10+9)
ycsb/E/nodes=3           74.6 ± 5%    79.3 ± 9%    +6.27%  (p=0.043 n=10+10)
ycsb/F/nodes=3           12.7 ± 3%    15.4 ± 5%   +21.56%  (p=0.000 n=10+10)
ycsb/A/nodes=3/cpu=32    5.33 ± 4%    6.18 ± 7%   +15.95%  (p=0.000 n=10+10)
ycsb/B/nodes=3/cpu=32    2.10 ± 0%    2.23 ± 3%    +6.35%  (p=0.000 n=8+9)
ycsb/C/nodes=3/cpu=32    1.64 ± 4%    1.60 ± 0%    -2.44%  (p=0.046 n=10+8)
ycsb/D/nodes=3/cpu=32    1.53 ± 5%    1.50 ± 0%      ~     (p=0.173 n=10+9)
ycsb/E/nodes=3/cpu=32    72.8 ± 7%    70.6 ± 1%      ~     (p=0.100 n=10+10)
ycsb/F/nodes=3/cpu=32    10.0 ± 9%    13.1 ±10%   +31.30%  (p=0.000 n=10+10)

name                   old p99(ms)  new p99(ms)  delta
ycsb/A/nodes=3           47.2 ±24%    69.8 ±38%   +48.07%  (p=0.001 n=10+10)
ycsb/B/nodes=3           33.4 ±19%    41.9 ±10%   +25.56%  (p=0.000 n=10+9)
ycsb/C/nodes=3           13.6 ± 0%    15.2 ± 3%   +11.76%  (p=0.000 n=7+9)
ycsb/D/nodes=3           13.8 ± 3%    14.9 ± 5%    +8.49%  (p=0.000 n=10+10)
ycsb/E/nodes=3            324 ± 4%     347 ±11%    +7.25%  (p=0.012 n=10+10)
ycsb/F/nodes=3            108 ±20%     331 ±62%  +205.66%  (p=0.000 n=9+10)
ycsb/A/nodes=3/cpu=32    73.4 ± 9%   108.2 ±12%   +47.44%  (p=0.000 n=10+10)
ycsb/B/nodes=3/cpu=32    14.2 ± 5%    19.8 ±15%   +39.08%  (p=0.000 n=9+9)
ycsb/C/nodes=3/cpu=32    7.90 ± 0%    7.90 ± 0%      ~     (all equal)
ycsb/D/nodes=3/cpu=32    8.20 ± 4%    8.03 ± 2%      ~     (p=0.079 n=10+9)
ycsb/E/nodes=3/cpu=32     231 ± 9%     222 ± 2%      ~     (p=0.142 n=10+10)
ycsb/F/nodes=3/cpu=32     155 ±24%     482 ±23%  +211.12%  (p=0.000 n=10+10)

We see decent size drops across all workloads. I suspect that in all but YCSB-E (with scans), this has more to do with the impact observed timestamps have on contention handling (discussed in https://github.com/cockroachdb/cockroach/issues/36431#issuecomment-714221846) than it does with the real impact they have on avoiding uncertainty interval restarts. In other words, I don't necessarily think we need to keep observed timestamps to avoid this fallout, but we do need to have a way for reads that have waited on a lock to proceed once the lock is removed and replaced at any higher timestamp, instead of only if it is replaced at a timestamp all the way out of its original uncertainty interval.

Put another way, we need to find a way around the behavior:

txn1 (writer): acquire lock k1 @ 10
txn2 (writer): enqueue on lock k1 @ 15
txn3 (reader): enqueue on lock k1 @ 12 with maxTimestamp @ 20

txn1: commit and release lock
txn2: replace lock @ 15
txn3: continue waiting on lock because lock is still in uncertainty interval, even though it is above the read's timestamp

It's not exactly clear to me how we can avoid this behavior, especially without the separate lock table and a lockAwareIterator structure that we can attach some memory to and reference when scanning and noticing an intent in the scan's uncertainty interval. However, it is clear that there is no fundamental reason why the reader txn3 would need to consider the second writer txn2 to be uncertain.

These results also look a little suspicious for the 8 vCPU cases. We see that YCSB-C has a regression. But YCSB-C is read-only, so observed timestamps should be irrelevant. Either I messed something up, I'm not thinking about this correctly, or there was something up with the hardware I used to gather these results. The 32 vCPU cases make a lot more sense. They still show a regression, but only on workloads where we would expect such a regression.

@nvanbenschoten following up from our friday conversation, some questions/comments:

The YCSB behavior you outlined with the example involving {txn1,txn2,txn3}. How did you pinpoint this as the dominant issue when not having observed timestamps, versus txn2 already holding the lock when txn3 arrived in the queue? I am assuming you have some instrumentation -- can you point me to it?
I am worried that we may be over-generalizing for all our customer workloads based on looking at TPCC and YCSB. In theory one can easily construct pathological cases where observed timestamps would be very helpful (and not just for the preceding YCSB behavior), so maybe some customers are seeing significant benefit. Can we add instrumentation to our CC build to approximate the impact?
Regarding correctness of observed timestamps: the ideal is that when txn1 starts, to have a fairly tight upper bound on the time at all nodes, that is close to the time t assigned to txn1. Since we don't have that, we are sampling what is in the past on a node-by-node basis when first visiting a node, to come up with this fairly tight boumd. Except that the visiting does not cover all nodes and we need that coverage to account for txn refreshes. Correct?
- It seems that this is not an issue for follower reads of future writes done by non-blocking transactions, since there we will ensure that the uncertainty period is closed, and we can make the writer wait out a longer period when returning (if it decidest to refresh). I am assuming that is our primary scenario for follower reads. Is that correct?
- For follower reads at nodes that have low latency to the leaseholder (done for load-balancing), or generally to avoid the decreasing benefit of observed timestamps as we increase the nodes that are within a low latency diameter, we could have every node periodically (say every 1ms) exchange an RPC with every other node. Txn1 with coordinator n1 that has not yet read from another node n2 could initialize its observed timestamps for n2 based on the first RPC roundtrip from n1=>n2 after txn1 started. On re-reading this issue, I realized that Andrei already proposed something like this in https://groups.google.com/forum/#!msg/cockroach-db/izZ0yV_VDqk/aM4mjq98BgAJ and I am guessing you were alluding to a more refined version of this when talking about gossip and vector clocks in an earlier comment.
- Coming back to reads at the leaseholder for regular transactions, and problems created by txn1, coordinated at n1, committing at time t, for a reader at n2. Either based on the RPC when writing the intent, or based on the periodic RPC from n1=>n2, we can place a lower bound on n2's clock at a certain clock time of n1 (when the RPC response was received). I am assuming low latency between (n1, n2), since the high latency case is dealt with in non-blocking transactions. Say this pair (at n1) is (t_n1, t_n2) and the transaction has refreshed and is trying to commit at t > t_n1 where t - t_n1 is small (due to the periodic RPCs and low latency). If we could put an upper bound on the clock rate difference of n1 and n2, which would put a lower bound on the clock rate of n2, we could make txn1 wait until we are sure that n2's clock is >= t. I am wondering whether this becomes an easier guarantee to extract from a clock synchronization algorithm, since it is over short time intervals, versus the general clock synchronization problem.

The YCSB behavior you outlined with the example involving {txn1,txn2,txn3}. How did you pinpoint this as the dominant issue when not having observed timestamps, versus txn2 already holding the lock when txn3 arrived in the queue? I am assuming you have some instrumentation -- can you point me to it?

The example didn't come from instrumentation, it just came from thinking through why a workload like YCSB-A would slow down without observed timestamps. YCSB-A contains two types of operations - UPDATEs and SELECTs. The UPDATEs issue an SFU locking Scan followed by a Put. Neither of these operations is affected by observed timestamps because they both hold write latches and throw WriteTooOld errors if an existing value has a higher timestamp. So only the SELECT should be impacted by this change. The SELECT issues a single non-locking Scan operation, so it should never actually carry observed timestamps over to a second request. So the issue must be related to the fact that observed timestamps allow a request to bound its MaxTimestamp to hlc.Now() upon reaching a leaseholder. That's what I meant by play an (accidental?) important role – observed timestamps are helping, but not quite in the way they were meant to.

You bring up a very valid point, though. We should confirm through experimentation that this is exactly what's going on before trying to figure out how to fix it.

I am worried that we may be over-generalizing for all our customer workloads based on looking at TPCC and YCSB. In theory one can easily construct pathological cases where observed timestamps would be very helpful (and not just for the preceding YCSB behavior), so maybe some customers are seeing significant benefit. Can we add instrumentation to our CC build to approximate the impact?

Yes, it's very possible that we are over-indexing the impact of this change on these workloads. These are workloads that we consider "representative" and so we care a lot about them, but they're not the only workloads that we care about. That said, YCSB is just about as contended of a workload as we see people running, and so I'd feel pretty good about a fix to this bug if it had little to no impact on YCSB, given that observed timestamps only affect contended workloads.

We do export a txn.restarts.readwithinuncertainty metric that we could look at to determine how many uncertainty interval errors our customers are currently hitting. I don't know how to extrapolate to that to how many they would hit with this change without more instrumentation.

I also don't know how to get access to metrics in CC. This is something we'll also want for an investigation into clock synchronization in CC. @petermattis do you have any thought about how we'd be able to get anonymized, aggregated metrics from CC into the hands of database engineers? This feedback loop is one of the major theoretical benefits of running CRDB at scale ourselves, but so far, we haven't taken advantage of the opportunity.

Regarding correctness of observed timestamps: the ideal is that when txn1 starts, to have a fairly tight upper bound on the time at all nodes, that is close to the time t assigned to txn1. Since we don't have that, we are sampling what is in the past on a node-by-node basis when first visiting a node, to come up with this fairly tight boumd. Except that the visiting does not cover all nodes and we need that coverage to account for txn refreshes. Correct?

Yes, that's mostly correct. Except that for various reasons including the one discussed in this issue, a node's clock does not actually place an upper bound on the committed or provisional values in its leaseholders' keyspaces.

I mention provisional values because it appears that intents are also subject to a variant of this bug. A PushTxn(PUSH_TIMESTAMP) could move an intent to a timestamp above a leaseholder's clock and invalidate previous observed timestamps. This means that if we decided to go with the WrittenTimestamp fix, we'd need to include this information on intents as well.

It seems that this is not an issue for follower reads of future writes done by non-blocking transactions, since there we will ensure that the uncertainty period is closed, and we can make the writer wait out a longer period when returning (if it decidest to refresh). I am assuming that is our primary scenario for follower reads. Is that correct?

If by not an issue, you mean that observed timestamps do not apply and cannot be used in these cases, then yes, you are correct. We can only use observed timestamps to bound the uncertainty interval when reading on the leaseholder of a range. We will also not be able to use them to bound the uncertainty interval for writes in the future done by non-blocking transactions (which is another reason why I'm motivated to get rid of them).

For follower reads at nodes that have low latency to the leaseholder (done for load-balancing), or generally to avoid the decreasing benefit of observed timestamps as we increase the nodes that are within a low latency diameter, we could have every node periodically (say every 1ms) exchange an RPC with every other node. Txn1 with coordinator n1 that has not yet read from another node n2 could initialize its observed timestamps for n2 based on the first RPC roundtrip from n1=>n2 after txn1 started. On re-reading this issue, I realized that Andrei already proposed something like this in https://groups.google.com/forum/#!msg/cockroach-db/izZ0yV_VDqk/aM4mjq98BgAJ and I am guessing you were alluding to a more refined version of this when talking about gossip and vector clocks in an earlier comment.

Right, though these kinds of proposals are all flawed to some degree because only observed timestamps captured after a txn starts are applicable. You mentioned this in "based on the first RPC roundtrip from n1=>n2 after txn1 started". They also still require us to fix the bug here.

Coming back to reads at the leaseholder for regular transactions, and problems created by txn1, coordinated at n1, committing at time t, for a reader at n2. Either based on the RPC when writing the intent, or based on the periodic RPC from n1=>n2, we can place a lower bound on n2's clock at a certain clock time of n1 (when the RPC response was received). I am assuming low latency between (n1, n2), since the high latency case is dealt with in non-blocking transactions. Say this pair (at n1) is (t_n1, t_n2) and the transaction has refreshed and is trying to commit at t > t_n1 where t - t_n1 is small (due to the periodic RPCs and low latency). If we could put an upper bound on the clock rate difference of n1 and n2, which would put a lower bound on the clock rate of n2, we could make txn1 wait until we are sure that n2's clock is >= t. I am wondering whether this becomes an easier guarantee to extract from a clock synchronization algorithm, since it is over short time intervals, versus the general clock synchronization problem.

This is effectively Spanner's CommitWait protocol, isn't it? Except you're bounding it to only concern itself with clocks on nodes that a transaction performed a write on? Is that what you mean by extracting this from the general clock synchronization problem?

If by not an issue, you mean that observed timestamps do not apply and cannot be used in these cases, then yes.

I additionally meant that trying to construct an observed timestamp-like mechanism is unnecessary for reads of such future writes, since one can prevent false uncertainty by making the writer wait longer.

This is effectively Spanner's CommitWait protocol, isn't it? Except you're bounding it to only concern itself with clocks on nodes that a transaction performed a write on? Is that what you mean by extracting this from the general clock synchronization problem?

Yes, it is a variant of Spanner's CommitWait. In addition to the narrower set of nodes, it also tries to reduce the problem to a shorter interval which reduces the absolute value of the extra wait (which is where the periodic background RPCs come in). For example, say RTT between n1 and n2 was 20ms, and the last RPC that observed n2's clock was sent at time 500 and response was received at 520 with a value of 510. And txn1 is trying to commit at 521. If we assume a lower bound on the clock rate of n2 that is 0.9 of the clock rate of n1, then it needs to wait until 520 + (521-510)/0.9 = 532.2. Not great, but it is only for a transaction that has refreshed. If instead the last RPC that observed n2'c clock was sent at time 0 and response was received at 20 with a value of 10, then the same 0.9 clock rate would result in a wait until 20 + (521-10)/0.9 = 587.7 which is much worse. The guarantee that I wanted to extract, instead of solving the general clock synchronization problem, is this clock rate lower-bound guarantee over short time time intervals. And the question was whether this was easier than solving the general synchronization problem.

I additionally meant that trying to construct an observed timestamp-like mechanism is unnecessary for reads of such future writes, since one can prevent false uncertainty by making the writer wait longer.

Well, kind of. This is only true if we accept non-monotonic reads, as we've been discussing over in https://github.com/cockroachdb/cockroach/pull/52745.

Yes, it is a variant of Spanner's CommitWait. In addition to the narrower set of nodes, it also tries to reduce the problem to a shorter interval which reduces the absolute value of the extra wait (which is where the periodic background RPCs come in). For example, say RTT between n1 and n2 was 20ms, and the last RPC that observed n2's clock was sent at time 500 and response was received at 520 with a value of 510. And txn1 is trying to commit at 521. If we assume a lower bound on the clock rate of n2 that is 0.9 of the clock rate of n1, then it needs to wait until 520 + (521-510)/0.9 = 532.2. Not great, but it is only for a transaction that has refreshed. If instead the last RPC that observed n2'c clock was sent at time 0 and response was received at 20 with a value of 10, then the same 0.9 clock rate would result in a wait until 20 + (521-10)/0.9 = 587.7 which is much worse. The guarantee that I wanted to extract, instead of solving the general clock synchronization problem, is this clock rate lower-bound guarantee over short time time intervals. And the question was whether this was easier than solving the general synchronization problem.

I see what you're saying now. So the idea is that we get a clock reading from each leasesholder written to over the course of a transaction during the normal course of action and we use this to place an upper bound on clock skew between the leaseholder and the txn coordinator. We then assume a bounded clock drift with respect to the txn coordinator and so we know how long we need to wait during a CommitWait stage to ensure that all leaseholders have clocks in excess of the commit timestamp.

That's a very cool idea! It's especially appealing because it's only needed when a transaction has refreshed. My main concern with it is that like Kronos, the clock skew bound can't drop below the response time between the leaseholder and the gateway. This means that transactions that refresh will need to wait out 1/2 RTT * clock_drift, which could get quite large in multi-region clusters. Like other instances of CommitWait, this wait will be concurrent with any work that takes place after a transaction's commit timestamp has been determined, so the "real wait" should be lower, but still not free in some cases.

My other concern is that this makes the observed timestamp interactions even more complex. We have a hard time explaining what an observed timestamp means now, and it would be even harder if they relied on a CommitWait-like procedure for correctness. I think their meaning would shift from something like

an observed timestamp places an upper bound on the timestamps of the values on a node, such that any value observed at a higher timestamp on the node must have been written after the observed timestamp was captured

an observed timestamp places an upper bound on the timestamps of the values on a node, such that any value observed at a higher timestamp on the node must have been written by a transaction that was not causally related to the current observer

Or something like that. That's pretty tricky to think through.

@petermattis I have been thinking back to your comment in https://github.com/cockroachdb/cockroach/issues/36431#issuecomment-586358458. If we did want to revive observed timestamps, I think the current best option is the "remember when intents were written" option. This effectively maintains some sequencing information on committed values that lost theirs during intent resolution so that their relationship to observed timestamps would be retained. I'm interested in whether you have suggestions for how you would store this additional WrittenTimestamp information in KVs. Would you extend the roachpb.Value.RawBytes encoding? Or would you put the additional timestamp in the key?

With this approach, transactions would need to remember both their "original MaxTimestamp" and their "limited-by-observed-timestamps MaxTimestamp" during evaluation (instead of overwriting the former with the latter). A scan could ignore all values above its "original MaxTimestamp". Similarly, any value between the scan time and the "limited-by-observed-timestamps MaxTimestamp" would be uncertain. For values between these two MaxTimestamps, the scan would need to check if the value has a WrittenTimestamp. If not, the value would not be uncertain. If so, the scan would check if the WrittenTimestamp is above or below the "limited-by-observed-timestamps MaxTimestamp" to determine whether the value is uncertain.

A slightly modified approach that allows us to maintain less state is to simply remember whether a committed value was moved by intent resolution or not. So instead of maintaining a WrittenTimestamp hlc.Timestamp, we would maintain a TimestampDidNotHappenBeforeCommit bool. With this approach, a scan would act almost the same as discussed in the previous paragraph, except that for values between the two MaxTimestamps, the scan would just check this bool. If set, the value would be uncertain.

If we only needed to maintain a bool, it would be even more compelling to store this information in the KV's key. I explored whether this TimestampDidNotHappenBeforeCommit concept could be unified with the synthetic timestamp concept in https://github.com/cockroachdb/cockroach/pull/52745 so that we could re-use the bit we're going to reserve in the timestamp's logical field. That would actually mostly work because values with synthetic timestamps won't be subject to observed timestamps for uncertainty interval purposes, but this seems more like an accident than a real unification.

If we only needed to maintain a bool, it would be even more compelling to store this information in the KV's key. I explored whether this TimestampDidNotHappenBeforeCommit concept could be unified with the synthetic timestamp concept in #52745 so that we could re-use the bit we're going to reserve in the timestamp's logical field. That would actually mostly work because values with synthetic timestamps won't be subject to observed timestamps for uncertainty interval purposes, but this seems more like an accident than a real unification.

This may not be such a crazy idea after all. As discussed in #52745, a synthetic timestamp is one that makes "no claim about the value of the clock that it came from". As such, observed timestamps cannot be used to ignore a value with a synthetic timestamp in a read's uncertainty interval, because even if the value's timestamp is above the observed timestamp, we cannot make a claim that the reading txn and the writer of the value were concurrent or that the write cannot have a happened-before relation with the read. This also means that nodes cannot forward their HLC clocks to synthetic timestamps to avoid waiting during a ReadWithinUncertaintyInterval error, but this might not be as big of a deal as I originally thought (see below).

Things are a little different with this issue, but we're once again in a case where the MVCC timestamp of a committed value makes no claims [*] about the clock of the leaseholder at the time that the value's transaction was committed and acknowledged. So once again, observed timestamps cannot be compared to this MVCC timestamp to ignore the value if it would otherwise be in a read's uncertainty interval. So in some ways, marking the timestamp of committed values that are moved during intent resolution as "synthetic" isn't even a semantic abuse.

I brought up the fact that nodes cannot forward their HLC clocks to synthetic timestamps to avoid waiting during a ReadWithinUncertaintyInterval error, but this no longer seems to be a problem with this new proposal. For transactions that don't commit in the future (see non-blocking transactions), their commit timestamp will be pulled from some clock, so as long as the HLCs are propagated during intent resolution and later on the RWUI error after observing one of these "forwarded values" in its uncertainty interval, I don't think we'll ever get in a case where a transaction needs to wait before refreshing, because it's clock should always be above the synthetic timestamp of the value.

[*] It actually does make some claim, but just the leaseholder and all other nodes in the cluster could not have been lagging the commit timestamp by more than max_offset. This is, of course, why uncertainty intervals work in the first place.

It looks like this issue was incorrectly closed by the commit message in 11abdda

Thanks for catching that. This wasn't supposed to be closed. I guess GitHub got the wrong idea from the sentence:

This seemed difficult to get right, and seemed particularly concerning since we're planning on marking only some of a transaction's committed values as synthetic to fix #36431

A second, less invasive fix – remember when intents were written

Assuming we don't want to abandon leaseholder-scoped observed timestamps for some of the reasons listed above ...

I did some digging to understand how other systems with similar architectures to CRDB approach this issue, if at all. It turns out that this "remember when intents were written" approach is exactly what YB does. But they then jump through some serious hoops (see github.com/yugabyte/yugabyte-db/issues/4535) to avoid permanently bloating their LSM values with this second timestamp. This additional timestamp recording when an intent was written is referred to as the intent_doc_ht_ in the code.

Getting this right and making it efficient is very important for YB and, I imagine, Spanner, because their transaction/replication model results in effectively all provisional values being moved to higher timestamps during resolution. This is due to their "closed timestamp" equivalent, which bumps all writes to new timestamps, so it's very rare (maybe not possible) for a cross-range transaction to commit at the timestamp that it original wrote values at. This is not quite the case for Cockroach, where our transaction/replication model allows for most short-lived transactions to commit at their original provisional commit timestamp. We saw this before in our TPC-C testing:

I measured the percent of transactions that committed at a higher timestamp than they started at while running TPC-C. The result was a little surprising – only 0.6% of transactions did so

Reflecting on this, I'm satisfied with the solution laid out in https://github.com/cockroachdb/cockroach/issues/36431#issuecomment-719157010. It's less optimal from the perspective of minimizing uncertainty restarts under contention, but it's simpler than conditionally storing an entirely new timestamp in MVCC values, cheaper from a storage perspective, and composes with the work we already had to do for non-blocking transactions in https://github.com/cockroachdb/cockroach/pull/56373 and https://github.com/cockroachdb/cockroach/pull/57077. The remaining work item is setting this synthetic bit during intent resolution when appropriate.

I read back through the synthetic timestamp upon re-written intent idea, it still seems sound. One note is that we probably want to have all of the writes due to a single transaction carry the same timestamp on disk, including the synthetic bit. Given that, my thinking is that we should use the synthetic bit for all writes due to a transaction which rewrote any intents rather than just for the intents which were rewritten. This has the negative consequence that readers are more likely to encounter these more disruptive synthetic timestamps. I can be convinced otherwise, but, at least for an initial implementation, it seems like a hard sell to have writes due to a single transaction carrying different timestamps.

cockroachdb / cockroach