cockroachdb/cockroach

storage: SNAPSHOT isolation violation with write timestamp cache #26460

Closed: bdarnell closed this issue 6 years ago

bdarnell commented 6 years ago

I think the write timestamp cache is currently implemented incorrectly and, in conjunction with the RefreshSpans mechanism, can lead to serializability violations. It is also incorrect for snapshot isolation transactions even before the introduction of RefreshSpans. I haven't reproduced either error yet; I'm just writing up my notes so far.

The problematic code is in applyTimestampCache:

https://github.com/cockroachdb/cockroach/blob/7c8d6f63928ed0169d361dfd63d70a43f4b24411/pkg/storage/replica.go#L2348-L2379

We push the transaction's timestamp based on both the read and write timestamp caches. If the write timestamp cache has the higher value, we also set the WriteTooOld flag on the transaction. Because the read-cache check bumps the timestamp before the write-cache check runs, a read cache entry at or above the write cache's value leaves the write-cache check with nothing to do, so the presence or absence of a read can change whether the WriteTooOld flag gets set.
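To make the masking concrete, here is a minimal sketch of that shape (stand-in types and simplified signatures, not the actual replica.go code):

```go
package main

import "fmt"

// Stand-in types; the real ones live in pkg/roachpb and pkg/storage.
type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }
func (t Timestamp) Next() Timestamp       { return Timestamp{t.WallTime + 1} }

type Transaction struct {
	Timestamp   Timestamp
	WriteTooOld bool
}

// applyTimestampCache mirrors the shape of the current logic: the read-cache
// check advances the timestamp first, so a read entry at or above the write
// entry leaves the write-cache check with nothing to do, and WriteTooOld is
// never set.
func applyTimestampCache(txn *Transaction, readTS, writeTS Timestamp) {
	if txn.Timestamp.Less(readTS.Next()) {
		txn.Timestamp = readTS.Next() // pushed above the read cache entry
	}
	if txn.Timestamp.Less(writeTS.Next()) {
		txn.Timestamp = writeTS.Next()
		txn.WriteTooOld = true // unreachable when readTS >= writeTS
	}
}

func main() {
	txn := &Transaction{Timestamp: Timestamp{WallTime: 1}}
	// A read at t=5 and a write at t=3 on overlapping keys: the txn is
	// pushed to t=6 by the read entry, the write-cache check is a no-op,
	// and WriteTooOld stays false even though the txn jumped past a write.
	applyTimestampCache(txn, Timestamp{WallTime: 5}, Timestamp{WallTime: 3})
	fmt.Println(txn.Timestamp.WallTime, txn.WriteTooOld) // prints: 6 false
}
```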

Pre-2.0, the WriteTooOld flag was mostly optional for serializable transactions: they had to restart when their timestamp was pushed whether or not the flag was set. With RefreshSpans, however, a serializable transaction without WriteTooOld can issue its Refresh requests, see no new individual write intents, and commit at the new timestamp even though it has jumped past a write. (For snapshot transactions, the problem is more direct: the WriteTooOld flag alone determines whether a pushed snapshot transaction must restart. I haven't yet examined whether this could be the cause of #23176.)
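For reference, a Refresh boils down to something like the following sketch (assumed shape, reusing the stand-in Timestamp type from the sketch above): re-check each read span at the pushed timestamp and fail if any value changed in between. Crucially, it only observes committed values and intents on those keys.

```go
// Span is a stand-in for a key span the transaction has read.
type Span struct{ Key, EndKey string }

// canRefresh re-checks every read span: if any key in a span has a committed
// value or intent with a timestamp in (orig, pushed], the transaction's reads
// would change at the new timestamp and it must restart. hasNewerWrite is a
// stand-in for the MVCC scan the real Refresh request performs.
func canRefresh(reads []Span, orig, pushed Timestamp,
	hasNewerWrite func(Span, Timestamp, Timestamp) bool) bool {
	for _, sp := range reads {
		if hasNewerWrite(sp, orig, pushed) {
			return false // a read would observe a new value; restart
		}
	}
	return true // all reads unchanged; safe to commit at the pushed timestamp
}
```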

I think the following pair of transactions would demonstrate the error: one transaction issues an INSERT ON CONFLICT (plus, in the second ordering, a SELECT), while a concurrent transaction issues a DELETE.

The INSERT ON CONFLICT becomes a read-modify-write, so we can expand that into two possible orderings:

  1. Read for INSERT
  2. DELETE
  3. Write for INSERT. The WriteTooOld flag is set and the transaction restarts.
  4. Read for INSERT, seeing the outcome of the delete.
  5. Write for INSERT

or

  1. Read for INSERT
  2. DELETE
  3. SELECT
  4. Write for INSERT. The timestamp gets pushed by the SELECT's read timestamp cache entry, but the WriteTooOld flag is not set because the write timestamp cache no longer holds the higher value.
  5. Refresh for INSERT
  6. The INSERT commits at a timestamp after the DELETE, with a value reflecting its read from before the DELETE.

I think the timestamp cache checks need to be updated to set the WriteTooOld flag if the OrigTimestamp is less than the write timestamp cache value (in addition to pushing).
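In terms of the earlier sketch, the proposed change would look roughly like this (a sketch only, with OrigTimestamp passed in explicitly):

```go
// applyTimestampCacheFixed keys the WriteTooOld decision off the original
// timestamp, so an earlier push by the read cache can no longer mask it.
func applyTimestampCacheFixed(txn *Transaction, origTS, readTS, writeTS Timestamp) {
	if txn.Timestamp.Less(readTS.Next()) {
		txn.Timestamp = readTS.Next()
	}
	if txn.Timestamp.Less(writeTS.Next()) {
		txn.Timestamp = writeTS.Next()
	}
	// Flag WriteTooOld whenever the write timestamp cache is above the
	// transaction's original timestamp, regardless of any pushes above.
	if origTS.Less(writeTS) {
		txn.WriteTooOld = true
	}
}
```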

We also need to add deletions to our jepsen tests to cover cases like this.

bdarnell commented 6 years ago

cc @spencerkimball

nvanbenschoten commented 6 years ago

This all seems to check out, and the example you provide appears to be possible. We bump the BatchRequest timestamp during the read timestamp cache check and then compare against it in the write timestamp cache check, so the first check can certainly affect the outcome of the second.

To fix this we could pull ba.Txn.Timestamp out before performing either check, or we could perform the write timestamp cache check first. However, from the end of this issue, it sounds like even that is not enough: we should have been comparing the write timestamp cache against the OrigTimestamp all along. That checks out with me, although I don't feel confident that I understand the difference between OrigTimestamp and RefreshedTimestamp, or whether the latter should be in play here at all.
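The first of those options would look roughly like this in the earlier sketch's terms. As noted above, it would still be insufficient on its own, since it snapshots the batch's current timestamp rather than using OrigTimestamp:

```go
// applyTimestampCacheAlt captures the pre-check timestamp so the WriteTooOld
// decision is made against it rather than against a value the read-cache
// check may already have bumped. Note that `before` is still the batch's
// timestamp at evaluation time, not the transaction's OrigTimestamp.
func applyTimestampCacheAlt(txn *Transaction, readTS, writeTS Timestamp) {
	before := txn.Timestamp // captured before any pushes
	if txn.Timestamp.Less(readTS.Next()) {
		txn.Timestamp = readTS.Next()
	}
	if txn.Timestamp.Less(writeTS.Next()) {
		txn.Timestamp = writeTS.Next()
	}
	if before.Less(writeTS.Next()) {
		txn.WriteTooOld = true
	}
}
```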

spencerkimball commented 6 years ago

I agree it's a bug that we're not setting the WriteTooOld flag if the timestamp has already been forwarded by the read timestamp cache.

However, I believe the WriteTooOld flag is still not required for SERIALIZABLE transactions. The refresh spans take care of exactly the case the WriteTooOld flag is meant to handle under SNAPSHOT, namely the lost update anomaly. In other words, we want to avoid blithely writing to a key that our transaction previously read at a timestamp earlier than the most recent committed write to it. We need to restart in that case, but the refresh spans will catch it.

There's nothing inherently wrong with simply writing at higher timestamps and letting the WriteTooOld flag go to hell. If nothing has been read during the transaction, that's completely legitimate, even for SNAPSHOT isolation (an optimization we never made). What we can't allow is a transaction that read at an earlier timestamp (@t=0) writing (@t=2) over a newer committed version (@t=1). If checking the refresh spans turns up no later write, then we can proceed.

Losing the WriteTooOld flag for SERIALIZABLE changes nothing because our transaction timestamp advances, so we're forced to either check our refresh spans or retry.
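In the sketch's terms, the serializable commit decision never consults the flag (a sketch reusing canRefresh from above):

```go
// canCommitSerializable sketches the claim: once the timestamp has been
// pushed past the original timestamp, a serializable txn must either pass a
// refresh of its read spans or retry; the WriteTooOld flag plays no part.
func canCommitSerializable(reads []Span, orig, pushed Timestamp,
	hasNewerWrite func(Span, Timestamp, Timestamp) bool) bool {
	if orig.Less(pushed) {
		// Pushed: commit at the new timestamp only if every read still
		// sees the same values there; otherwise restart.
		return canRefresh(reads, orig, pushed, hasNewerWrite)
	}
	return true // never pushed: commit as-is
}
```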

bdarnell commented 6 years ago

The question is whether checking the refresh spans is sufficient. I was thinking that it wasn't because Refresh doesn't have the consultsTSCache flag, so it only sees writes/intents to the same keys, not the gap covered by the DeleteRange's timestamp cache entry. However, to capture our offline conversation, if the DeleteRange deleted a relevant key, it would leave an individual intent. The timestamp cache only matters when the DeleteRange is covering vacant keyspace, in which case the DeleteRange is a no-op (on the keys that matter) and it's only important (for serializable transactions) in that it pushes the timestamp forward.

So we have a bug for SNAPSHOT transactions, but not for SERIALIZABLE, making this a much lower-priority issue.

bdarnell commented 6 years ago

Snapshot isolation is no more, so there's nothing else to do here.