cockroachdb / cockroach


storage: 1PC transactions can be applied twice #10023

Closed bdarnell closed 8 years ago

bdarnell commented 8 years ago

When a 1PC transaction is retried due to a network failure, the second attempt may return an error even if the first attempt succeeded (and the successful result was masked by the network failure). The most likely error to be seen in this case is WriteTooOldError (caused by the transaction seeing its own past write, which by now has been committed, resolved, and scrubbed of its transaction ID), which is a transactionRestartError. Upon restart, the transaction gets a new timestamp and may perform different writes, which may succeed on a subsequent attempt, leading to the same statement applying twice.

This is the more insidious cousin of #6053 and #7604. Those issues are about the failure leaking back to the client; this one is about the fact that if the operation is retried in some situations, it could succeed.

Here is one concrete case in which the error can occur (a client-side sketch follows the list):

  1. A SQL table has a primary key whose default value is auto-generated (e.g. the default rowid column) and has no secondary indexes
  2. An insert to this table is performed as a single auto-committed statement (allowing the 1PC optimization)
  3. The network fails after the write is proposed. DistSender gets an RPC error and retries.
  4. The initial write succeeds, but there is no one listening so its response goes nowhere. It's a 1PC transaction so it cleans up its intent synchronously.
  5. The retried write fails with a WriteTooOldError
  6. This error is raised to client.Txn, which is running in AutoRetry mode, so it retries this transactionRestartError. (note that AutoRetry is not the problem here - if the error continued on to the client, the client's proper response to a WriteTooOldError is to retry)
  7. This runs the SQL transaction closure, which generates a new row id, avoids the conflict, and commits.
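
To make the scenario concrete from the application's side, here is a minimal Go sketch under assumed conditions: an insecure local node, the lib/pq driver, and a made-up table name and value. The only essential parts are a table whose primary key falls back to the auto-generated rowid and a single auto-committed INSERT, which together allow the 1PC fast path.

```go
// Hypothetical reproduction from the client's point of view; the DSN, table
// name, and value are made up.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// No explicit primary key, so the hidden auto-generated rowid column is
	// the primary key and there are no secondary indexes.
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS kv (v STRING)`); err != nil {
		log.Fatal(err)
	}

	// A single statement outside BEGIN/COMMIT is auto-committed and sent as
	// one KV batch, so it is eligible for the 1PC optimization. If the
	// network fails while this call is in flight, DistSender retries
	// internally; under this bug the retry can pick a new rowid and insert
	// a second copy of the row.
	if _, err := db.Exec(`INSERT INTO kv (v) VALUES ('supposedly-unique')`); err != nil {
		log.Printf("insert failed (but the row may still have been written): %v", err)
	}
}
```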

The response cache (removed in #3077 because it was too expensive and retries of non-transactional requests were deemed less important; the 1PC optimization makes its requests effectively non-transactional at the KV layer) handled this case by remembering the response, so a retry would get the same response back.

A few possible solutions that don't require removing the 1PC optimization or bringing back the response cache:

  1. Distinguish between RPC errors that definitely won't succeed (e.g. connection refused) and those where the outcome is uncertain.
  2. At the SQL layer, perform a separate query to figure out what happened.
  3. Remember all committed transaction ids for a time in something like the abort cache. This would be lighter than the full response cache and would only allow us to tell that this scenario has occurred and return a distinct error code for it, but not recover what exactly happened on the earlier attempt.

petermattis commented 8 years ago

> Distinguish between RPC errors that definitely won't succeed (e.g. connection refused) and those where the outcome is uncertain.

This seems error prone as the code evolves.

> At the SQL layer, perform a separate query to figure out what happened.

Ditto.

> Remember all committed transaction ids for a time in something like the abort cache. This would be lighter than the full response cache and would only allow us to tell that this scenario has occurred and return a distinct error code for it, but not recover what exactly happened on the earlier attempt.

1PC transactions are only allowed for write-only operations, correct? So what do we need to recover other than whether the transaction committed or not?

I'm not seeing what about this problem is specific to 1PC transactions. If a KV batch contains a Put and an EndTransaction and we clean up the transaction record, replay of the request (Put + EndTransaction) can fail with a WriteTooOldError and get propagated back up to the SQL layer. Is there something that protects us in this scenario?

bdarnell commented 8 years ago

> > Distinguish between RPC errors that definitely won't succeed (e.g. connection refused) and those where the outcome is uncertain.
>
> This seems error prone as the code evolves.

This one doesn't seem that bad to me. We just need two kinds of RPC errors instead of one: one for failing to get an RPC connection and one for an error while sending the request. The question is whether that is precise enough for us, or whether the GRPC abstraction prevents us from seeing the distinction we want.
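
A minimal sketch of that split, with made-up type names (nothing here is the real API), showing how a DistSender-style retry loop could consume the two kinds of errors:

```go
package kv

// Hypothetical error types; the names are illustrative and not taken from
// the codebase. sendError means the request never reached the remote,
// uncertainError means it was sent but the outcome is unknown.
type sendError struct{ cause error }
type uncertainError struct{ cause error }

func (e *sendError) Error() string      { return "rpc never sent: " + e.cause.Error() }
func (e *uncertainError) Error() string { return "rpc outcome uncertain: " + e.cause.Error() }

// safeToRetry sketches how a DistSender-style retry loop could use the split:
// only errors that prove the write was never received may be retried blindly.
func safeToRetry(err error) bool {
	switch err.(type) {
	case *sendError:
		return true
	case *uncertainError:
		return false
	default:
		// When in doubt, treat the outcome as uncertain.
		return false
	}
}
```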

> > At the SQL layer, perform a separate query to figure out what happened.
>
> Ditto.

Yep, this seems very tricky and fragile.

> > Remember all committed transaction ids for a time in something like the abort cache. This would be lighter than the full response cache and would only allow us to tell that this scenario has occurred and return a distinct error code for it, but not recover what exactly happened on the earlier attempt.
>
> 1PC transactions are only allowed for write-only operations, correct? So what do we need to recover other than whether the transaction committed or not?

It's legal at the KV layer to include a Get in a 1PC transaction, although we don't currently do it. We do use ConditionalPut, and we need to distinguish ConditionFailedError from other kinds of errors. (ConditionFailedError returns the actual value encountered, although again we don't currently use this)

> I'm not seeing what about this problem is specific to 1PC transactions. If a KV batch contains a Put and an EndTransaction and we clean up the transaction record, replay of the request (Put + EndTransaction) can fail with a WriteTooOldError and get propagated back up to the SQL layer. Is there something that protects us in this scenario?

AutoRetry is unique to 1PC transactions, although if the retry happens at the application level we could still have a problem. However, we accumulate WriteTooOldErrors in transactional requests until the EndTransaction, and a replayed EndTransaction results in a TransactionStatusError (which is not retryable). I'm not completely sure whether we have a problem for multi-phase transactions or not.
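
For illustration, here is a loose model of that guard, with invented state names and helper; the point is only that a replayed EndTransaction that finds a committed, aborted, or already-GC'd transaction record fails non-retryably rather than applying again:

```go
package kv

// Loose model of the EndTransaction guard; the state names and helper are
// invented for illustration.
type txnRecordState int

const (
	statePending txnRecordState = iota
	stateCommitted
	stateAborted
	stateMissing // record already cleaned up
)

// replayedEndTxnFails reports whether a replayed EndTransaction would hit a
// non-retryable TransactionStatusError-style failure instead of silently
// committing the transaction a second time.
func replayedEndTxnFails(s txnRecordState) bool {
	// Only a record that is still PENDING lets the commit proceed; a
	// COMMITTED, ABORTED, or already-GC'd record stops the replay.
	return s != statePending
}
```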

petermattis commented 8 years ago

> This one doesn't seem that bad to me. We just need two kinds of RPC errors instead of one, and use one for getting an rpc connection and one for an error sending the request. The question here is whether that is precise enough for us or does the GRPC abstraction prevent us from seeing the distinction we want.

How would this be used? Would you propagate the error all the way back to the client in the case of an unknown transaction disposition?

> (ConditionFailedError returns the actual value encountered, although again we don't currently use this)

I thought we used this in formatting the error when writing to a unique index, but I can't find any code in sql-land which needs the actual value.

> I'm not completely sure whether we have a problem for multi-phase transactions or not.

I think a first step is to try and write some tests to replicate this problem. The 1PC issue seems obvious and easily testable. And we could see if multi-phase transactions have a related problem.

bdarnell commented 8 years ago

> How would this be used? Would you propagate the error all the way back to the client in the case of an unknown transaction disposition?

Yeah, I think so. If a request that does not contain EndTransaction ends with an unknown disposition, the internal/client.Txn can simply retry. If this happens with an EndTransaction (including 1PC), we have to report it all the way back to the end client.
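
A small sketch of that rule, with a hypothetical function and inputs rather than real code paths:

```go
package kv

// Hypothetical sketch of the rule above; "uncertain" means the RPC error
// left the request's outcome unknown, and "hasEndTxn" means the batch
// carried EndTransaction (including the 1PC fast path).
func handleUncertainOutcome(uncertain, hasEndTxn bool) (retryInternally, reportAmbiguityToClient bool) {
	if !uncertain {
		// Definite failures are handled exactly as they are today.
		return false, false
	}
	if !hasEndTxn {
		// No commit was attempted in this batch, so re-running it inside
		// the same transaction cannot double-apply; internal/client.Txn
		// simply retries.
		return true, false
	}
	// The commit itself is in doubt; only the end client can decide what to
	// do, so the ambiguity is reported all the way back.
	return false, true
}
```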

petermattis commented 8 years ago

> If a request that does not contain EndTransaction ends with an unknown disposition, the internal/client.Txn can simply retry.

Ok, maybe this isn't as fragile as I was imagining. @tamird Do we get a good signal from GRPC for when an RPC failed because the remote is down (and definitely didn't receive the RPC) vs other errors?

bdarnell commented 8 years ago

We definitely get that signal from our own circuit breaker. We might have to err on the side of caution for the request that trips the breaker and consider it uncertain if we can't tell for sure what grpc's error means.
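
A tiny sketch of the conservative rule, with stand-ins for the breaker state rather than the real API:

```go
package kv

// Stand-in for the conservative classification: only a request rejected by a
// breaker that was already open (and therefore never sent) is a definite
// failure; the request that trips the breaker, or any gRPC error we cannot
// interpret precisely, stays uncertain.
func definitelyNotApplied(breakerWasAlreadyOpen, requestHitTheWire bool) bool {
	return breakerWasAlreadyOpen && !requestHitTheWire
}
```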

petermattis commented 8 years ago

Cc @cockroachdb/stability

aphyr commented 8 years ago

> How would this be used? Would you propagate the error all the way back to the client in the case of an unknown transaction disposition?

From the perspective of a client implementer and user, I think it's a great idea to classify every error in your client as known-failed vs indeterminate--maybe via a type (KnownFailure vs MaybeFailure) or a function (indeterminate? err). This is something every application developer has to worry about, so it makes sense to propagate all the way from internals to users. :)
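
For illustration, a Go-flavored sketch of that classification; the interface and helper names are invented, not part of any existing client:

```go
package client

// Sketch of the classification aphyr asks for, in Go terms; the interface
// and helper names are hypothetical.
type indeterminate interface {
	// Indeterminate reports whether the operation may or may not have
	// taken effect.
	Indeterminate() bool
}

// IsIndeterminate tells an application whether err leaves the outcome of an
// operation unknown. Unknown error types default to indeterminate, because
// guessing "known failure" is the dangerous direction.
func IsIndeterminate(err error) bool {
	if err == nil {
		return false
	}
	if ind, ok := err.(indeterminate); ok {
		return ind.Indeterminate()
	}
	return true
}
```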

andreimatei commented 8 years ago

> AutoRetry is unique to 1PC transactions, although if the retry happens at the application level we could still have a problem. However, we accumulate WriteTooOldErrors in transactional requests until the EndTransaction, and a replayed EndTransaction results in a TransactionStatusError (which is not retryable). I'm not completely sure whether we have a problem for multi-phase transactions or not.

Note that TxnExecOptions.AutoRetry is not unique to 1PC txns - we also use it to retry a SQL txn consisting of a whole batch of statements when we know that the client logic is not conditional on reads. But I think what you said after that is spot on: the EndTransaction is key - it allows us to observe the state of the transaction (and a missing state is also a non-retryable error). Anything but 1PC txns will have EndTransactions. So why do you think we might have a problem?

@aphyr, I think the philosophy is that, from a client's perspective, network errors on a COMMIT (or on a batch of statements containing the COMMIT) are always ambiguous. The same goes for "implicit transactions" - statements outside of a BEGIN ... COMMIT. Except that in the case of this bug, of course, we fail to return such an error to the client. I think we discussed at some point having an explicit "ambiguous commit" error specifically for our RELEASE and COMMIT statements, but we dropped that idea.

spencerkimball commented 8 years ago

I assume this happened in practice with Jepsen testing? @bdarnell, I wouldn't expect a WriteTooOldError for the case you outline. The successful EndTransaction call puts a marker in the timestamp cache which will result in the replayed transaction erroring out immediately with TransactionReplayError on the receipt of the next BeginTransaction using the same transaction ID. This is not a retryable error.

petermattis commented 8 years ago

We strip the begin/end txn requests in Replica.executeWriteBatch for 1PC transactions before executing them. Does that affect what gets added to the timestamp cache?

tbg commented 8 years ago

The timestamp cache is populated at a higher level so it would see the EndTransaction, and while the next retry does not execute {Begin,End}Transaction, it does check the timestamp cache for its key and should bump the timestamp. Except, isn't it still the same transaction?

A quick unit test could really answer what happens here without any guessing.

petermattis commented 8 years ago

Agreed about writing a test. Spencer is on it.

Spencer pointed out another issue. Replica.addWriteCmd handles calling Replica.endCmds which adds to the timestamp cache. But Replica.addWriteCmd will return if the client RPC was cancelled (i.e. network error). Seems like we should be adding to the timestamp cache in Replica.processRaftCommand when we're executing on the lease holder (or when sendToClient is true).

spencerkimball commented 8 years ago

I see what's happening. The timestamp cache isn't being updated after a command succeeds in Raft when the caller's context has been cancelled. This is a major correctness issue. The addition of context cancellations introduced a bug here, but it should be easy to fix. What this means is that Ben's case will now result in a TransactionReplayError in the common case where the replay arrives at the same replica.

However, if the leadership changes, then the replay won't be able to return the TransactionReplayError and will instead return a TransactionRetryError. So we still have a problem; it's just going to be significantly less likely to occur.

aphyr commented 8 years ago

> I assume this happened in practice with Jepsen testing?

Yep! Here are two example failure cases, which caused single-statement inserts of unique values into a table to result in duplicate values.

20161015T043421.000Z.zip 20161015T015705.000Z.zip

spencerkimball commented 8 years ago

@tschottdorf my analysis was not correct. The tryAbandon() method that we call does the right thing with the timestamp cache for the case of just the client going away and the context being canceled.

So we're back to the more fundamental problem and unknown solution.

tbg commented 8 years ago

Do you have the unit test?

spencerkimball commented 8 years ago

I abandoned the one I was working on because it was so close to the Store that it was beginning to feel overly scripted. I think we actually want to create a unit test which uses GRPC and multiple replicas.

I spent a lot of time last night mulling this over and built a "txn cache" option in the code to measure performance implications. When running block writer on my local machine, performance seems to suffer by ~4%.
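
For illustration, a sketch of what such a "txn cache" could look like; the type, TTL-based eviction, and string transaction IDs are all assumptions, not the actual implementation Spencer measured:

```go
package kv

import (
	"sync"
	"time"
)

// committedTxnCache remembers the IDs of recently committed transactions so
// that a replayed 1PC batch can be rejected with a distinct error instead of
// restarting with a new timestamp. Names and eviction policy are invented.
type committedTxnCache struct {
	mu  sync.Mutex
	ttl time.Duration
	ids map[string]time.Time // txn ID -> commit time
}

func newCommittedTxnCache(ttl time.Duration) *committedTxnCache {
	return &committedTxnCache{ttl: ttl, ids: make(map[string]time.Time)}
}

// Record is called when a transaction (including a 1PC batch) commits.
func (c *committedTxnCache) Record(txnID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.ids[txnID] = time.Now()
}

// SeenRecently reports whether txnID committed within the TTL; a replayed
// batch for such an ID should get an "already committed / ambiguous" error
// rather than being re-executed.
func (c *committedTxnCache) SeenRecently(txnID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	t, ok := c.ids[txnID]
	if !ok {
		return false
	}
	if time.Since(t) > c.ttl {
		delete(c.ids, txnID)
		return false
	}
	return true
}
```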

tbg commented 8 years ago

> I abandoned the one I was working on because it was so close to the Store that it was beginning to feel overly scripted. I think we actually want to create a unit test which uses GRPC and multiple replicas.

I don't see what's wrong with a test that's not more complicated than it needs to be, at least to get a basic idea.

spencerkimball commented 8 years ago

I think the test needs to work using the distributed sender.

tbg commented 8 years ago

Why? There's certainly a need to validate that whatever fix we end up with works end-to-end, but imo the meat is understanding what happens on the Replica, not orchestrating a complex test across most of the stack right away.

tamird commented 8 years ago

Closed by #10207.