Closed bdarnell closed 8 years ago
Distinguish between RPC errors that definitely won't succeed (e.g. connection refused) and those where the outcome is uncertain.
This seems error prone as the code evolves.
At the SQL layer, perform a separate query to figure out what happened.
Ditto.
Remember all committed transaction ids for a time in something like the abort cache. This would be lighter than the full response cache and would only allow us to tell that this scenario has occurred and return a distinct error code for it, but not recover what exactly happened on the earlier attempt.
1PC transactions are only allowed for write-only operations, correct? So what do we need to recover other than whether the transaction committed or not?
I'm not seeing what about this problem is specific to 1PC transactions. If a KV batch contains a Put
and an EndTransaction
and we cleanup the transaction record, replay of the request (Put
+ EndTransaction
) can fail with a WriteTooOldError
and get propagated back up to the SQL layer. Is there something that protects us in this scenario?
Distinguish between RPC errors that definitely won't succeed (e.g. connection refused) and those where the outcome is uncertain.
This seems error prone as the code evolves.
This one doesn't seem that bad to me. We just need two kinds of RPC errors instead of one, and use one for getting an rpc connection and one for an error sending the request. The question here is whether that is precise enough for us or does the GRPC abstraction prevent us from seeing the distinction we want.
At the SQL layer, perform a separate query to figure out what happened.
Ditto.
Yep, this seems very tricky and fragile.
Remember all committed transaction ids for a time in something like the abort cache. This would be lighter than the full response cache and would only allow us to tell that this scenario has occurred and return a distinct error code for it, but not recover what exactly happened on the earlier attempt.
1PC transactions are only allowed for write-only operations, correct? So what do we need to recover other than whether the transaction committed or not?
It's legal at the KV layer to include a Get
in a 1PC transaction, although we don't currently do it. We do use ConditionalPut
, and we need to distinguish ConditionFailedError
from other kinds of errors. (ConditionFailedError
returns the actual value encountered, although again we don't currently use this)
I'm not seeing what about this problem is specific to 1PC transactions. If a KV batch contains a Put and an EndTransaction and we cleanup the transaction record, replay of the request (Put + EndTransaction) can fail with a WriteTooOldError and get propagated back up to the SQL layer. Is there something that protects us in this scenario?
AutoRetry
is unique to 1PC transactions, although if the retry happens at the application level we could still have a problem. However, we accumulate WriteTooOldErrors in transactional requests until the EndTransaction, and a replayed EndTransaction results in a TransactionStatusError (which is not retryable). I'm not completely sure whether we have a problem for multi-phase transactions or not.
This one doesn't seem that bad to me. We just need two kinds of RPC errors instead of one, and use one for getting an rpc connection and one for an error sending the request. The question here is whether that is precise enough for us or does the GRPC abstraction prevent us from seeing the distinction we want.
How would this be used? Would you propagate the error all the way back to the client in the case of an unknown transaction disposition?
(ConditionFailedError returns the actual value encountered, although again we don't currently use this)
I thought we used this in formatting the error when writing to a unique index, but I can't find any code in sql-land which needs the actual value.
I'm not completely sure whether we have a problem for multi-phase transactions or not.
I think a first step is to try and write some tests to replicate this problem. The 1PC issue seems obvious and easily testable. And we could see if multi-phase transactions have a related problem.
How would this be used? Would you propagate the error all the way back to the client in the case of an unknown transaction disposition?
Yeah, I think so. If a request that does not contain EndTransaction ends with an unknown disposition, the internal/client.Txn
can simply retry. If this happens with an EndTransaction (including 1PC), we have to report it all the way back to the end client.
If a request that does not contain EndTransaction ends with an unknown disposition, the internal/client.Txn can simply retry.
Ok, maybe this isn't as fragile as I was imagining. @tamird Do we get a good signal from GRPC for when an RPC failed because the remote is down (and definitely didn't receive the RPC) vs other errors?
We definitely get that signal from our own circuit breaker. We might have to err on the side of caution for the request that trips the breaker and consider it uncertain if we can't tell for sure what grpc's error means.
Cc @cockroachdb/stability
How would this be used? Would you propagate the error all the way back to the client in the case of an unknown transaction disposition?
From the perspective of a client implementer and user, I think it's a great idea to classify every error in your client as known-failed vs indeterminate--maybe via a type (KnownFailure vs MaybeFailure) or a function (indeterminate? err). This is something every application developer has to worry about, so it makes sense to propagate all the way from internals to users. :)
AutoRetry is unique to 1PC transactions, although if the retry happens at the application level we could still have a problem. However, we accumulate WriteTooOldErrors in transactional requests until the EndTransaction, and a replayed EndTransaction results in a TransactionStatusError (which is not retryable). I'm not completely sure whether we have a problem for multi-phase transactions or not.
Note that TxnExecOptions.AutoRetry
is not unique to 1PC txns - we also use it to retry a SQL txn consisting of a whole batch of statement when we know that the client logic is not conditional on reads.
But I think what you said after is spot on: the EndTransaction
is key - it allows us to observe the state of the transaction (and a missing state is also a non-retryable error). Anything but 1PC txns will have EndTransactions
. So why do you think we might have a problem?
@aphyr, I think the philosophy is that, from a client's perspective, network errors on a COMMIT
(or a batch of statements containing the COMMIT
) are always ambiguous. And so are they for "implicit transactions" - statements outside of a BEGIN... COMMIT
. Except in the case of this bug, of course, we fail to return such an error to the client.
I think we discussed at some point having an explicit "ambiguous commit" error specifically for our RELEASE
and COMMIT
statements, but I think we dropped that idea.
I assume this happened in practice with Jepsen testing? @bdarnell, I wouldn't expect a WriteTooOldError
for the case you outline. The successful EndTransaction
call puts a marker in the timestamp cache which will result in the replayed transaction erroring out immediately with TransactionReplayError
on the receipt of the next BeginTransaction
using the same transaction ID. This is not a retryable error.
We strip the begin/end txn requests in Replica.executeWriteBatch
for 1PC transactions before executing them. Does that affect what gets added to the timestamp cache?
The timestamp cache is populated at a higher level so it would see the EndTransaction, and while the next retry does not execute {Begin,End}Transaction
, it does check the timestamp cache for its key and should bump the timestamp. Except, isn't it still the same transaction?
A quick unit test could really answer what happens here without any guessing.
Agreed about writing a test. Spencer is on it.
Spencer pointed out another issue. Replica.addWriteCmd
handles calling Replica.endCmds
which adds to the timestamp cache. But Replica.addWriteCmd
will return if the client RPC was cancelled (i.e. network error). Seems like we should be adding to the timestamp cache in Replica.processRaftCommand
when we're executing on the lease holder (or when sendToClient
is true).
I see what's happening. The timestamp cache isn't being updated after success within raft in the event that the caller context is cancelled. This is a major correctness issue. The addition of context cancellations introduced a bug here, but it should be easy to fix. What this means is that Ben's case will now result in a TransactionReplayError
in the common case of the transaction returning on replay to the same replica.
However, if the leadership changes, then the replay won't be able to return the TransactionReplayError
and will instead return TransactionRetryError
. So we still have a problem, it's just going to be significantly less likely to occur.
I assume this happened in practice with Jepsen testing?
Yep! Here are two example failure cases, which caused single-statement inserts of unique values into a table to result in duplicate values.
@tschottdorf my analysis was not correct. The tryAbandon()
method that we call does the right thing with the timestamp cache for the case of just the client going away and the context being canceled.
So we're back to the more fundamental problem and unknown solution.
Do you have the unit test?
I abandoned the one I was working on because it was too close to the Store
that it was beginning to feel overly scripted. I think we actually want to create a unittest which uses GRPC and multiple replicas.
I spent a lot of time last night mulling this over and built a "txn cache" option in the code to measure performance implications. When running block writer on my local machine, performance seems to suffer by ~4%.
I abandoned the one I was working on because it was too close to the Store that it was beginning to feel overly scripted. I think we actually want to create a unittest which uses GRPC and multiple replicas.
I don't see what's wrong with a test that's not more complicated than it needs to be, at least to get a basic idea.
I think the test needs to work using the distributed sender.
Why? There's certainly a need to validate that whatever the fix is works end-to-end, but imo the meat is understanding what happens on the Replica, not orchestrating a complex test across most of the stack right away.
Closed by #10207.
When a 1PC transaction is retried due to a network failure, the second attempt may return an error even if the first attempt succeeded (and the successful result was masked by the network failure). The most likely error to be seen in this case is
WriteTooOldError
(caused by the transaction seeing its own past write, which by now has been committed, resolved, and scrubbed of its transaction ID), which is atransactionRestartError
. Upon restart, the transaction gets a new timestamp and may perform different writes, which may succeed on a subsequent attempt, leading to the same statement applying twice.This is the more insidious cousin of #6053 and #7604. Those issues are about the failure leaking back to the client; this one is about the fact that if the operation is retried in some situations, it could succeed.
Here is one concrete case in which the error can occur:
rowid
column) and has no secondary indexesThe response cache (removed in #3077 because it was too expensive and retries of non-transactional requests were deemed less important. The 1PC optimization makes its requests effectively non-transactional at the KV layer) handled this by remembering the response so the retry would get the same response.
A few possible solutions that don't require removing the 1PC optimization or bringing back the response cache: