Closed nvanbenschoten closed 2 weeks ago
It looks like the issue is that `ts.Ctx` is a per-transaction context, but `statement_timeout` only applies at the level of a per-statement context. To fix this, I wonder if we need to make the statement context cancellation also cause the transaction's context to be canceled?
When testing the behavior of range unavailability on Cockroach, the `statement_timeout` variable did not appear to be working.

By default on master (8fcafc0e5774bb62948038edd57723a77c71b5bb), the workload gets stuck for 60 seconds with no successes or errors. It only starts to return errors when replica circuit breakers kick in. We never see `query execution canceled due to statement timeout` errors.

When we look, these transactions are all stuck under a `Txn.Rollback` call in `sql.cleanupAndFinishOnError`. This call is not using a context configured with the statement timeout, so it blocks indefinitely. This undermines the statement timeout.

With the following diff to hack in some timeout for this rollback, the workload behaves as expected. Within a second of killing the nodes, I start seeing errors in the workload output.
Jira issue: CRDB-39186