cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com
Other
29.59k stars 3.71k forks source link

sql: `statement_timeout` not respected for implicit read-write transactions #124987

Closed nvanbenschoten closed 2 weeks ago

nvanbenschoten commented 1 month ago

When testing the behavior of range unavailability on Cockroach, the statement_timeout variable did not appear to be working.

roachprod create local -n3
roachprod put local cockroach
roachprod start local --racks=3

roachprod run local:1 -- './cockroach workload init kv {pgurl:1}'
roachprod sql local:1 -- -e "ALTER DATABASE kv CONFIGURE ZONE USING constraints = COPY FROM PARENT, lease_preferences='[[+rack=0]]'"
roachprod sql local:1 -- -e "ALTER ROLE roachprod SET statement_timeout='1s'"

# Run a kv workload.
roachprod run local:1 -- './cockroach workload run kv --init --concurrency=50 --max-rate=500 --tolerate-errors {pgurl:1}'

# After it has run for a few seconds, kill 2 of 3 nodes.
roachprod stop local:2,3

By default on master (8fcafc0e5774bb62948038edd57723a77c71b5bb), the workload gets stuck for 60 seconds with no successes or errors. It only starts to return errors when replica circuit breakers kick in. We never see query execution canceled due to statement timeout errors.

When we look, these transaction are all stuck under a Txn.Rollback call in sql.cleanupAndFinishOnError. This call is not using a context configured with the statement timeout, so it blocks indefinitely. This undermines the statement timeout.

...
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send(0x14017533900, {0x10c28fcb0, 0x14017b22ab0}, 0x14007af7b00)
       github.com/cockroachdb/cockroach/pkg/kv/txn.go:1282 +0x224
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).rollback(0x14017533900, {0x10c28fcb0, 0x14017b22ab0})
       github.com/cockroachdb/cockroach/pkg/kv/txn.go:945 +0x104
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Rollback(0x1400bb58e08?, {0x10c28fcb0?, 0x14017b22ab0?})
       github.com/cockroachdb/cockroach/pkg/kv/txn.go:929 +0x70
github.com/cockroachdb/cockroach/pkg/sql.cleanupAndFinishOnError.func1(0x1401bc81a50)
       github.com/cockroachdb/cockroach/pkg/sql/conn_fsm.go:597 +0x70
github.com/cockroachdb/cockroach/pkg/sql.cleanupAndFinishOnError({{0x10c28fcb0, 0x1401057cd80}, {0x10c246d20, 0x14002492260}, {0x10bf9bbe0, 0x1401bc81a50}, {0x10c246d40, 0x14011226890}, {0x10ba59ec0, 0x140112268a0}})
       github.com/cockroachdb/cockroach/pkg/sql/conn_fsm.go:598 +0x64
github.com/cockroachdb/cockroach/pkg/util/fsm.Transitions.apply({0x10c28fcb0?}, {{0x10c28fcb0, 0x1401057cd80}, {0x10c246d20, 0x14002492260}, {0x10bf9bbe0, 0x1401bc81a50}, {0x10c246d40, 0x14011226890}, {0x10ba59ec0, ...}})
       github.com/cockroachdb/cockroach/pkg/util/fsm/fsm.go:106 +0xe8
github.com/cockroachdb/cockroach/pkg/util/fsm.(*Machine).ApplyWithPayload(...)
       github.com/cockroachdb/cockroach/pkg/util/fsm/fsm.go:135
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).txnStateTransitionsApplyWrapper.func1(0x1401bc81988, {0x10c246d40, 0x14011226890}, {0x10ba59ec0, 0x140112268a0})
       github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:3903 +0x138
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).txnStateTransitionsApplyWrapper(0x1401bc81988, {0x10c246d40?, 0x14011226890?}, {0x10ba59ec0, 0x140112268a0}, {0x15c1aaf70, 0x1401b438b30}, 0x10c28fcb0?)
       github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:3904 +0xb4
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(0x1401bc81988)
       github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:2543 +0xbd4
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0x1401bc81988, {0x10c28fcb0, 0x1401bacab40}, 0x14007bfbf00?, 0x1401b145cb8?, 0x1401baa5050?)
       github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:2217 +0x168
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0x14003aefa40?, {0x10c28fcb0?, 0x1401bacab40?}, {0x1401bac3009?}, 0x9?, 0x1401baca990?)

With the following diff to hack in some timeout for this rollback, the workload behaves as expected. Within a second of killing the nodes, I start seeing errors in the workload output.

diff --git a/pkg/sql/conn_fsm.go b/pkg/sql/conn_fsm.go
index 96ddc520ebc..85f675761aa 100644
--- a/pkg/sql/conn_fsm.go
+++ b/pkg/sql/conn_fsm.go
@@ -17,6 +17,7 @@
 package sql

 import (
+       "context"
        "time"

        "github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/isolation"
@@ -592,9 +593,11 @@ func (ts *txnState) finishTxn(ev txnEventType, advance advanceCode) error {
 func cleanupAndFinishOnError(args fsm.Args) error {
        ts := args.Extended.(*txnState)
        func() {
+               ctx, cancel := context.WithTimeout(ts.Ctx, 100*time.Millisecond)
+               defer cancel()
                ts.mu.Lock()
                defer ts.mu.Unlock()
-               _ = ts.mu.txn.Rollback(ts.Ctx)
+               _ = ts.mu.txn.Rollback(ctx)
        }()
        finishedTxnID, _ := ts.finishSQLTxn()

Jira issue: CRDB-39186

rafiss commented 1 month ago

It looks like the issue is that the ts.Ctx is a per-transaction context, but the statement_timeout only applies at the level of a per-statement context. To fix this, I wonder if we need to make the statement context cancellation also cause the transaction's context to be canceled?