cockroachdb / cockroach

roachtest: kv50/enc=false/nodes=4/cpu=96/batch=64 failed [SendError bubbles up] #69419

Closed: cockroach-teamcity closed this issue 3 years ago

cockroach-teamcity commented 3 years ago

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ ab1fc343c9a1140191f96353995258e609a84d02:

          | I210826 14:26:22.303165 1 workload/cli/run.go:367  [-] 1  DEPRECATION: the --init flag on "workload run" will no longer be supported after 19.2
          | I210826 14:26:22.332804 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
          | I210826 14:26:31.980697 643 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
          | I210826 14:26:31.981413 1 workload/cli/run.go:408  [-] 4  creating load generator...
          | I210826 14:26:32.069848 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 88.434076ms)
          | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r315: replica (n4,s4):3 not lease holder; current lease is repl=(n3,s3):1 seq=4 start=1629987984.434423504,48 epo=1 pro=1629987984.434423504,45 (SQLSTATE XXUUU)
          | Error: COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 5. Command with error:
          |   | ``````
          |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
          |
          | stdout:
        Wraps: (4) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

    monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6309
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:208
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1371
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

/cc @cockroachdb/kv-triage


cockroach-teamcity commented 3 years ago

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 8cae60f603ccc4d83137167b3b31cab09be9d41a:

          | I210827 11:06:56.083448 1 workload/cli/run.go:367  [-] 1  DEPRECATION: the --init flag on "workload run" will no longer be supported after 19.2
          | I210827 11:06:56.110666 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
          | I210827 11:07:04.125043 669 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
          | I210827 11:07:04.125758 1 workload/cli/run.go:408  [-] 4  creating load generator...
          | I210827 11:07:04.217656 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 91.896941ms)
          | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r212: replica (n1,s1):1 not lease holder; current lease is repl=(n2,s2):3 seq=5 start=1630062418.179889391,0 epo=1 pro=1630062418.179841008,0 (SQLSTATE XXUUU)
          | Error: COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 5. Command with error:
          |   | ``````
          |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
          |
          | stdout:
        Wraps: (4) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

    monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6309
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:208
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1371
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

/cc @cockroachdb/kv-triage


cockroach-teamcity commented 3 years ago

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 44ea1fa0eba8fc78544700ef4afded62ab98a021:

          | I210828 11:29:24.947587 1 workload/cli/run.go:367  [-] 1  DEPRECATION: the --init flag on "workload run" will no longer be supported after 19.2
          | I210828 11:29:24.976838 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
          | I210828 11:29:33.529682 653 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
          | I210828 11:29:33.530441 1 workload/cli/run.go:408  [-] 4  creating load generator...
          | I210828 11:29:33.609252 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 78.810482ms)
          | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r507: replica (n1,s1):1 not lease holder; current lease is repl=(n4,s4):3 seq=4 start=1630150168.631056505,0 epo=1 pro=1630150168.630986488,0 (SQLSTATE XXUUU)
          | Error: COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 5. Command with error:
          |   | ``````
          |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
          |
          | stdout:
        Wraps: (4) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

    monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6309
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:208
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1371
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

/cc @cockroachdb/kv-triage


tbg commented 3 years ago
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x28f1142]

goroutine 34401 [running]:
panic(0x4ab4b60, 0x8ebc610)
    /usr/local/go/src/runtime/panic.go:1065 +0x565 fp=0xc00b231c18 sp=0xc00b231b50 pc=0x48b265
runtime.panicmem()
    /usr/local/go/src/runtime/panic.go:212 +0x5b fp=0xc00b231c38 sp=0xc00b231c18 pc=0x4892bb
runtime.sigpanic()
    /usr/local/go/src/runtime/signal_unix.go:734 +0x173 fp=0xc00b231c70 sp=0xc00b231c38 pc=0x4a2fb3
github.com/cockroachdb/cockroach/pkg/sql/colexecop.Closers.CloseAndLogOnErr(0xc00b987c80, 0x1, 0x1, 0x69031d8, 0xc00c18e1b0, 0xc010a11500, 0x26)
    /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/sql/colexecop/operator.go:157 +0xc2 fp=0xc00b231d08 sp=0xc00b231c70 pc=0x28f1142
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*ParallelUnorderedSynchronizer).init.func1.1(0xc00c18e1e0, 0xc003bfc300, 0x7f30d567e4f0, 0xc002de6340, 0xc00b987cc0, 0x1, 0x1, 0xc00b987c60, 0x1, 0x1, ...)
    /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/parallel_unordered_synchronizer.go:227 +0x9e fp=0xc00b231d50 sp=0xc00b231d08 pc=0x2cddb9e
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*ParallelUnorderedSynchronizer).init.func1(0xc003bfc300, 0x7f30d567e4f0, 0xc002de6340, 0xc00b987cc0, 0x1, 0x1, 0xc00b987c60, 0x1, 0x1, 0xc00b987c80, ...)
    /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/parallel_unordered_synchronizer.go:278 +0xc19 fp=0xc00b231f78 sp=0xc00b231d50 pc=0x2cde8b9
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc00b231f80 sp=0xc00b231f78 pc=0x4c5cc1
created by github.com/cockroachdb/cockroach/pkg/sql/colexec.(*ParallelUnorderedSynchronizer).init
    /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/parallel_unordered_synchronizer.go:216 +0x149
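
The crash reduces to calling a method on a nil operator in the Closers slice; here is a minimal, self-contained Go sketch of that failure mode (illustrative only, not the actual colexecop code):

```go
package main

import "fmt"

// op stands in for a vectorized operator; none of these names are
// CockroachDB APIs.
type op struct{ name string }

func (o *op) Close() error {
	// A nil receiver panics on this dereference with the same
	// "invalid memory address or nil pointer dereference" SIGSEGV
	// as in the trace above.
	fmt.Println("closing", o.name)
	return nil
}

func main() {
	closers := []*op{nil} // e.g. an operator that was never initialized
	for _, c := range closers {
		_ = c.Close() // panic: runtime error: nil pointer dereference
	}
}
```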
tbg commented 3 years ago

I do wonder why this wasn't picked up by the dead node detector:

teardown: 14:26:32 cluster.go:519: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3359107-1629959017-96-n5cpu96 --oneshot --ignore-empty-nodes

cockroach-teamcity commented 3 years ago

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 6700cf65e82a029a1ff76a75250c709aa9637170:

          | I210829 13:31:16.325701 1 workload/cli/run.go:367  [-] 1  DEPRECATION: the --init flag on "workload run" will no longer be supported after 19.2
          | I210829 13:31:16.350689 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
          | I210829 13:31:25.114599 643 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
          | I210829 13:31:25.115458 1 workload/cli/run.go:408  [-] 4  creating load generator...
          | I210829 13:31:25.197668 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 82.208305ms)
          | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r1038: replica (n3,s3):1 not lease holder; current lease is repl=(n1,s1):4 seq=6 start=1630243884.173452106,0 epo=1 pro=1630243884.173374808,0 (SQLSTATE XXUUU)
          | Error: COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 5. Command with error:
          |   | ``````
          |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
          |
          | stdout:
        Wraps: (4) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

    monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6309
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:208
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1371
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

/cc @cockroachdb/kv-triage


yuzefovich commented 3 years ago

After fixing the vectorized engine's failure to wait for concurrent goroutines to exit, I now see the following when running this roachtest:

| I210830 21:22:15.612671 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
| I210830 21:22:23.505297 642 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
| I210830 21:22:23.505986 1 workload/cli/run.go:408  [-] 4  creating load generator...
| I210830 21:22:23.599384 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 93.397537ms)
| Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r338: replica (n3,s3):1 not lease holder; current lease is repl=(n4,s4):2 seq=5 start=1630358538.207335498,0 epo=1 pro=1630358538.207254835,0 (SQLSTATE XXUUU)

As a bit of background, the recently merged #68679 introduced parallelism of TableReaders when the flow is local. The fact that this roachtest uses batches satisfies the requirements to trigger that parallelism. Under the hood, we use DistSQLPlanner.PartitionSpans to split the spans to scan, with the help of the RangeIterator coming from the DistSender. This error seems to occur very quickly, probably before the range cache has been populated. So I'm wondering whether we need to disable this TableReader parallelism until the range cache is populated, or something like that. Is there a way to check that easily? @andreimatei maybe you would know?
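
To frame the question, here is a toy sketch of the gating idea: allow parallel local scans only when every span already resolves in the range cache, so planning never has to do a real range lookup. All names below are made up for illustration; none of them are CockroachDB APIs.

```go
package main

import "fmt"

type span struct{ startKey string }

// toyRangeCache stands in for the real RangeCache.
type toyRangeCache struct{ cached map[string]bool }

func (rc *toyRangeCache) hasEntry(key string) bool { return rc.cached[key] }

// canParallelize reports whether all spans hit the cache; a single
// miss would fall back to one serial TableReader.
func canParallelize(rc *toyRangeCache, spans []span) bool {
	for _, sp := range spans {
		if !rc.hasEntry(sp.startKey) {
			return false
		}
	}
	return true
}

func main() {
	rc := &toyRangeCache{cached: map[string]bool{"/Table/53/1/100": true}}
	fmt.Println(canParallelize(rc, []span{{"/Table/53/1/100"}})) // true
	fmt.Println(canParallelize(rc, []span{{"/Table/53/1/999"}})) // false: cache not populated yet
}
```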

yuzefovich commented 3 years ago

Hm, #69612 in combination with the following diff

diff --git a/pkg/workload/workloadsql/workloadsql.go b/pkg/workload/workloadsql/workloadsql.go
index ab57432d4f..2d22140488 100644
--- a/pkg/workload/workloadsql/workloadsql.go
+++ b/pkg/workload/workloadsql/workloadsql.go
@@ -183,7 +183,9 @@ func Split(ctx context.Context, db *gosql.DB, table workload.Table, concurrency
                        }
                }
                close(splitCh)
-               return nil
+               log.Info(ctx, "populating the range cache")
+               _, err := db.Exec(fmt.Sprintf(`SELECT count(*) FROM %s`, table.Name))
+               return err
        })
        return g.Wait()
 }

makes the roachtest pass.

yuzefovich commented 3 years ago

Never mind, two commits in #69612 seem to be sufficient.

cockroach-teamcity commented 3 years ago

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ c1ef81f5f435b3cc5bdf8b218532e0779f03a6bf:

          | I210831 14:31:16.514176 1 workload/cli/run.go:367  [-] 1  DEPRECATION: the --init flag on "workload run" will no longer be supported after 19.2
          | I210831 14:31:16.544727 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
          | I210831 14:31:24.446707 626 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
          | I210831 14:31:24.447385 1 workload/cli/run.go:408  [-] 4  creating load generator...
          | I210831 14:31:24.537475 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 90.088872ms)
          | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r166: replica (n4,s4):1 not lease holder; current lease is repl=(n3,s3):3 seq=4 start=1630420277.366149094,0 epo=1 pro=1630420277.366087670,0 (SQLSTATE XXUUU)
          | Error: COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 5. Command with error:
          |   | ``````
          |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
          |
          | stdout:
        Wraps: (4) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

    monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6309
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:208
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1371
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

/cc @cockroachdb/kv-triage


yuzefovich commented 3 years ago

Never mind again: #69612 doesn't address the NotLeaseHolderErrors even with explicitly populating the range cache. I also ran into `ERROR: result is ambiguous (context canceled) (SQLSTATE 40003)`.

I'm pretty sure that #68679 is to blame for these issues, so I'm adding a GA-blocker label.

yuzefovich commented 3 years ago

My current guess is that because we cancel the context of DistSender.Send once we no longer need the results of that scan, the context cancellation surfaces back up as these various KV-layer errors, but at the parallel unordered synchronizer level we swallow only plain context cancellations. As a result, we incorrectly propagate these errors as query errors when we should have swallowed them.
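
In other words, the synchronizer should probably decide based on which context was canceled rather than on the shape of the error. A hypothetical helper sketching that idea (names are illustrative, not the actual colexec code):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// shouldSwallow sketches the idea above: if the query's context is
// still live but the input's context (which the synchronizer itself
// cancels once it no longer needs results) is done, then any error the
// input produced is a byproduct of draining and should be swallowed,
// regardless of how the KV layer dressed it up.
func shouldSwallow(queryCtx, inputCtx context.Context, err error) bool {
	return err != nil && queryCtx.Err() == nil && inputCtx.Err() != nil
}

func main() {
	queryCtx := context.Background()
	inputCtx, cancel := context.WithCancel(queryCtx)
	cancel() // the synchronizer canceled this input's scan

	err := errors.New("NotLeaseHolderError: lease held by different store")
	fmt.Println(shouldSwallow(queryCtx, inputCtx, err)) // true: expected noise, not a query error
}
```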

yuzefovich commented 3 years ago

Adding this diff:

diff --git a/pkg/sql/colexec/parallel_unordered_synchronizer.go b/pkg/sql/colexec/parallel_unordered_synchronizer.go
index 8cfb01c1d7..208679d1af 100644
--- a/pkg/sql/colexec/parallel_unordered_synchronizer.go
+++ b/pkg/sql/colexec/parallel_unordered_synchronizer.go
@@ -13,6 +13,7 @@ package colexec
 import (
        "context"
        "fmt"
+       "strings"
        "sync"
        "sync/atomic"

@@ -250,7 +251,9 @@ func (s *ParallelUnorderedSynchronizer) init() {
                                case parallelUnorderedSynchronizerStateRunning:
                                        if err := colexecerror.CatchVectorizedRuntimeError(s.nextBatch[inputIdx]); err != nil {
                                                if s.Ctx.Err() == nil && s.cancelLocalInput[inputIdx] != nil {
-                                                       if errors.Is(err, context.Canceled) || errors.Is(err, cancelchecker.QueryCanceledError) {
+                                                       if strings.Contains(err.Error(), context.Canceled.Error()) ||
+                                                               errors.Is(err, cancelchecker.QueryCanceledError) ||
+                                                               strings.Contains(err.Error(), "routing information detected to be stale") {
                                                                // The input context has been canceled, yet the
                                                                // main context of the synchronizer has not.
                                                                // This indicates that the synchronizer has

appears to make the roachtest pass, but it seems wrong: keying off error strings is brittle.

@andreimatei @nvanbenschoten do you think it's possible that canceling the context passed into DistSender.Send would surface as a `routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store` error?

cockroach-teamcity commented 3 years ago

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 15b773c71f92d643795e34c922717fde0447f9cd:

          | I210901 15:40:29.912244 1 workload/cli/run.go:367  [-] 1  DEPRECATION: the --init flag on "workload run" will no longer be supported after 19.2
          | I210901 15:40:29.937446 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
          | I210901 15:40:38.492532 629 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
          | I210901 15:40:38.493199 1 workload/cli/run.go:408  [-] 4  creating load generator...
          | I210901 15:40:38.599071 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 105.87075ms)
          | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r468: replica (n4,s4):2 not lease holder; current lease is repl=(n3,s3):1 seq=6 start=1630510833.262891633,0 epo=1 pro=1630510833.262821066,0 (SQLSTATE XXUUU)
          | Error: COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 5. Command with error:
          |   | ``````
          |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
          |
          | stdout:
        Wraps: (4) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

    monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6309
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:208
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1371
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

/cc @cockroachdb/kv-triage


cockroach-teamcity commented 3 years ago

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 63679e7bf4a0244c6e46381751543cfe3f32388c:

        Wraps: (3) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-3397107-1630563711-105-n5cpu96:5 -- ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4} returned
          | stderr:
          | <... some data truncated by circular buffer; go to artifacts for details ...>
          | 03895356 1545360114428396893 -7630802982645163125 -2412649880522497706 8276436791751551110 4099852041501315882 2812033925923526952 1517561503865701602 -4533764583917340454 1836609674033135535 8526496242231401604 -1464482024900729750 -5432907256977240139 3444062068925385278 -7553687596067195979 -5154957861104497469 -8145909536855229858 6382039283835351115 7755137023313500341 -5154957861104497469 -7131410869547337337 -947840536788666113 2988760769632002452 6062625562427443069 -881568112698653220 -595550869735487951 1669611878164363049 298224725687258560 2188771068334664166 -5643663040516065176 -4395082247018327853 2870217450361056801 -1642827806443583899 8501404333473484749 5802232467739737955 8920800402398472689 -8820259446385365140 8764114147014623955 7754487918947368412 4872065639272858038 245530115335869554 5621535123189572439 5137096420127318629 4214118409546051831 -6103616253542949206 4628167778535422295 -1133885607901936952 -6707684978920974030 199724084011602118 -1827335022341300354 4451698727085763375 3566523467080489816 2745634865287590165] err:ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r202: replica (n4,s4):3 not lease holder; current lease is repl=(n3,s3):1 seq=5 start=1630581536.763571326,0 epo=1 pro=1630581536.763506560,0 (SQLSTATE XXUUU) sql:SELECT k, v FROM kv WHERE k IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64)]
          | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r202: replica (n4,s4):3 not lease holder; current lease is repl=(n3,s3):1 seq=5 start=1630581536.763571326,0 epo=1 pro=1630581536.763506560,0 (SQLSTATE XXUUU)
          | I210902 11:19:03.915537 2148 workload/pgx_helpers.go:56  [-] 8  pgx logger [error]: Exec logParams=map[args:[-5853180611850766502 59 -7765100801823915646 25 -8802330422683735723 f0 5161097972302237638 2e -2035473387894707132 05 1655478108335104027 68 911370292152043281 91 -1732095736658636308 4a -6586961172605227728 3c 3415554670425598422 9d 2490305939174574069 f8 -2450511966672637098 e0 3393394420843357275 b1 7078491635072219244 90 601035597432153623 9d 5288889078968335223 2d 4997858519702370783 a8 -691660313765282288 1e 5309550236682133527 10 3864205041824048719 d7 2842275010156681998 a1 7197730074923411234 36 385587719217043134 55 4042404868645457667 cd 1550398647615827456 26 4121908009600930617 5b 2711409721841277376 63 -739305065137995297 2b 1190424425760426655 7d -6980953042636530100 01 -2667095680534642464 38 7066772793885526327 c5 -480739607250257912 92 1254967685085072045 e5 6184364464174537139 12 7562216481288970326 9b -4994779198751440340 f1 -9204184949051530638 68 -577024995478486456 15 -7681647526651388611 e4 -6466951744425967052 bb -968243186586744812 3d -2326575905928305489 80 7999742071977878713 ee 7005979760449346630 ad -5033100441271435147 73 3243868885518203898 fb -2990572347315502774 05 -1226872547766426702 87 5114951877303468066 5a -7399945300465034974 01 -1268481766337239058 8d 8313763722691697853 4c -7228396841580974621 46 -6962948462141163098 7b 1743812156550199583 4f 7426834511011350021 68 6256576770696463642 27 -5903182753813868330 59 6047593389339092490 c4 -7490151945715312265 a5 -3127521994623360715 95 4505518920786230984 6a 7751655262218580656 7b] err:timeout: context canceled sql:kv-2]
          | Error: COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 5. Command with error:
          |   | ``````
          |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
          |
          | stdout:
        Wraps: (4) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

    monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6309
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:208
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1371
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

/cc @cockroachdb/kv-triage


cockroach-teamcity commented 3 years ago

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 42e5f9492d0d8d93638241303bca984fe78baae3:

          | I210903 14:51:57.080951 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
          | I210903 14:52:05.094220 643 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
          | I210903 14:52:05.094951 1 workload/cli/run.go:408  [-] 4  creating load generator...
          | I210903 14:52:05.208300 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 113.347772ms)
          | I210903 14:52:05.395935 2299 workload/pgx_helpers.go:61  [-] 6  pgx logger [error]: Query logParams=map[args:[-8109847698309837236 5058209292518520260 -5530147469381035334 1747033619382914294 7017418739005512875 -2358014717237621985 -8302341603210547287 -2028468248873085622 -1895163067150467034 -8013597603978460706 8801482429265706073 6351514490197923581 -2637915090719830676 -1045026769944542557 -613598276713862968 7723899643837246137 2776783857895378806 3600706713457807552 4447144560849393602 -6512640214809618088 1888151937232397451 3841920450183009781 8051861797983870891 2079515034499768626 3610234603118268281 -481136611672676824 3262028307754469298 -2025140229427276246 8855883982716194614 -2630471628884419953 772833911832898420 6364432287464617598 5587612854538332705 1127145598566479692 7722713035770714410 -4216484208988152898 8925321881446340715 -1298631130249111985 2852204313533531139 8310639806364282907 1407488118307745822 -2084406318820195014 4204800570268765929 2788087087473709443 6959989646779182367 5495351232278076469 6841592627956384857 -4759154306198077991 -7663701253196757745 6186919594556665391 -3949180708444757075 -906541736525164864 1145222059585816693 -5694780525921356753 -6416118184453982139 -1240723466460358343 -4987063972369326684 -1865991297969911880 -5072606873133083385 4743593246572236074 -6572867610188178040 -5039019332693993983 5295881919699410254 -8721997438799952762] err:ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r448: replica (n3,s3):1 not lease holder; current lease is repl=(n2,s2):3 seq=7 start=1630680720.327323796,0 epo=1 pro=1630680720.327256939,0 (SQLSTATE XXUUU) sql:SELECT k, v FROM kv WHERE k IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64)]
          | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r448: replica (n3,s3):1 not lease holder; current lease is repl=(n2,s2):3 seq=7 start=1630680720.327323796,0 epo=1 pro=1630680720.327256939,0 (SQLSTATE XXUUU)
          | Error: COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 5. Command with error:
          |   | ``````
          |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
          |
          | stdout:
        Wraps: (4) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

    monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6309
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:208
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1371
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

/cc @cockroachdb/kv-triage


cockroach-teamcity commented 3 years ago

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 44c95054cfe0108df872d2cda7952fc92259426b:

        Wraps: (3) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-3405042-1630737365-89-n5cpu96:5 -- ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4} returned
          | stderr:
          | <... some data truncated by circular buffer; go to artifacts for details ...>
          | 733820865 4046944035560929828 6070908395623657654 2942196773074556590 8773041021791496999 -5493691655833192555 5962480598186018088 1102668627022765224 4292575886556322331 377068516752503928 -3272524856573513081 3223092315987265925 1102668627022765224 6802633057167832302 8925806572076168886 3502653446166586085 -4034917596375771333 8396930428156549772 1620364530033403653 -2159962942925931739 1227531418922043807 6544273765834556728 8546687735915910491 -7209343669337677990 -630465615343209823 5089803903920779410 -4979155856875104342 -5632183279726515516 -2835301272358696229 -5787296814828132664 -8649381443682749401 2375599334895466311 -273830766452389596 -5498239006426617644 -7242230885120580315 -889002919094793525 3431085442219188202 5889335988326965496 -3420969055647742775 -963557543536388234 6642769424582831316 -1743817264288016029 7982584732665993784 -1009326739606503453 2066883589368775255 -7525823887027989948 -8497325453888236975 102327407294776684 7628499505306977905 5257282432382128057 7194227082981248111 -3954668184493435737 3279825110980478220] err:ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r111: replica (n1,s1):3 not lease holder; current lease is repl=(n2,s2):2 seq=4 start=1630765375.452377755,0 epo=1 pro=1630765375.452325159,0 (SQLSTATE XXUUU) sql:SELECT k, v FROM kv WHERE k IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64)]
          | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r111: replica (n1,s1):3 not lease holder; current lease is repl=(n2,s2):2 seq=4 start=1630765375.452377755,0 epo=1 pro=1630765375.452325159,0 (SQLSTATE XXUUU)
          | I210904 14:23:02.637708 2371 workload/pgx_helpers.go:61  [-] 7  pgx logger [error]: Exec logParams=map[args:[2561073701416371627 90 4394430385590768920 d1 2038786149488949527 4a 7444685391567129595 ba 5731955079637502568 d5 2987128428354613825 53 -6853836693695608360 10 -1755062787442207086 8f -7174753624286148914 b2 -3416362768297508362 e7 4885831952695097172 71 -7662193081186707917 c9 4232917050777471882 c2 6402228316180964499 da -6340660242868254583 9a -1471260237201352016 c6 -566488498119967953 c1 2381569473086943737 8a -3443352120482987687 1b -5082079551409648511 e0 467673683334578805 c6 8182266221166879202 f1 -2170282077710179897 e2 2219193035334649005 75 -877878578583281887 ef 5497888786842805695 cd 6111857553154093413 9c 6832231271761834081 0e -6273144411940669536 c7 6221935957347956465 4e 7838440216712476111 27 -2854059193622990065 b4 -745929879395053740 85 8792555902456196405 cc -441765554037163994 2f -7592572776128115763 08 9175574604365100051 c8 7166709946407824376 a3 -6597833247781245289 75 -1074705031797818057 5d -795925976360375207 7a -4833238135999364466 9a -2650526187185895540 7c -7211202963562638898 69 5715208387658537354 d4 7724273414095420673 25 -45005500217657165 dd 2806986467104644461 64 -8329214110808503016 04 3494341979509874429 a0 -7740402916908212866 40 -7883403306776991654 76 3221750883529035384 02 31496448019748172 2b 3120776656738584027 7f 2973204647534974588 1f 3087193455591742131 95 -4109659345614502861 a7 8246912348201669694 26 4853315007547851290 51 4508258004581146854 2d 4369420957432840006 1c -3689250316088258980 30 3752429975688659990 b7] err:timeout: context canceled sql:kv-2]
          | Error: COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 5. Command with error:
          |   | ``````
          |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
          |
          | stdout:
        Wraps: (4) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

    monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6309
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:208
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1371
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

/cc @cockroachdb/kv-triage


cockroach-teamcity commented 3 years ago

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ e369d86b3bac2b9f40d760c3e6e49f55b7e15abe:

          | I210907 14:45:00.144369 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
          | I210907 14:45:08.235040 660 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
          | I210907 14:45:08.235732 1 workload/cli/run.go:408  [-] 4  creating load generator...
          | I210907 14:45:08.326980 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 91.246993ms)
          | I210907 14:45:08.607567 2110 workload/pgx_helpers.go:61  [-] 6  pgx logger [error]: Query logParams=map[args:[-8343173776155126556 -914322950454871517 -685836765764345716 -7209343669337677990 2066883589368775255 3048245980912135216 -8319071614984614704 532981470037740769 5574232536423016768 -5090769401851757963 5313218355915302311 -6639483040728201979 -1009326739606503453 -5369387090647377696 2901961959851777777 3788960330020833741 -2018691486369830587 6211422866502778167 1177158258470849245 -5933828244096768983 -2275635875191238122 2283197493219629322 4127383475252594038 2966967207177351361 -4216484208988152898 7579154030635040905 7581892944513666311 -2490905978748825596 5451347624681530517 -8010102184843434886 -5363918375503941648 -8349707963029555597 -4407905743406431893 3752934507509607716 919082864081731136 -8056551791411552895 -5970401121588204437 2140940003149428359 2501556887624921142 -4236990650138812717 -5503333907021059451 6158057181624196422 4741184920310447324 -1623807323551530983 -4928511084171520986 -5365126547866111147 -5054786067318612342 -728903186720601223 -1789585662423979344 -2592815725797093114 3819459575849458158 -6912569259695542572 9202532695327585148 9112723449915371134 7126266934615051019 -8617536109860315868 -7095912189861582381 1749632190011704816 -2387497166026445323 2680711358950630419 2486984917766768371 -693198230315013616 2521535196169704509 -3160639275118474288] err:ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r366: replica (n2,s2):2 not lease holder; current lease is repl=(n4,s4):3 seq=5 start=1631025902.759767859,0 epo=1 pro=1631025902.759711679,0 (SQLSTATE XXUUU) sql:SELECT k, v FROM kv WHERE k IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64)]
          | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r366: replica (n2,s2):2 not lease holder; current lease is repl=(n4,s4):3 seq=5 start=1631025902.759767859,0 epo=1 pro=1631025902.759711679,0 (SQLSTATE XXUUU)
          | Error: COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 5. Command with error:
          |   | ``````
          |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
          |
          | stdout:
        Wraps: (4) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

    monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6309
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:208
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1371
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

/cc @cockroachdb/kv-triage


cockroach-teamcity commented 3 years ago

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 642e44afbe6098f25022618be76c6f7b6b97df45:

          | I210908 11:55:15.909071 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
          | I210908 11:55:23.816193 659 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
          | I210908 11:55:23.816918 1 workload/cli/run.go:408  [-] 4  creating load generator...
          | I210908 11:55:23.901769 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 84.850564ms)
          | I210908 11:55:24.082719 2086 workload/pgx_helpers.go:61  [-] 6  pgx logger [error]: Query logParams=map[args:[5576443193026923825 6299206555795082188 9157166580729214064 8646777393617720288 -5724883969894799675 -6897578880386467823 -6606375318306027013 -485870624885464806 7917266868945422353 5732879094913020916 -2190717052833672554 3011082493187145961 -3599523743411521019 -3219693044233996185 4298256374823820196 -6235906329266157420 -7681599993107674829 -6797145899829604906 1188031473152778240 5855548288912887416 2117934561670787669 -4742679533635683975 7128934072758675299 3023527810157456407 -6656763920844384398 -8729100882102463148 -6254855190543380724 -7112966269537054025 3476377908817320898 -2427793111266264477 6556128952054453056 726051506204224501 6575645057459341882 2442388724663567514 2781390435253487427 -8351262565230795732 -5749086669974027420 8249328495278473348 4798294684824285396 8378602000272484849 -4850957848834304902 7855571730243500859 -8057288572247355917 -344049463124163735 3726332353230092189 3053229696471589359 -2991827891992307820 807688452188690152 -4821760302956611142 -7517243810638472637 3921778827010504274 5593207286469826406 -4407905743406431893 -211894188292247813 -9086973824871479595 -1080491562993894710 -7380762123636204811 -6431087920246951008 6590873196518367421 -6815829807047622788 5460400427895041357 -2485856532151755454 -4150653389614699183 1764642074488156281] err:ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r103: replica (n3,s3):1 not lease holder; current lease is repl=(n4,s4):2 seq=5 start=1631102117.432882067,0 epo=1 pro=1631102117.432805916,0 (SQLSTATE XXUUU) sql:SELECT k, v FROM kv WHERE k IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64)]
          | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r103: replica (n3,s3):1 not lease holder; current lease is repl=(n4,s4):2 seq=5 start=1631102117.432882067,0 epo=1 pro=1631102117.432805916,0 (SQLSTATE XXUUU)
          | Error: COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 5. Command with error:
          |   | ``````
          |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
          |
          | stdout:
        Wraps: (4) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

    monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6309
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:208
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1371
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

/cc @cockroachdb/kv-triage


andreimatei commented 3 years ago

I think I see the problem. The way one of the DistSender's replica-try loops is written, it's possible that, as a result of a ctx cancellation, we return the error we got when trying a previous replica instead of returning the ctx error. Which is not good. I'll improve that.
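
To make that concrete, here's a minimal sketch of the fix (all names are illustrative; this is not the actual DistSender code): the replica-try loop checks its ctx on every iteration and returns the ctx error instead of whatever error the previous replica attempt left behind.

```go
package kvsketch

import "context"

// request and response stand in for a KV BatchRequest/BatchResponse; all
// names in this sketch are illustrative, not the real DistSender API.
type request struct{}
type response struct{}

// replica abstracts a single replica that a request can be sent to.
type replica interface {
	send(ctx context.Context, ba request) (response, error)
}

// sendToReplicas tries each replica in turn. The fix: check the ctx on
// every iteration and surface ctx.Err() instead of the (possibly stale)
// error left over from a previous replica attempt.
func sendToReplicas(ctx context.Context, replicas []replica, ba request) (response, error) {
	var lastErr error
	for _, r := range replicas {
		if err := ctx.Err(); err != nil {
			// We were canceled; don't leak lastErr (e.g. a stale
			// NotLeaseHolderError) to the caller.
			return response{}, err
		}
		resp, err := r.send(ctx, ba)
		if err == nil {
			return resp, nil
		}
		lastErr = err
	}
	return response{}, lastErr
}
```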

But, still, let's discuss the downstream effects, because I'm not sure it's a good idea for people to rely on a particular contract from the DistSender about the error returned in case of ctx cancellation.

My current guess is that, because we cancel the context of DistSender.Send once we no longer need the results of that scan, the context cancellation is surfaced back up as these different KV-layer errors, but we swallow only the context cancellations at the parallel unordered synchronizer level. As a result, we incorrectly propagate these errors as query errors when we should have swallowed them.

I've read the code, but can you distill these lines more for me, please? Looks like the synchronizer is trying to figure out if it's draining and, if it is, whether it should swallow an error. First of all, it seems to me that it'd be better if the synchronizer looked at its own state to see if it's draining instead of inferring it from the errors. No?

Second, I'd like to understand the decision about swallowing errors. The thing we don't want, I think, is for LeafTxnFinalStates to be missed because some stream coming into the synchronizer dropped but we swallowed the respective error. So, how come swallowing context.Canceled is not dangerous? Can it not hide the fact that we're missing some metadata?

yuzefovich commented 3 years ago

Looks like the synchronizer is trying to figure out if it's draining and, if it is, whether it should swallow an error.

Yes, that's essentially what we're trying to do.

First of all, it seems to me that it'd be better if the synchronizer looked at its own state to see if it's draining instead of inferring it from the errors. No?

Yeah, we could do that, but that wouldn't be sufficient for our purposes (note that we do assert that the synchronizer is in the draining state in crdb_test builds).

The goal of that code is as follows: we have encountered an error, and the question is - did this error occur precisely because we eagerly canceled the work since the synchronizer moved into the draining state? If that's the case, we want to swallow the error so that it doesn't propagate out.

So checking that the state is, in fact, draining is a good point (I'll make that change), but we also want to make sure not to swallow errors that occurred legitimately (i.e. not because of the eager cancellation triggered in synchronizer.DrainMeta). We need to somehow determine whether the error occurred because of us or because of other "natural" causes, and I don't know of any way other than comparing the error against context.Canceled (as well as making sure that the synchronizer's main context hasn't been canceled).
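
A hedged sketch of that check (the type and fields are stand-ins, not the real colexec synchronizer): swallow only if we're draining, the error looks like a cancellation, and our own context is still live - which together imply the cancellation was the eager one triggered in DrainMeta.

```go
package colexecsketch

import (
	"context"
	"errors"
)

// syncSketch stands in for the parallel unordered synchronizer; both
// fields are assumptions made for illustration.
type syncSketch struct {
	ctx      context.Context // the synchronizer's main context
	draining bool
}

// shouldSwallow reports whether an error from an input goroutine can be
// dropped: we must be draining, the error must look like a cancellation,
// and our own context must still be live - otherwise the whole query was
// canceled and the error has to propagate.
func (s *syncSketch) shouldSwallow(err error) bool {
	return s.draining &&
		errors.Is(err, context.Canceled) &&
		s.ctx.Err() == nil
}
```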

Second, I'd like to understand the decision about swallowing errors. The thing we don't want, I think, is for LeafTxnFinalStates to be missed because some stream coming into the synchronizer dropped but we swallowed the respective error. So, how come swallowing context.Canceled is not dangerous? Can it not hide the fact that we're missing some metadata?

I believe we are covered on this front. We use this eager context cancellation only for fully local plans (i.e. there are no inboxes, so the eager cancellation won't break any gRPC streams). Next, because we propagate metadata differently in the vectorized engine (namely, all metadata is buffered and returned in DrainMeta calls), there is no metadata flowing during the Operator.Next loop - which is what the cancellation just interrupted - and we let each of the synchronizer's inputs proceed to fully drain its metadata sources. All LeafTxnFinalState metadata will be created in DrainMeta calls after each input has been interrupted.
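
Sketched under assumed names (this is not the real Operator API), the draining flow amounts to:

```go
package colexecsketch

// metadata stands in for execinfrapb.ProducerMetadata (an assumed name
// for this sketch).
type metadata struct{}

// inputSketch models a synchronizer input: its Next loop may have been
// interrupted by the eager cancellation, but DrainMeta still runs and
// returns all buffered metadata, including LeafTxnFinalState.
type inputSketch interface {
	DrainMeta() []metadata
}

// drainAll mirrors the flow described above: after the inputs have been
// interrupted, every one of them is still drained for its metadata, so
// nothing is lost.
func drainAll(inputs []inputSketch) []metadata {
	var all []metadata
	for _, in := range inputs {
		all = append(all, in.DrainMeta()...)
	}
	return all
}
```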

Does this make sense?

I know that this error/string matching is not great, and the contract that no internal component may convert a context cancellation into an error that doesn't contain "context canceled" when stringified is fragile, but it seems to work as is. Do you have any other suggestions for swallowing errors that occurred because we canceled the in-flight work? We will need something similar for the Streamer too, I think.
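
For illustration, the fragile contract reduces to a check along these lines (a sketch, not CockroachDB code):

```go
package colexecsketch

import (
	"context"
	"errors"
	"strings"
)

// looksLikeCancellation is the kind of check that fragile contract
// forces: errors.Is only works while the cause chain is preserved, so a
// string match is the fallback for components that flatten the error.
// Any component that rewraps a cancellation so that it fails both
// branches would defeat the swallowing.
func looksLikeCancellation(err error) bool {
	return errors.Is(err, context.Canceled) ||
		strings.Contains(err.Error(), context.Canceled.Error())
}
```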

cockroach-teamcity commented 3 years ago

roachtest.kv50/enc=false/nodes=4/cpu=96/batch=64 failed with artifacts on master @ 2a7ef521a38d923cca9ec96e229237bbedc0c5fa:

          | I210909 13:44:16.178646 1 workload/workloadsql/workloadsql.go:113  [-] 2  starting 1000 splits
          | I210909 13:44:26.612026 665 workload/workloadsql/workloadsql.go:179  [-] 3  finished 1000 of 1000 splits
          | I210909 13:44:26.612695 1 workload/cli/run.go:408  [-] 4  creating load generator...
          | I210909 13:44:26.709769 1 workload/cli/run.go:439  [-] 5  creating load generator... done (took 97.073706ms)
          | I210909 13:44:26.871944 2273 workload/pgx_helpers.go:61  [-] 6  pgx logger [error]: Query logParams=map[args:[8526496242231401604 -3058456452899618217 5987464262667976989 785099293111823248 -8627589847984975831 -5400561583652668213 4443526637927806162 -2912297249704132099 -8941632537087355961 -4707413667279690717 6013468837370444483 -5379950929036845753 3003351574143778344 -9011218919309636232 -5285403095308748438 8910395588515067068 -1421512277127238438 3171725451732318818 -4546977850957315192 -3058456452899618217 6310427116003878183 4635535663943891359 -6502680595284108922 -640991889626537057 -6112247179687866663 6088709996233530012 -4349124448151216062 -3527105843057944707 8173480481670244661 570038588800755927 -7209343669337677990 456616589373783948 -8351262565230795732 3732630649989287647 2246318694329747673 8925806572076168886 -1922850102435472877 8925806572076168886 6841592627956384857 5889483069973162321 4478849822763169017 7481020352057007754 6979143120838471428 -2601042773779341766 -6249445251639448483 1464797858451214926 5855755066758936012 -4940151983143948250 7245995482278240994 -2159962942925931739 4902197643031388349 9141638872723910047 -2044805944418399938 5853035500677098133 -1405569601905406217 7579154030635040905 3160077491502345367 -4742679533635683975 -170273184759293082 1081876405683310102 4282119944930224008 -9066437006354549147 3509946820652962650 1459761182510575201] err:ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r110: replica (n1,s1):4 not lease holder; current lease is repl=(n3,s3):2 seq=2 start=1631195057.308897366,0 epo=1 pro=1631195057.308851404,0 (SQLSTATE XXUUU) sql:SELECT k, v FROM kv WHERE k IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64)]
          | Error: ERROR: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] lease held by different store; r110: replica (n1,s1):4 not lease holder; current lease is repl=(n3,s3):2 seq=2 start=1631195057.308897366,0 epo=1 pro=1631195057.308851404,0 (SQLSTATE XXUUU)
          | Error: COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 5. Command with error:
          |   | ``````
          |   | ./workload run kv --init --histograms=perf/stats.json --concurrency=256 --splits=1000 --duration=10m0s --read-percent=50 --batch=64 {pgurl:1-4}
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
          |
          | stdout:
        Wraps: (4) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

    monitor.go:128,kv.go:137,kv.go:246,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:137
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKV.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:246
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6309
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:208
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1371
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

/cc @cockroachdb/kv-triage


andreimatei commented 3 years ago

Second, I'd like to understand the decision about swallowing errors. The thing we don't want, I think, is for LeafTxnFinalStates to be missed because some stream coming into the synchronizer dropped but we swallowed the respective error. So, how come swallowing context.Canceled is not dangerous? Can it not hide the fact that we're missing some metadata?

I believe we are covered on this front. We use this eager context cancellation only for fully local plans (i.e. there are no inboxes, so the eager cancellation won't break any gRPC streams). Next, because we propagate metadata differently in the vectorized engine (namely, all metadata is buffered and returned in DrainMeta calls), there is no metadata flowing during the Operator.Next loop - which is what the cancellation just interrupted - and we let each of the synchronizer's inputs proceed to fully drain its metadata sources. All LeafTxnFinalState metadata will be created in DrainMeta calls after each input has been interrupted.

Does this make sense?

It makes some sense. I was thinking that some distinction between "fully local plans" and the more general case is probably part of the answer. Still:

but we also want to make sure not to swallow errors that occurred legitimately (i.e. not because of the eager cancellation triggered in synchronizer.DrainMeta). We need to somehow determine whether the error occurred because of us or because of other "natural" causes, and I don't know of any way other than comparing the error against context.Canceled (as well as making sure that the synchronizer's main context hasn't been canceled).

I wonder why exactly this is - why do we need a distinction between errors caused by the cancellation and other errors? It seems to me that, as long as we can get all the metadata, any error received after the synchronizer started draining can be ignored. And if everything is local, we can always get the metadata. Right? What helps here is that DistSQL only deals with reads. For writes, things are more complicated: the client state needs to be kept in sync with the distributed transaction state, and an interrupted DistSender RPC can break that. In particular, if a request is writing intents, the client needs to know the highest timestamp at which an intent was written. That's why the TxnCoordSender distinguishes two categories of errors - errors for which this information is known, and errors for which it isn't. The former are "recoverable", i.e. the txn can still be used afterwards (generally, to roll back to a savepoint). But, again, for reads I think we don't need to worry about that; all errors are recoverable.

This is all in theory, though. In practice, the TxnCoordSender always blocks further requests after most errors (including ctx cancelled): https://github.com/cockroachdb/cockroach/blob/b3a9b702bc9825331aac5848dcb64d525a30827a/pkg/kv/kvclient/kvcoord/txn_coord_sender.go#L873-L882

So, I guess my question is - how does your swallowing of ctx cancelled errors in the synchronizer even work given that the TxnCoordSender remembers the error state? Do we perhaps only do it when using a LeafTxn?

yuzefovich commented 3 years ago

I wonder why exactly this is - why do we need a distinction between errors caused by the cancellation and other errors? It seems to me that, as long as we can get all the metadata, any error received after the synchronizer started draining can be ignored.

I was also thinking about this when I originally added this code, and I came up with the following scenario: imagine we have TABLE t (c INT) with rows 0 and 1, and that we're performing SELECT 1 / c FROM t LIMIT 1 with the parallel unordered sync over two concurrent TableReaders. If row 1 is read first, the limit is satisfied, so the sync transitions into draining, and the question is: do we want to swallow the "division by zero" error that might occur if we read row 0 and perform the division? My thinking was that it is "safer" to not swallow the error and to propagate it back to the client. I do, however, see your point that if the synchronizer is draining, then any work can be abandoned because the caller of the synchronizer is only interested in the metadata from now on.

Another scenario is when we have a "race" between an input goroutine doing work and the synchronizer transitioning into the draining state. Imagine that the input goroutine has performed some work and encountered an error that should be returned to the client, but before colexecerror.InternalError is executed, the goroutine is put to sleep. Now the main synchronizer goroutine wakes up and transitions into the draining state, and the question is again the same: do we want to swallow that error or not?

And if everything is local, we can always get the metadata. Right?

Yep, that's right.

Now that we're discussing this, your suggestion of swallowing all errors once the synchronizer has transitioned into draining and performed the context cancellation of its inputs seems like it might actually be a safer choice - it would allow us not to require any invariants from the internal components about how they handle context cancellation errors. I guess the scenarios I mentioned above could go either way anyway.
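
Under that simpler policy, the earlier sketch collapses to checking the synchronizer's own state, with the error never inspected (same illustrative naming as before):

```go
package colexecsketch

// syncSketch2 extends the earlier sketch with a flag recording that the
// synchronizer has already eagerly canceled its inputs (again an
// illustrative field, not the real implementation).
type syncSketch2 struct {
	draining       bool
	inputsCanceled bool
}

// shouldSwallowSimple never inspects the error: once we are draining and
// have canceled our inputs, any error an input subsequently returns is
// dropped, removing the need for any invariant on how internal
// components wrap cancellation errors.
func (s *syncSketch2) shouldSwallowSimple(err error) bool {
	_ = err // deliberately ignored under this policy
	return s.draining && s.inputsCanceled
}
```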

So, I guess my question is - how does your swallowing of ctx cancelled errors in the synchronizer even work given that the TxnCoordSender remembers the error state? Do we perhaps only do it when using a LeafTxn?

We are using this functionality of eager cancellation only in a very limited set of circumstances - the query doesn't have any mutations, is executed locally, and meets a few other requirements - so we aren't concerned about writes. And we do use a LeafTxn if we plan multiple parallel processors for a single stage of the physical plan.

andreimatei commented 3 years ago

In my opinion, swallowing errors is fine for both scenarios you describe. When an error races with a drain, I think either outcome is fine. In the SELECT 1 / c FROM t LIMIT 1 case, receiving or not receiving the error is non-deterministic because the ordering of the results is non-deterministic; if you want determinism, you gotta specify an ordering. In the 2nd example, infrastructure errors are non-deterministic by nature. The cancellation race perhaps makes them slightly more non-deterministic, but they were non-deterministic to begin with.

We are using this functionality of eager cancellation only in a very limited set of circumstances - the query doesn't have any mutations, is executed locally, and meets a few other requirements - so we aren't concerned about writes. And we do use a LeafTxn if we plan multiple parallel processors for a single stage of the physical plan.

So we only use a synchronizer if we have parallelism, and if we have parallelism, we're using a LeafTxn for all parallel branches? We don't use the root txn on any of the branches? Cause if we were, and it was the request on the root that was getting canceled, you wouldn't be able to swallow that error (unfortunately). Figuring out the root txn's behavior after an error is a can of worms that we keep kicking down the road; the current story is not very good.

Now that we're discussing this, your suggestion of swallowing all errors once the synchronizer transitioned into draining and performed the context cancellation of the inputs seems like it might be actually a safer choice - this will allow us to not require any invariants from the internal components about how they handle the context cancellation errors. I guess the scenarios I mentioned above could go either way anyway.

Yeah; if the root is not in play here, then this is what I'd do. I'm looking into improving things in the TxnCoordSender, because it's a reasonable request to have the ctx cancellation be on the returned error chain as much as possible, but I don't think I'll be able to make any guarantees about it. There are at least two problems:

  1. If the ctx cancellation interrupts the evaluation of a request, then that request will return some error (which hopefully Is() a ctx canceled error, but who knows). I'm not going to override that error.
  2. If the DistSender has gotten an ambiguous error in the past (i.e. it might have committed the transaction), and the ctx cancellation comes while the DistSender is trying to resolve the ambiguity of that error, I want to return the ambiguous error, not override it. There's a related PR #65714 that flirts with making AmbiguousResultError implement causer so that we can put the ctx cancellation in as a cause, but even that doesn't cover everything. This 2nd case probably doesn't affect you, since you're not committing anything, but all of this is to say that I wouldn't rely on the DistSender always identifying the ctx cancellation; a caller-side pattern is sketched below.
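
Given those caveats, a defensive caller-side pattern (a sketch under assumed names, not a prescribed API) is to consult your own context rather than the shape of the returned error:

```go
package kvsketch

import "context"

// sendFn stands in for any KV-layer call whose returned error may or may
// not identify a context cancellation (per the caveats above).
type sendFn func(ctx context.Context) error

// callCanceledAware is the caller-side pattern: if our own ctx is
// canceled, treat whatever error came back as a byproduct of that
// cancellation, instead of trusting the lower layers to surface a
// context.Canceled-shaped error.
func callCanceledAware(ctx context.Context, send sendFn) error {
	err := send(ctx)
	if err != nil && ctx.Err() != nil {
		return ctx.Err()
	}
	return err
}
```
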
yuzefovich commented 3 years ago

So we only use a synchronizer if we have parallelism, and if we have parallelism, we're using a LeafTxn for all parallel branches? We don't use the root txn on any of the branches?

Yes, we will use only the LeafTxn for the whole flow if the flow has any concurrency (which we can determine during physical planning): https://github.com/cockroachdb/cockroach/blob/db49ac03bfd6f1f928f3e49c3e39a0d65a49486f/pkg/sql/distsql/server.go#L307-L313

Thanks for the context; it makes sense. I think #69961, without any other changes, is sufficient to fix this issue.