cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: tpcc/mixed-headroom/n5cpu16 failed [OOM during import while running 21.2] #74892

Closed cockroach-teamcity closed 1 year ago

cockroach-teamcity commented 2 years ago

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 78419450178335b31f542bd1b14fefdf4ecee0e8:

          |  1169.0s        0            0.0            9.4      0.0      0.0      0.0      0.0 stockLevel
          |  1170.0s        0            0.0            9.3      0.0      0.0      0.0      0.0 delivery
          |  1170.0s        0           42.0           95.1  31138.5 103079.2 103079.2 103079.2 newOrder
          |  1170.0s        0            6.0            9.4  21474.8 103079.2 103079.2 103079.2 orderStatus
          |  1170.0s        0           48.1           93.7  30064.8 103079.2 103079.2 103079.2 payment
          |  1170.0s        0            6.0            9.4     83.9 103079.2 103079.2 103079.2 stockLevel
          |  1171.0s        0            5.0            9.3  64424.5 103079.2 103079.2 103079.2 delivery
          |  1171.0s        0           44.9           95.1  36507.2 103079.2 103079.2 103079.2 newOrder
          |  1171.0s        0            4.0            9.4  12884.9 103079.2 103079.2 103079.2 orderStatus
          |  1171.0s        0           45.9           93.7  36507.2 103079.2 103079.2 103079.2 payment
          |  1171.0s        0            5.0            9.3   6174.0  85899.3  85899.3  85899.3 stockLevel
        Wraps: (8) COMMAND_PROBLEM
        Wraps: (9) Node 5. Command with error:
          | ``````
          | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4}
          | ``````
        Wraps: (10) exit status 1
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError

    mixed_version_jobs.go:73,versionupgrade.go:208,tpcc.go:414,test_runner.go:780: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*backgroundStepper).wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/mixed_version_jobs.go:69
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:208
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:414
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:780
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6498
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:238
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-12308

cockroach-teamcity commented 2 years ago

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 5ad21e3896ee809e9c3ebc28bb22166f1275acca:

          |   882.0s        0            0.0            9.0      0.0      0.0      0.0      0.0 stockLevel
          |   883.0s        0            2.0            9.1  36507.2  38654.7  38654.7  38654.7 delivery
          |   883.0s        0           34.0           92.3  66572.0 103079.2 103079.2 103079.2 newOrder
          |   883.0s        0            3.0            9.1  38654.7 103079.2 103079.2 103079.2 orderStatus
          |   883.0s        0           32.0           90.6  42949.7 103079.2 103079.2 103079.2 payment
          |   883.0s        0            0.0            9.0      0.0      0.0      0.0      0.0 stockLevel
          |   884.0s        0            5.0            9.1 103079.2 103079.2 103079.2 103079.2 delivery
          |   884.0s        0           38.0           92.2  81604.4 103079.2 103079.2 103079.2 newOrder
          |   884.0s        0            5.0            9.1  36507.2 103079.2 103079.2 103079.2 orderStatus
          |   884.0s        0           49.0           90.6  42949.7 103079.2 103079.2 103079.2 payment
          |   884.0s        0            5.0            9.0 103079.2 103079.2 103079.2 103079.2 stockLevel
        Wraps: (8) COMMAND_PROBLEM
        Wraps: (9) Node 5. Command with error:
          | ``````
          | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4}
          | ``````
        Wraps: (10) exit status 1
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError

    mixed_version_jobs.go:73,versionupgrade.go:208,tpcc.go:414,test_runner.go:780: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*backgroundStepper).wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/mixed_version_jobs.go:69
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:208
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:414
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:780
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6498
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:238
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

cockroach-teamcity commented 2 years ago

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 4b41789120e019ab015e6dbb924df763897ebadb:

          |   960.0s        0            3.0           10.9  90194.3 103079.2 103079.2 103079.2 delivery
          |   960.0s        0           34.0          110.2  73014.4 103079.2 103079.2 103079.2 newOrder
          |   960.0s        0            2.0           10.9  45097.2  45097.2  45097.2  45097.2 orderStatus
          |   960.0s        0           35.0          108.0  73014.4 103079.2 103079.2 103079.2 payment
          |   960.0s        0            3.0           10.9   4831.8  90194.3  90194.3  90194.3 stockLevel
          | _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
          |   961.0s        0            1.0           10.9 103079.2 103079.2 103079.2 103079.2 delivery
          |   961.0s        0           37.0          110.1  90194.3 103079.2 103079.2 103079.2 newOrder
          |   961.0s        0            5.0           10.9  25769.8 103079.2 103079.2 103079.2 orderStatus
          |   961.0s        0           40.0          107.9  81604.4 103079.2 103079.2 103079.2 payment
          |   961.0s        0            1.0           10.9  40802.2  40802.2  40802.2  40802.2 stockLevel
        Wraps: (8) COMMAND_PROBLEM
        Wraps: (9) Node 5. Command with error:
          | ``````
          | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4}
          | ``````
        Wraps: (10) exit status 1
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError

    mixed_version_jobs.go:73,versionupgrade.go:208,tpcc.go:414,test_runner.go:780: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*backgroundStepper).wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/mixed_version_jobs.go:69
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:208
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:414
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:780
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6498
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:238
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

cockroach-teamcity commented 2 years ago

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 912964e02ddd951c77d4f71981ae18b3894e9084:

          |  1253.0s        0            4.0            8.8  98784.2 103079.2 103079.2 103079.2 stockLevel
          |  1254.0s        0            7.0            8.7  47244.6 103079.2 103079.2 103079.2 delivery
          |  1254.0s        0           36.0           89.5  77309.4 103079.2 103079.2 103079.2 newOrder
          |  1254.0s        0            3.0            8.9   2684.4  64424.5  64424.5  64424.5 orderStatus
          |  1254.0s        0           31.0           88.6  28991.0 103079.2 103079.2 103079.2 payment
          |  1254.0s        0            4.0            8.8   1140.9 103079.2 103079.2 103079.2 stockLevel
          |  1255.0s        0            5.0            8.7 103079.2 103079.2 103079.2 103079.2 delivery
          |  1255.0s        0           50.9           89.5  47244.6 103079.2 103079.2 103079.2 newOrder
          |  1255.0s        0            4.0            8.9  77309.4 103079.2 103079.2 103079.2 orderStatus
          |  1255.0s        0           40.0           88.6  49392.1 103079.2 103079.2 103079.2 payment
          |  1255.0s        0            3.0            8.8  45097.2 103079.2 103079.2 103079.2 stockLevel
        Wraps: (8) COMMAND_PROBLEM
        Wraps: (9) Node 5. Command with error:
          | ``````
          | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4}
          | ``````
        Wraps: (10) exit status 1
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError

    mixed_version_jobs.go:73,versionupgrade.go:208,tpcc.go:414,test_runner.go:780: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*backgroundStepper).wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/mixed_version_jobs.go:69
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:208
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:414
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:780
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6498
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:238
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

tbg commented 2 years ago

Error: error in newOrder: ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: intent on key /Table/166/1/669/0): "sql txn" meta={id=e741a752 key=/Table/168/1/669/8/0 pri=0.07912940 epo=17 ts=1642609332.165110310,2 min=1642609036.453450943,0 seq=20} lock=true stat=PENDING rts=1642609326.485867639,0 wto=false gul=1642609036.953450943,0 (SQLSTATE 40001)

It looks like a transaction retry error is somehow bubbling up to here: https://github.com/cockroachdb/cockroach/blob/79a4d4ad2295d6cf69083d93022d9cf49557c6fa/pkg/workload/tpcc/worker.go#L231-L234

tbg commented 2 years ago

The "last good" run before the failing streak is https://teamcity.cockroachdb.com/viewLog.html?buildId=4115910 (d6b99e92bf55b6f4a0d79800d67924e04d0b2a6d), and the first failure in the streak is 78419450178335b31f542bd1b14fefdf4ecee0e8.

$ git log  --no-merges 78419450178335b31f542bd1b14fefdf4ecee0e8 --not d6b99e92bf55b6f4a0d79800d67924e04d0b2a6d --oneline
ca66a18fa4 execinfrapb: remove ScanVisibility
b37e13d74f sql: clean up unnamed struct in scanColumnsConfig
00912544a5 sql: remove privilege checks at scanNode init time
9dc76f064a sql: remove index flags logic from scanNode
0845c8a2cb sql: simplify scanColumnsConfig
5ac83d9070 sql: add regression tests inserting decimals in scientific notation
48f2808616 sql: don't check column visibility when initializing scanNode
1770c214f9 sql: remove unused scanColumnsConfig field
3afbdb0f50 sql: implement ON CONFLICT ON CONSTRAINT
2490224168 colexechash: combine two conditionals into one in distinct mode
6998af348e colexechash: remove some dead code
0bb31ff1dc colexectestutils: increase test coverage by randomizing batch length
bb2fc51a42 colexechash: cleanup the previous commit
13b4e48afe colexechash: fix an internal error with distinct mode
74b6e343ac tree,parser: add support for ON CONFLICT ON CONSTRAINT
b3877b8775 cdc: Allow webhook sink to provide client certificates to the remote webhook server
afb8dbe096 streampb: delete `stream.pb.go`
5c3e798c08 bazel: upgrade `rules_go` to pull in new changes
785af465ac sql,server: add VIEWACTIVITYREDACTED role
9653dd13ce build: add <release branch> to nightly and latest tag values
6664d0c34d kv: circuit-break requests to unavailable replicas
ad59351e4b echotest: add testing helper
055a55f52c authors: add natelong to authors
19d12a63e7 roachtest: update 22.1 version map to v21.2.4
7577c4e6df cloud: bump orchestrator to v21.2.4

Starting 3x b3877b8775 here: https://teamcity.cockroachdb.com/viewLog.html?buildId=4163457&buildTypeId=Cockroach_Nightlies_RoachtestStress&tab=buildResultsDiv&branch_Cockroach_Nightlies=%3Cdefault%3E

If this passes, then it's likely a SQL/colexec change that's to blame for this change of behavior.
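
The narrowing step above is one probe of a bisection over the commit range. A minimal sketch of the full search follows, assuming the commits are ordered oldest to newest and that every commit after the culprit also fails. The commit names `c0`..`c7`, the culprit index, and the `isBad` predicate are hypothetical placeholders; with a flaky roachtest, `isBad` would have to run the test several times per commit before trusting a pass.

```go
package main

import (
	"fmt"
	"sort"
)

// firstBad returns the index of the first commit for which isBad
// reports true. It assumes commits are ordered oldest to newest and
// that every commit after the first bad one is also bad. It returns
// len(commits) if no commit is bad.
func firstBad(commits []string, isBad func(sha string) bool) int {
	return sort.Search(len(commits), func(i int) bool {
		return isBad(commits[i])
	})
}

func main() {
	// Placeholder commit names, oldest first (not real SHAs).
	commits := []string{"c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"}
	culprit := 5 // hypothetical: pretend c5 introduced the regression

	probes := 0
	isBad := func(sha string) bool {
		probes++ // each probe stands in for one (expensive) test run
		for i, c := range commits {
			if c == sha {
				return i >= culprit
			}
		}
		return false
	}

	idx := firstBad(commits, isBad)
	fmt.Println("first bad commit:", commits[idx], "probes:", probes)
}
```

Binary search keeps the number of test runs logarithmic in the number of candidate commits, which matters when each probe is a multi-hour nightly.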

cc @yuzefovich in case you have an immediate idea what could have changed in the propagation of txn retry errors.

cockroach-teamcity commented 2 years ago

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ da01e4c0545f191a0573e1d097ff0366769e0d6b:

          |  1376.0s        0            3.0            7.4 103079.2 103079.2 103079.2 103079.2 stockLevel
          |  1377.0s        0            5.0            7.4 103079.2 103079.2 103079.2 103079.2 delivery
          |  1377.0s        0           26.0           75.7 103079.2 103079.2 103079.2 103079.2 newOrder
          |  1377.0s        0            1.0            7.5  42949.7  42949.7  42949.7  42949.7 orderStatus
          |  1377.0s        0           18.0           74.4 103079.2 103079.2 103079.2 103079.2 payment
          |  1377.0s        0            6.0            7.4 103079.2 103079.2 103079.2 103079.2 stockLevel
          |  1378.0s        0            9.0            7.4 103079.2 103079.2 103079.2 103079.2 delivery
          |  1378.0s        0           19.0           75.7  81604.4 103079.2 103079.2 103079.2 newOrder
          |  1378.0s        0            2.0            7.5    159.4  45097.2  45097.2  45097.2 orderStatus
          |  1378.0s        0           25.9           74.4 103079.2 103079.2 103079.2 103079.2 payment
          |  1378.0s        0            6.0            7.4 103079.2 103079.2 103079.2 103079.2 stockLevel
        Wraps: (8) COMMAND_PROBLEM
        Wraps: (9) Node 5. Command with error:
          | ``````
          | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4}
          | ``````
        Wraps: (10) exit status 1
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError

    mixed_version_jobs.go:73,versionupgrade.go:208,tpcc.go:414,test_runner.go:780: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*backgroundStepper).wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/mixed_version_jobs.go:69
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:208
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:414
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:780
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6498
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:238
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

yuzefovich commented 2 years ago

I think it's most likely because of the streamer work (#68430), where we now use leaf txns to issue concurrent requests for index joins in some cases. Notably, I haven't yet implemented the transparent refresh mechanism there, so it's expected that the number of retryable errors increases because of that PR. I guess if we run `SET CLUSTER SETTING sql.distsql.use_streamer.enabled = false;`, these failures will go away.

tbg commented 2 years ago

Would you mind making that change? I think the streamer needs to be off by default if it can't properly propagate refresh errors. We're going to catch this in most workloads.

yuzefovich commented 2 years ago

Just to make sure I understand things correctly: generally speaking, propagating a txn retryable error to the client is acceptable because the app must have some kind of retry loop; however, in most of our roachtests we don't tolerate the retryable errors and treat them as a failure of the test. Does this sound right?
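
For reference, a minimal sketch of the kind of client-side retry loop being described, assuming the driver error exposes its SQLSTATE code. The `sqlStateError` type and the `runWithRetry` helper are hypothetical and are not taken from the workload or from any real driver.

```go
package main

import (
	"errors"
	"fmt"
)

// sqlStateError is a hypothetical stand-in for a driver error that
// carries a SQLSTATE code; it is not a type from any real driver.
type sqlStateError struct {
	code string
	msg  string
}

func (e *sqlStateError) Error() string {
	return fmt.Sprintf("%s (SQLSTATE %s)", e.msg, e.code)
}

// isRetryable reports whether err, or any error it wraps, carries
// SQLSTATE 40001 (serialization failure).
func isRetryable(err error) bool {
	var se *sqlStateError
	return errors.As(err, &se) && se.code == "40001"
}

// runWithRetry calls fn until it succeeds, fails with a non-retryable
// error, or has been attempted maxAttempts times.
func runWithRetry(maxAttempts int, fn func() error) (attempts int, err error) {
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		err = fn()
		if err == nil || !isRetryable(err) {
			return attempts, err
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	attempts, err := runWithRetry(5, func() error {
		calls++
		if calls < 3 {
			// Simulate two serialization failures before success.
			return &sqlStateError{code: "40001", msg: "restart transaction"}
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

A test harness that treats any surfaced 40001 as fatal is effectively running this loop with `maxAttempts` set to 1.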

tbg commented 2 years ago

The workload here handles retry errors (unless I'm misreading something about where the error occurs). I think what is happening here is that a retry error bubbles up as a regular error, i.e. it can't have had the proper type. Or at least that's what I think we're seeing? The error is returned from this method:

https://github.com/cockroachdb/cockroach/blob/1c66c9547ad95046846b7aba0d9e6d3f4e4fd97b/pkg/workload/tpcc/new_order.go#L133-L438

You can see by inspection that this implies that an error is returned from this block:

https://github.com/cockroachdb/cockroach/blob/1c66c9547ad95046846b7aba0d9e6d3f4e4fd97b/pkg/workload/tpcc/new_order.go#L215

and that will certainly do proper retries?

So my reading was that something in the code is doing some (probably less obviously wrong) version of:

err := something() // retry err
err = errors.Errorf("oops messing it up %s", err)
return err
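
A self-contained sketch of that failure mode, using only the Go standard library rather than the error package used in the repository (an assumption made to keep the example runnable on its own). The `retryError` type is a hypothetical stand-in for a typed retryable error.

```go
package main

import (
	"errors"
	"fmt"
)

// retryError is a hypothetical stand-in for a typed retryable error.
type retryError struct{}

func (*retryError) Error() string { return "retry txn" }

// flatten builds a new error from the message text only. The %s verb
// formats the error as a string, so the original type is lost.
func flatten(err error) error {
	return fmt.Errorf("oops messing it up %s", err)
}

// wrap adds context but keeps the original error in the chain (%w).
func wrap(err error) error {
	return fmt.Errorf("with context: %w", err)
}

// isRetry reports whether a *retryError is anywhere in err's chain.
func isRetry(err error) bool {
	var re *retryError
	return errors.As(err, &re)
}

func main() {
	err := error(&retryError{})
	fmt.Println(isRetry(flatten(err))) // false: type information is gone
	fmt.Println(isRetry(wrap(err)))    // true: original error is still reachable
}
```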

yuzefovich commented 2 years ago

Hm, I'm confused. The Streamer doesn't do anything with the errors other than calling GoError: https://github.com/cockroachdb/cockroach/blob/ebda0ecb4aa1fe47f1403635846e342a2cfbfa1b/pkg/kv/kvclient/kvstreamer/streamer.go#L933

No wrapping / error modification is done on the newly-introduced TxnKVStreamer either.


Trying to deconstruct the error message:

error in newOrder: ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn

"error in newOrder" comes from https://github.com/cockroachdb/cockroach/blob/79a4d4ad2295d6cf69083d93022d9cf49557c6fa/pkg/workload/tpcc/worker.go#L233

Then "ERROR" is likely because of pgerror.DefaultSeverity being set in https://github.com/cockroachdb/cockroach/blob/79a4d4ad2295d6cf69083d93022d9cf49557c6fa/pkg/sql/pgwire/pgerror/flatten.go#L44

Then "restart transaction" is https://github.com/cockroachdb/cockroach/blob/79a4d4ad2295d6cf69083d93022d9cf49557c6fa/pkg/sql/pgwire/pgerror/flatten.go#L87

Then "TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn" is probably https://github.com/cockroachdb/cockroach/blob/79a4d4ad2295d6cf69083d93022d9cf49557c6fa/pkg/kv/kvclient/kvcoord/txn_coord_sender.go#L791

Then because TransactionRetryWithProtoRefreshError implements pgerror.ClientVisibleRetryError, the error should have 40001 code which is then used to determine that the error is indeed retryable: https://github.com/cockroachdb/cockroach-go/blob/7a4e30224f1a484982a53f29cd65eebba4d40b92/crdb/tx.go#L192

tbg commented 2 years ago

It does say "(SQLSTATE 40001)" in the error from newOrder above. I think this really really means SQL "did everything right"? Flummoxed by what is going wrong here then.

yuzefovich commented 2 years ago

Yeah, that's what puzzles me too.

yuzefovich commented 2 years ago

I'll kick off this roachtest with the streamer disabled on #75257.

tbg commented 2 years ago

If we're looking for crackpot theories, could it be that we're getting the retry error on a BEGIN?

https://github.dev/cockroachdb/cockroach-go/blob/7a4e30224f1a484982a53f29cd65eebba4d40b92/crdb/tx.go#L158
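
To make the theory concrete: a sketch of how a retryable error could escape if the begin step ran outside the retried region. This is purely illustrative and is not cockroach-go's implementation; `executeTx` and `errRetry` are hypothetical names, and a real loop would also cap the number of attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// errRetry is a hypothetical sentinel standing in for a SQLSTATE 40001 error.
var errRetry = errors.New("restart transaction (SQLSTATE 40001)")

// executeTx is a hypothetical helper. begin runs once, outside the
// retry loop; only body is retried on errRetry.
func executeTx(begin func() error, body func() error) error {
	if err := begin(); err != nil {
		return err // a retryable error here escapes to the caller
	}
	for {
		err := body()
		if !errors.Is(err, errRetry) {
			return err // success (nil) or a non-retryable error
		}
	}
}

func main() {
	ok := func() error { return nil }
	failing := func() error { return errRetry }

	n := 0
	flaky := func() error {
		n++
		if n < 3 {
			return errRetry
		}
		return nil
	}

	// The body is retried until it succeeds.
	fmt.Println(executeTx(ok, flaky))
	// The same error from begin is returned without any retry.
	fmt.Println(executeTx(failing, ok))
}
```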

yuzefovich commented 2 years ago

Lol I hope not.

yuzefovich commented 2 years ago

Hm, all 5 builds failed. I think I kicked them off correctly (from the https://github.com/cockroachdb/cockroach/tree/disable-streamer branch), so maybe the streamer work isn't to blame after all.

tbg commented 2 years ago

That looks correct. Ugh, another bisection.

cockroach-teamcity commented 2 years ago

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 58ceac139a7e83052171121b28026a7366f16f7e:

          |  1024.0s        0            7.0            9.5  85899.3 103079.2 103079.2 103079.2 delivery
          |  1024.0s        0           31.0           96.0 103079.2 103079.2 103079.2 103079.2 newOrder
          |  1024.0s        0            6.0            9.4  85899.3 103079.2 103079.2 103079.2 orderStatus
          |  1024.0s        0           36.0           93.9  94489.3 103079.2 103079.2 103079.2 payment
          |  1024.0s        0            6.0            9.4  66572.0 103079.2 103079.2 103079.2 stockLevel
          | _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
          |  1025.0s        0            4.0            9.5  11274.3 103079.2 103079.2 103079.2 delivery
          |  1025.0s        0           33.0           96.0 103079.2 103079.2 103079.2 103079.2 newOrder
          |  1025.0s        0            3.0            9.4 103079.2 103079.2 103079.2 103079.2 orderStatus
          |  1025.0s        0           36.0           93.8 103079.2 103079.2 103079.2 103079.2 payment
          |  1025.0s        0            4.0            9.4  38654.7 103079.2 103079.2 103079.2 stockLevel
        Wraps: (8) COMMAND_PROBLEM
        Wraps: (9) Node 5. Command with error:
          | ``````
          | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4}
          | ``````
        Wraps: (10) exit status 1
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError

    mixed_version_jobs.go:73,versionupgrade.go:208,tpcc.go:414,test_runner.go:780: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*backgroundStepper).wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/mixed_version_jobs.go:69
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:208
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:414
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:780
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6498
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:238
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

tbg commented 2 years ago

FWIW it failed on b3877b8, to my surprise.

b3877b8775 cdc: Allow webhook sink to provide client certificates to the remote webhook server <-- bad
afb8dbe096 streampb: delete `stream.pb.go`
5c3e798c08 bazel: upgrade `rules_go` to pull in new changes
785af465ac sql,server: add VIEWACTIVITYREDACTED role
9653dd13ce build: add <release branch> to nightly and latest tag values
6664d0c34d kv: circuit-break requests to unavailable replicas
ad59351e4b echotest: add testing helper
055a55f52c authors: add natelong to authors
19d12a63e7 roachtest: update 22.1 version map to v21.2.4
7577c4e6df cloud: bump orchestrator to v21.2.4
<-- "good" (probably)

tbg commented 2 years ago

None of this makes sense, going to try 7577c4e6df (build)

tbg commented 2 years ago

(wrong thread)

tbg commented 2 years ago

@cockroachdb/sql-experience could one of you folks take a look here? We're getting this error returned from crdbpgx.ExecuteTx:

Error: error in newOrder: ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: intent on key /Table/166/1/669/0): "sql txn" meta={id=e741a752 key=/Table/168/1/669/8/0 pri=0.07912940 epo=17 ts=1642609332.165110310,2 min=1642609036.453450943,0 seq=20} lock=true stat=PENDING rts=1642609326.485867639,0 wto=false gul=1642609036.953450943,0 (SQLSTATE 40001)

This seems to have the correct error code (SQLSTATE 40001), so how can it be bubbling up from the tpcc workload?

cockroach-teamcity commented 2 years ago

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ dc07599dc9db1acd5afa3a6537297815f25c1fca:

          |  1277.0s        0            3.0            7.0  85899.3 103079.2 103079.2 103079.2 stockLevel
          |  1278.0s        0            7.0            7.0 103079.2 103079.2 103079.2 103079.2 delivery
          |  1278.0s        0           62.1           70.9  40802.2 103079.2 103079.2 103079.2 newOrder
          |  1278.0s        0            7.0            7.1  66572.0  90194.3  90194.3  90194.3 orderStatus
          |  1278.0s        0           65.1           69.7  57982.1 103079.2 103079.2 103079.2 payment
          |  1278.0s        0            2.0            7.0    130.0  90194.3  90194.3  90194.3 stockLevel
          |  1279.0s        0            2.0            7.0  85899.3 103079.2 103079.2 103079.2 delivery
          |  1279.0s        0           60.0           70.9  68719.5 103079.2 103079.2 103079.2 newOrder
          |  1279.0s        0            4.0            7.0  73014.4 103079.2 103079.2 103079.2 orderStatus
          |  1279.0s        0           75.0           69.7  42949.7 103079.2 103079.2 103079.2 payment
          |  1279.0s        0            8.0            7.0  49392.1 103079.2 103079.2 103079.2 stockLevel
        Wraps: (8) COMMAND_PROBLEM
        Wraps: (9) Node 5. Command with error:
          | ``````
          | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4}
          | ``````
        Wraps: (10) exit status 1
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError

    mixed_version_jobs.go:73,versionupgrade.go:208,tpcc.go:414,test_runner.go:780: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*backgroundStepper).wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/mixed_version_jobs.go:69
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:208
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:414
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:780
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6498
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:238
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

cockroach-teamcity commented 2 years ago

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ e1068d77afbd39b162978281c9da7cbea49c1c3a:

          |  1190.0s        0            3.0            8.2  27917.3  90194.3  90194.3  90194.3 stockLevel
          |  1191.0s        0            4.0            8.1  23622.3 103079.2 103079.2 103079.2 delivery
          |  1191.0s        0           46.0           83.0  64424.5 103079.2 103079.2 103079.2 newOrder
          |  1191.0s        0            3.0            8.2   2952.8 103079.2 103079.2 103079.2 orderStatus
          |  1191.0s        0           52.9           81.7  38654.7 103079.2 103079.2 103079.2 payment
          |  1191.0s        0            2.0            8.2     29.4  81604.4  81604.4  81604.4 stockLevel
          |  1192.0s        0            6.0            8.1  77309.4 103079.2 103079.2 103079.2 delivery
          |  1192.0s        0           65.0           83.0  53687.1 103079.2 103079.2 103079.2 newOrder
          |  1192.0s        0            5.0            8.2   3087.0  62277.0  62277.0  62277.0 orderStatus
          |  1192.0s        0           44.0           81.7  32212.3 103079.2 103079.2 103079.2 payment
          |  1192.0s        0            9.0            8.2  26843.5 103079.2 103079.2 103079.2 stockLevel
        Wraps: (8) COMMAND_PROBLEM
        Wraps: (9) Node 5. Command with error:
          | ``````
          | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4}
          | ``````
        Wraps: (10) exit status 1
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError

    mixed_version_jobs.go:73,versionupgrade.go:208,tpcc.go:414,test_runner.go:780: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*backgroundStepper).wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/mixed_version_jobs.go:69
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:208
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:414
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:780
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6498
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:238
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

cockroach-teamcity commented 2 years ago

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 8cd28089c6c7333615ba3201e841839001d2f0e1:

          |  1142.0s        0            2.0            7.6     62.9 103079.2 103079.2 103079.2 stockLevel
          |  1143.0s        0            1.0            7.7  60129.5  60129.5  60129.5  60129.5 delivery
          |  1143.0s        0           34.0           77.8  81604.4 103079.2 103079.2 103079.2 newOrder
          |  1143.0s        0            5.0            7.6  42949.7  73014.4  73014.4  73014.4 orderStatus
          |  1143.0s        0           25.0           76.4  53687.1 103079.2 103079.2 103079.2 payment
          |  1143.0s        0            2.0            7.6  51539.6 103079.2 103079.2 103079.2 stockLevel
          |  1144.0s        0            3.0            7.7  27917.3 103079.2 103079.2 103079.2 delivery
          |  1144.0s        0           26.0           77.7  57982.1 103079.2 103079.2 103079.2 newOrder
          |  1144.0s        0            1.0            7.6    302.0    302.0    302.0    302.0 orderStatus
          |  1144.0s        0           29.9           76.4  66572.0 103079.2 103079.2 103079.2 payment
          |  1144.0s        0            2.0            7.6  26843.5  40802.2  40802.2  40802.2 stockLevel
        Wraps: (8) COMMAND_PROBLEM
        Wraps: (9) Node 5. Command with error:
          | ``````
          | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4}
          | ``````
        Wraps: (10) exit status 1
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError

    mixed_version_jobs.go:73,versionupgrade.go:208,tpcc.go:414,test_runner.go:780: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*backgroundStepper).wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/mixed_version_jobs.go:69
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:208
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:414
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:780
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6498
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:238
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

rafiss commented 2 years ago

It could mean that it hit the max retry count and gave up. There was a bug that made the error print out a wrong message, so let me try upgrading to https://github.com/cockroachdb/cockroach-go/tree/v2.2.6 for that fix.

tbg commented 2 years ago

Ah, interesting. I had seen the max retries but remembered that I hit that in the past and that there was a clear error. Seems like a good thing to try - since the correct error code is logged, I have a hard time reasoning about what else it might be.

cockroach-teamcity commented 2 years ago

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ c4c5ca2fdd5a641433a85a28d4dfd3bd4443015d:

          | _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
          |  1037.0s        0            0.0            6.3      0.0      0.0      0.0      0.0 delivery
          |  1037.0s        0           19.0           66.0 103079.2 103079.2 103079.2 103079.2 newOrder
          |  1037.0s        0            2.0            6.5 103079.2 103079.2 103079.2 103079.2 orderStatus
          |  1037.0s        0           26.0           64.1 103079.2 103079.2 103079.2 103079.2 payment
          |  1037.0s        0            4.0            6.5  40802.2 103079.2 103079.2 103079.2 stockLevel
          |  1038.0s        0            1.0            6.3 103079.2 103079.2 103079.2 103079.2 delivery
          |  1038.0s        0           25.0           66.0 103079.2 103079.2 103079.2 103079.2 newOrder
          |  1038.0s        0            2.0            6.5 103079.2 103079.2 103079.2 103079.2 orderStatus
          |  1038.0s        0           24.0           64.1 103079.2 103079.2 103079.2 103079.2 payment
          |  1038.0s        0            2.0            6.5 103079.2 103079.2 103079.2 103079.2 stockLevel
        Wraps: (8) COMMAND_PROBLEM
        Wraps: (9) Node 5. Command with error:
          | ``````
          | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-4}
          | ``````
        Wraps: (10) exit status 1
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError

    mixed_version_jobs.go:73,versionupgrade.go:208,tpcc.go:414,test_runner.go:780: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*backgroundStepper).wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/mixed_version_jobs.go:69
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:208
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:414
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:780
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6498
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:238
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

cockroach-teamcity commented 2 years ago

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ e5d1c374c31dc0e80a596c570da8dc45d73f80b8:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpcc/mixed-headroom/n5cpu16/run_1
    monitor.go:127,versionupgrade.go:695,versionupgrade.go:208,tpcc.go:414,test_runner.go:780: monitor failure: monitor command failure: unexpected node event: 2: dead (exit status 137)
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.importLargeBankStep.func1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:695
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:208
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:414
          | [...repeated from below...]
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:202
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (4) monitor command failure
        Wraps: (5) unexpected node event: 2: dead (exit status 137)
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

rafiss commented 2 years ago

I think in the latest run, node 2 died. I don't know why.

16:46:06 test_impl.go:323: test failure: monitor.go:127,versionupgrade.go:695,versionupgrade.go:208,tpcc.go:414,test_runner.go:780: monitor failure: monitor command failure: unexpected node event: 2: dead (exit status 137)

All I see in node 2 is

cockroach exited with code 137: Wed Jan 26 16:46:06 UTC 2022

is that an OOM?
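
As a quick aid for reading the exit code, here is a small Go sketch of the common shell convention that an exit status above 128 encodes "terminated by signal (status - 128)". The function name `signalFromExitStatus` is made up for illustration. Under that convention, 137 corresponds to signal 9 (SIGKILL), which is what the kernel OOM killer sends, but SIGKILL alone does not prove an OOM; the kernel log is needed to confirm it.

```go
package main

import "fmt"

// sigKill is the signal number of SIGKILL on Linux.
const sigKill = 9

// signalFromExitStatus interprets a shell-style exit status. By
// convention, a status in (128, 256) means the process was terminated
// by signal number status-128. Any other status is a plain exit code.
func signalFromExitStatus(status int) (signal int, killedBySignal bool) {
	if status > 128 && status < 256 {
		return status - 128, true
	}
	return 0, false
}

func main() {
	sig, ok := signalFromExitStatus(137)
	fmt.Println(sig, ok, sig == sigKill) // prints "9 true true"
}
```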

yuzefovich commented 2 years ago

> is that an OOM?

Yep:

[ 2299.275647] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cockroach.service,task=cockroach,pid=14134,uid=1000
[ 2299.275767] Out of memory: Killed process 14134 (cockroach) total-vm:17495164kB, anon-rss:10913220kB, file-rss:41884kB, shmem-rss:0kB, UID:1000 pgtables:32044kB oom_score_adj:0
[ 2299.844048] oom_reaper: reaped process 14134 (cockroach), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

tbg commented 2 years ago

https://share.polarsignals.com/2946090/

I'm seeing elsewhere (https://github.com/cockroachdb/cockroach/issues/68303#issuecomment-1022959384) that we seem to have gotten really bad at distributing the load during IMPORT. Here, the OOM happens during importLargeBankStep, which is also an import. But the two are not connected, because:

What complicates the situation here is that n2 is running the "old" version, and in fact so is the rest of the cluster, which has never run anything newer than v21.2.4.

So this failure is strictly a property of the 21.2 branch. Going to assign to bulk-IO as such.

blathers-crl[bot] commented 2 years ago

cc @cockroachdb/bulk-io

ajwerner commented 2 years ago

Is there any chance this is related to https://github.com/cockroachdb/cockroach/issues/76230? I don't see an OOM there, but I don't see much of anything there.

cockroach-teamcity commented 2 years ago

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on release-21.2 @ 31f167ca5bbe404abcb215f80524770ddc8c0163:

          | I220514 14:21:42.147031 1 workload/tpcc/tpcc.go:509  [-] 1  check 3.3.2.1 took 257.678751ms
          | I220514 14:21:54.673612 1 workload/tpcc/tpcc.go:509  [-] 2  check 3.3.2.2 took 12.526485234s
          | I220514 14:21:57.515815 1 workload/tpcc/tpcc.go:509  [-] 3  check 3.3.2.3 took 2.842140408s
          | I220514 14:25:35.024080 1 workload/tpcc/tpcc.go:509  [-] 4  check 3.3.2.4 took 3m37.508110259s
          | I220514 14:25:42.163398 1 workload/tpcc/tpcc.go:509  [-] 5  check 3.3.2.5 took 7.138712008s
          | Error: check failed: 3.3.2.5: pq: inbox communication error: rpc error: code = Canceled desc = context canceled
          | Error: COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 5. Command with error:
          |   | ``````
          |   | ./cockroach workload check tpcc --warehouses=909 {pgurl:1}
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
          |
          | stdout:
        Wraps: (4) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

    mixed_version_jobs.go:73,versionupgrade.go:207,tpcc.go:444,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*backgroundStepper).wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/mixed_version_jobs.go:69
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:207
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:444
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6498
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:238
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

/cc @cockroachdb/kv-triage

adityamaru commented 1 year ago

This is a very old issue on a branch that is EOL.