Closed: cockroach-teamcity, 5 years ago
SHA: https://github.com/cockroachdb/cockroach/commits/d0a93d286ee42ceb94f986b6a06a1afd52cf914e
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1149020&tab=buildLog
The test failed on master:
cluster.go:1226,cdc.go:612,cdc.go:124,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1149020-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3} returned:
stderr:
stdout:
Error: read tcp 10.142.0.61:56362->10.142.0.130:26257: read: connection reset by peer
Error: exit status 1
: exit status 1
cluster.go:1585,cdc.go:212,cdc.go:367,test.go:1211: Goexit() was called
^-- something bad is going on in this last one -- goroutine count on n1 explodes:
[n1] runtime stats: 5.6 GiB RSS, 8539 goroutines
Errors just before that in the logs:
190223 08:11:09.333526 116 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {teamcity-1149020-cdc-tpcc-1000-rangefeed-true-0001:26257 0}. Err :connection error: desc = "transport: failed to write client preface: io: read/write on closed pipe". Reconnecting...
This happens multiple times, which is odd because this node itself is -0001:26257.
Then this message is noteworthy because of its insanely large "behind by" duration:
I190223 08:11:15.731353 50034 ccl/changefeedccl/changefeed_processors.go:562 [n1] job 428733347495510017 span /Table/53/1/{0/1/-3001/1/0-1/10/-1031/8} is behind by 430808h11m15.72977394s
The goroutine count returns to normal with this message:
I190223 08:12:07.865814 50034 ccl/changefeedccl/changefeed_processors.go:562 [n1] job 428733347495510017 span /Table/53/1/{1/10/-1031/8/0-3/10/-1952/3} is behind by 430808h12m7.865763101s
I190223 08:12:15.587174 50034 ccl/changefeedccl/changefeed_stmt.go:444 [n1] CHANGEFEED job 428733347495510017 returning with error: cannot update progress on canceled job (id 428733347495510017)
Might be coincidence, but perhaps not.
In the test runner log itself, the failure is:
08:10:34 cluster.go:253: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1149020-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}
changefeed: 08:10:34 cdc.go:187: started changefeed at (1550909433231766446) 2019-02-23 08:10:33.231766446 +0000 UTC
2: 3483
1: 4776
3: 4766
Error: read tcp 10.142.0.61:56362->10.142.0.130:26257: read: connection reset by peer
Error: exit status 1
I don't know what prints the `Error:` lines; it's weird that there's no stack trace. Is this from the workload?
Btw, none of the nodes seem to have died or been oom-killed.
@danhhz any ideas?
ps #33318 would have helped here as well.
SHA: https://github.com/cockroachdb/cockroach/commits/71681f60031c101f17339236e2ba75f7a684d1a1
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1155867&tab=buildLog
The test failed on master:
cluster.go:1226,cdc.go:612,cdc.go:124,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1155867-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3} returned:
stderr:
stdout:
0 1.0 18.4 7784.6 7784.6 7784.6 7784.6 stockLevel
27s 0 17.0 19.2 7247.8 8321.5 8321.5 8321.5 delivery
27s 0 501.4 128.2 8589.9 9126.8 9126.8 9126.8 newOrder
27s 0 55.9 19.8 6174.0 7784.6 8321.5 8321.5 orderStatus
27s 0 222.7 160.0 23622.3 24696.1 24696.1 24696.1 payment
27s 0 43.9 19.4 6442.5 8321.5 8589.9 8589.9 stockLevel
28s 0 69.1 21.0 4026.5 8321.5 8321.5 9126.8 delivery
28s 0 336.5 135.6 9663.7 10200.5 10200.5 10200.5 newOrder
28s 0 52.1 20.9 2952.8 5368.7 6174.0 6174.0 orderStatus
28s 0 309.5 165.3 7516.2 24696.1 25769.8 25769.8 payment
28s 0 57.1 20.7 3355.4 6442.5 7247.8 7247.8 stockLevel
: signal: killed
cluster.go:1585,cdc.go:212,cdc.go:367,test.go:1211: unexpected status: failed
SHA: https://github.com/cockroachdb/cockroach/commits/032c4980720abc1bdd71e4428e4111e6e6383297
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1158877&tab=buildLog
The test failed on master:
cluster.go:1226,cdc.go:612,cdc.go:124,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1158877-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3} returned:
stderr:
stdout:
0 63.8 16.0 2281.7 4831.8 13421.8 15032.4 stockLevel
43s 0 15.0 17.2 209.7 402.7 419.4 419.4 delivery
43s 0 332.7 105.3 20401.1 24696.1 25769.8 25769.8 newOrder
43s 0 17.0 15.8 18.9 71.3 121.6 121.6 orderStatus
43s 0 568.4 146.0 4563.4 19327.4 40802.2 40802.2 payment
43s 0 39.0 16.5 335.5 11274.3 16106.1 16106.1 stockLevel
44s 0 22.1 17.3 503.3 1342.2 1476.4 1476.4 delivery
44s 0 270.9 109.1 3489.7 25769.8 25769.8 25769.8 newOrder
44s 0 23.1 16.0 16.3 62.9 75.5 75.5 orderStatus
44s 0 320.0 149.9 2818.6 23622.3 40802.2 40802.2 payment
44s 0 20.1 16.6 209.7 17179.9 17179.9 17179.9 stockLevel
: signal: killed
cluster.go:1585,cdc.go:212,cdc.go:367,test.go:1211: unexpected status: failed
SHA: https://github.com/cockroachdb/cockroach/commits/e6f0a720a98c60ed0545e7cbb4b2f224b463cae9
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1165130&tab=buildLog
The test failed on master:
cdc.go:733,cdc.go:214,cdc.go:367,test.go:1214: initial scan did not complete
SHA: https://github.com/cockroachdb/cockroach/commits/38bb1e7905b89f911bd74be4f5830217ffb7b343
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1168752&tab=buildLog
The test failed on master:
cdc.go:733,cdc.go:214,cdc.go:367,test.go:1214: initial scan did not complete
SHA: https://github.com/cockroachdb/cockroach/commits/5f7de3c35348e847dc0d4ab1ba8d76574e1706c2
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1169980&tab=buildLog
The test failed on master:
cluster.go:1244,cdc.go:612,cdc.go:124,cluster.go:1582,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1169980-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3} returned:
stderr:
stdout:
0 19.0 21.2 125.8 192.9 192.9 192.9 stockLevel
15m10s 0 23.9 21.1 113.2 260.0 260.0 260.0 delivery
15m10s 0 209.5 211.2 60.8 209.7 268.4 268.4 newOrder
15m10s 0 27.9 21.3 13.6 32.5 44.0 44.0 orderStatus
15m10s 0 212.5 212.5 35.7 201.3 234.9 268.4 payment
15m10s 0 21.9 21.2 100.7 176.2 184.5 184.5 stockLevel
15m11s 0 27.1 21.1 92.3 130.0 159.4 159.4 delivery
15m11s 0 176.4 211.1 54.5 83.9 113.2 167.8 newOrder
15m11s 0 23.1 21.3 13.1 27.3 37.7 37.7 orderStatus
15m11s 0 197.5 212.5 29.4 62.9 113.2 285.2 payment
15m11s 0 16.0 21.2 104.9 142.6 201.3 201.3 stockLevel
: signal: killed
cluster.go:1603,cdc.go:212,cdc.go:367,test.go:1214: unexpected status: failed
SHA: https://github.com/cockroachdb/cockroach/commits/a512e390f7f2f2629f3f811bab5866c46e3e5713
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1174765&tab=buildLog
The test failed on provisional_201903122203_v19.1.0-beta.20190318:
cluster.go:1246,cdc.go:625,cdc.go:125,cluster.go:1584,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1174765-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3} returned:
stderr:
stdout:
158.4 0.0 0.0 0.0 0.0 newOrder
12m8s 717 0.0 16.2 0.0 0.0 0.0 0.0 orderStatus
12m8s 717 0.0 161.4 0.0 0.0 0.0 0.0 payment
12m8s 717 0.0 16.2 0.0 0.0 0.0 0.0 stockLevel
E190313 06:13:23.769766 1 workload/cli/run.go:419 error in orderStatus: dial tcp 10.128.0.31:26257: connect: connection refused
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
12m9s 730 0.0 16.1 0.0 0.0 0.0 0.0 delivery
12m9s 730 0.0 158.2 0.0 0.0 0.0 0.0 newOrder
12m9s 730 0.0 16.2 0.0 0.0 0.0 0.0 orderStatus
12m9s 730 0.0 161.2 0.0 0.0 0.0 0.0 payment
12m9s 730 0.0 16.2 0.0 0.0 0.0 0.0 stockLevel
: signal: killed
cluster.go:1605,cdc.go:213,cdc.go:368,test.go:1214: unexpected node event: 2: dead
^- hey look at this. Also, 13k goroutines. Going to look at the breakdown shortly
W190313 06:10:20.813840 358 server/node.go:880 [n2,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:sys.goroutines Value:13173 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
I190313 06:10:26.345689 356 server/status/runtime.go:464 [n2] runtime stats: 6.0 GiB RSS, 13330 goroutines, 582 MiB/271 MiB/1.4 GiB GO alloc/idle/total, 4.0 GiB/4.8 GiB CGO alloc/total, 194757.3 CGO/sec, 905.2/160.2 %(u/s)time, 0.0 %gc (7x), 91 MiB/116 MiB (r/w)net
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8f78e8c pc=0x7f90dc4b5a4a]
runtime stack:
runtime.throw(0x32d2607, 0x2a)
/usr/local/go/src/runtime/panic.go:608 +0x72 fp=0x7f8c2a3fe7e8 sp=0x7f8c2a3fe7b8 pc=0x72a972
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:374 +0x2f2 fp=0x7f8c2a3fe838 sp=0x7f8c2a3fe7e8 pc=0x740762
goroutine 66830 [syscall]:
non-Go function
pc=0x7f90dc4b5a4a
non-Go function
pc=0x2703f72
non-Go function
pc=0x2959a3f
non-Go function
pc=0x2951bd4
non-Go function
pc=0x2701798
non-Go function
pc=0x2706399
non-Go function
pc=0x2559612
non-Go function
pc=0x75a62f
runtime.cgocall(0x25595d0, 0xc048731e70, 0x2cfba80)
/usr/local/go/src/runtime/cgocall.go:128 +0x5e fp=0xc048731e18 sp=0xc048731de0 pc=0x700f1e
github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBIterSeek(0x7f90bfe42080, 0xc03a0f3840, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
_cgo_gotypes.go:752 +0x74 fp=0xc048731e70 sp=0xc048731e18 pc=0x134d0c4
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Seek.func2(0x7f90bfe42080, 0xc03a0f3840, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:2125 +0xb4 fp=0xc048731f38 sp=0xc048731e70 pc=0x13653e4
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Seek(0xc0466a8268, 0xc03a0f3840, 0xa, 0x14, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:2125 +0x161 fp=0xc048732190 sp=0xc048731f38 pc=0x1359a21
github.com/cockroachdb/cockroach/pkg/storage/engine.(*batchIterator).Seek(0xc0466a8268, 0xc03a0f3840, 0xa, 0x14, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1427 +0x5b fp=0xc0487321d0 sp=0xc048732190 pc=0x135599b
github.com/cockroachdb/cockroach/pkg/storage/engine.mvccGetMetadata(0x3a177c0, 0xc0466a8268, 0xc03a0f3840, 0xa, 0x14, 0x0, 0x0, 0xc02201ed80, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/mvcc.go:747 +0x7c fp=0xc048732318 sp=0xc0487321d0 pc=0x133a25c
github.com/cockroachdb/cockroach/pkg/storage/engine.mvccPutInternal(0x39d80a0, 0xc035793410, 0x7f90dd2ff100, 0xc026826240, 0x3a177c0, 0xc0466a8268, 0xc0380f3580, 0xc03a0f3840, 0xa, 0x14, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/mvcc.go:1285 +0xe6 fp=0xc0487329a0 sp=0xc048732318 pc=0x133d696
github.com/cockroachdb/cockroach/pkg/storage/engine.mvccPutUsingIter(0x39d80a0, 0xc035793410, 0x7f90dd2ff100, 0xc026826240, 0x3a177c0, 0xc0466a8268, 0xc0380f3580, 0xc03a0f3840, 0xa, 0x14, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/mvcc.go:1104 +0x164 fp=0xc048732a80 sp=0xc0487329a0 pc=0x133c8a4
github.com/cockroachdb/cockroach/pkg/storage/engine.MVCCDelete(0x39d80a0, 0xc035793410, 0x7f90dc29b878, 0xc026826240, 0xc0380f3580, 0xc03a0f3840, 0xa, 0x14, 0x158b6fcf00e5b32e, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/mvcc.go:1074 +0x1c1 fp=0xc048732b50 sp=0xc048732a80 pc=0x133c6e1
github.com/cockroachdb/cockroach/pkg/storage/batcheval.Delete(0x39d80a0, 0xc035793410, 0x7f90dc29b878, 0xc026826240, 0x3a1f500, 0xc0053bfb00, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/batcheval/cmd_delete.go:36 +0x123 fp=0xc048732c28 sp=0xc048732b50 pc=0x14f2fc3
github.com/cockroachdb/cockroach/pkg/storage.evaluateCommand(0x39d80a0, 0xc035793410, 0xc0266b6f50, 0x8, 0xa, 0x7f90dc29b878, 0xc026826240, 0x3a1f500, 0xc0053bfb00, 0xc0380f3580, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_evaluate.go:393 +0x280 fp=0xc048733320 sp=0xc048732c28 pc=0x195e060
github.com/cockroachdb/cockroach/pkg/storage.evaluateBatch(0x39d80a0, 0xc035793410, 0xc0266b6f50, 0x8, 0x7f90dc29b878, 0xc026826240, 0x3a1f500, 0xc0053bfb00, 0xc0380f3580, 0x158b6fcf00e5b32e, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_evaluate.go:225 +0x31d fp=0xc048733dd0 sp=0xc048733320 pc=0x195d04d
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).evaluateWriteBatchWithLocalRetries(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0xc0266b6f50, 0x8, 0x3a1f500, 0xc0053bfb00, 0xc0380f3580, 0x158b6fcf00e5b32e, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_write.go:447 +0x224 fp=0xc0487341c8 sp=0xc048733dd0 pc=0x1992804
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).evaluateWriteBatch(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0xc0266b6f50, 0x8, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_write.go:385 +0x215 fp=0xc0487350a8 sp=0xc0487341c8 pc=0x19912d5
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).evaluateProposal(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0xc0266b6f50, 0x8, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go:846 +0x174 fp=0xc0487358a0 sp=0xc0487350a8 pc=0x196a494
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).requestToProposal(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0xc0266b6f50, 0x8, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go:943 +0xec fp=0xc0487359b0 sp=0xc0487358a0 pc=0x196afec
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).propose(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0x158b6d62bddb4ad1, 0x0, 0x0, 0x200000002, 0x3, 0x0, 0xc01394f7b0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:147 +0x4ad fp=0xc048736128 sp=0xc0487359b0 pc=0x196ca8d
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).tryExecuteWriteBatch(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, 0xc017500600, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_write.go:171 +0x68b fp=0xc048736860 sp=0xc048736128 pc=0x198fd5b
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).executeWriteBatch(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, 0xc017500600, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_write.go:51 +0x7a fp=0xc048736938 sp=0xc048736860 pc=0x198f59a
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).sendWithRangeID(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0x1a5, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:514 +0x6a6 fp=0xc048736ab0 sp=0xc048736938 pc=0x1947b06
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).Send(0xc0053bfb00, 0x39d80a0, 0xc0357933b0, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, 0xc017500600, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:462 +0x83 fp=0xc048736b70 sp=0xc048736ab0 pc=0x1947423
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Send(0xc0010e8000, 0x39d80a0, 0xc0357933b0, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, 0xc017500600, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2896 +0x70c fp=0xc0487370a8 sp=0xc048736b70 pc=0x19a789c
github.com/cockroachdb/cockroach/pkg/storage.(*Stores).Send(0xc0006a8be0, 0x39d80a0, 0xc035793350, 0x0, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, 0xc017500600, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/stores.go:185 +0xcb fp=0xc048737198 sp=0xc0487370a8 pc=0x19c20bb
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x39d80a0, 0xc035793350, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:981 +0x1b0 fp=0xc0487372a8 sp=0xc048737198 pc=0x24b4df0
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc0001e10e0, 0x39d80a0, 0xc035793350, 0x3281312, 0x10, 0xc048737338, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:304 +0xee fp=0xc0487372e0 sp=0xc0487372a8 pc=0x1084d5e
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc000c7a000, 0x39d80a0, 0xc035793350, 0xc024fb1580, 0xc035793350, 0x38, 0x20300d)
/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:968 +0x16a fp=0xc048737368 sp=0xc0487372e0 pc=0x248e2fa
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc000c7a000, 0x39d80a0, 0xc035793350, 0xc024fb1580, 0x0, 0xc048737438, 0x7e4f2c)
/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:1009 +0x9c fp=0xc0487373e8 sp=0xc048737368 pc=0x248e40c
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.Batch(0x39ab8a0, 0xc000c7a000, 0x39d80a0, 0xc0357932f0, 0xc024fb1580, 0x0, 0x0, 0x0, 0xc035793140, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:456 +0x4f fp=0xc048737430 sp=0xc0487373e8 pc=0x1521fff
github.com/cockroachdb/cockroach/pkg/rpc.(*internalClientAdapter).Batch(0xc000220ab0, 0x39d80a0, 0xc0357932f0, 0xc024fb1580, 0x0, 0x0, 0x0, 0x158b6fcf00e5b32e, 0x0, 0x0)
<autogenerated>:1 +0x99 fp=0xc048737498 sp=0xc048737430 pc=0x152da59
github.com/cockroachdb/cockroach/pkg/kv.(*grpcTransport).sendBatch(0xc0357932c0, 0x39d80a0, 0xc0357932f0, 0x2, 0x39b6fe0, 0xc000220ab0, 0x0, 0x0, 0x200000002, 0x3, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:239 +0x126 fp=0xc048737550 sp=0xc048737498 pc=0x157bb76
github.com/cockroachdb/cockroach/pkg/kv.(*grpcTransport).SendNext.func1()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:207 +0xa1 fp=0xc048737638 sp=0xc048737550 pc=0x158fa01
github.com/cockroachdb/cockroach/pkg/kv.withMarshalingDebugging(0x39d80a0, 0xc0357932f0, 0x0, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, 0xc017500600, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:188 +0xab fp=0xc048737670 sp=0xc048737638 pc=0x157b70b
github.com/cockroachdb/cockroach/pkg/kv.(*grpcTransport).SendNext(0xc0357932c0, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, 0xc017500600, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:206 +0x211 fp=0xc048737850 sp=0xc048737670 pc=0x157b971
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendToReplicas(0xc00011b320, 0x39d80a0, 0xc035793140, 0xc00011b370, 0x1a5, 0xc02b9875e0, 0x3, 0x3, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1363 +0x2d3 fp=0xc048737bf8 sp=0xc048737850 pc=0x15719d3
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendRPC(0xc00011b320, 0x39d80a0, 0xc035793140, 0x1a5, 0xc02b9875e0, 0x3, 0x3, 0x0, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:416 +0x244 fp=0xc048737d18 sp=0xc048737bf8 pc=0x156bcb4
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendSingleRange(0xc00011b320, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500600, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:496 +0x221 fp=0xc048737e98 sp=0xc048737d18 pc=0x156c231
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc00011b320, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500600, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1139 +0x322 fp=0xc048738448 sp=0xc048737e98 pc=0x156fb22
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc00011b320, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500600, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:962 +0x8b3 fp=0xc048738810 sp=0xc048738448 pc=0x156e953
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).Send(0xc00011b320, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:710 +0x48b fp=0xc048738af8 sp=0xc048738810 pc=0x156d7eb
github.com/cockroachdb/cockroach/pkg/kv.(*txnLockGatekeeper).SendLocked(0xc038e6f190, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:234 +0xe8 fp=0xc048738bb0 sp=0xc048738af8 pc=0x157c9e8
github.com/cockroachdb/cockroach/pkg/kv.(*txnMetricRecorder).SendLocked(0xc038e6f158, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_metric_recorder.go:51 +0xa2 fp=0xc048738c68 sp=0xc048738bb0 pc=0x1587b22
github.com/cockroachdb/cockroach/pkg/kv.(*txnCommitter).SendLocked(0xc038e6f148, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_committer.go:44 +0x219 fp=0xc048738d28 sp=0xc048738c68 pc=0x1583229
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc038e6f0c8, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:160 +0x83 fp=0xc048738e08 sp=0xc048738d28 pc=0x158a433
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).SendLocked(0xc038e6f0c8, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:101 +0xf9 fp=0xc048738ef8 sp=0xc048738e08 pc=0x1589f09
github.com/cockroachdb/cockroach/pkg/kv.(*txnPipeliner).SendLocked(0xc038e6f040, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_pipeliner.go:182 +0xf9 fp=0xc048739010 sp=0xc048738ef8 pc=0x1587fe9
github.com/cockroachdb/cockroach/pkg/kv.(*txnIntentCollector).SendLocked(0xc038e6f000, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_intent_collector.go:101 +0x43d fp=0xc0487391a0 sp=0xc048739010 pc=0x15860dd
github.com/cockroachdb/cockroach/pkg/kv.(*txnSeqNumAllocator).SendLocked(0xc038e6efe8, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_seq_num_allocator.go:92 +0x23b fp=0xc0487392d0 sp=0xc0487391a0 pc=0x1589ccb
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeater).SendLocked(0xc038e6ef38, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeater.go:224 +0x27b fp=0xc048739438 sp=0xc0487392d0 pc=0x1583b9b
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).Send(0xc038e6ed80, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:762 +0x591 fp=0xc048739788 sp=0xc048739438 pc=0x157f791
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).sendUsingSender(0xc0005ae300, 0x39d80a0, 0xc02c3bef90, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:622 +0x119 fp=0xc048739858 sp=0xc048739788 pc=0x116e2f9
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Send(0xc019f01e60, 0x39d80a0, 0xc02c3bef90, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:791 +0x13c fp=0xc0487399b8 sp=0xc048739858 pc=0x117854c
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Send-fm(0x39d80a0, 0xc02c3bef90, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:485 +0x76 fp=0xc048739a70 sp=0xc0487399b8 pc=0x117c266
github.com/cockroachdb/cockroach/pkg/internal/client.sendAndFill(0x39d80a0, 0xc02c3bef90, 0xc048739bd8, 0xc035c10a00, 0xc025ea7b00, 0xc02c199640)
/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:547 +0xeb fp=0xc048739ba8 sp=0xc048739a70 pc=0x116dd1b
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Run(0xc019f01e60, 0x39d80a0, 0xc02c3bef90, 0xc035c10a00, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:485 +0xcf fp=0xc048739bf8 sp=0xc048739ba8 pc=0x1176a7f
github.com/cockroachdb/cockroach/pkg/sql.(*tableWriterBase).finalize(0xc027700a28, 0x39d80a0, 0xc02c3bef90, 0xc0443d7400, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/tablewriter.go:164 +0xed fp=0xc048739c38 sp=0xc048739bf8 pc=0x20de34d
github.com/cockroachdb/cockroach/pkg/sql.(*tableUpdater).finalize(0xc027700a28, 0x39d80a0, 0xc02c3bef90, 0xc035a85600, 0xc009b8f450, 0xc037bec900, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/tablewriter_update.go:68 +0x48 fp=0xc048739c78 sp=0xc048739c38 pc=0x20e10d8
github.com/cockroachdb/cockroach/pkg/sql.(*updateNode).BatchedNext(0xc027700a00, 0x39d80a0, 0xc02c3bef90, 0xc035a85600, 0xc009b8f450, 0xc02c3bf080, 0x0, 0x39d80a0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/update.go:572 +0x23e fp=0xc048739ce0 sp=0xc048739c78 pc=0x20f322e
github.com/cockroachdb/cockroach/pkg/sql.(*serializeNode).Next(0xc02c3bec90, 0x39d80a0, 0xc02c3bef90, 0xc035a85600, 0xc009b8f450, 0x0, 0xc02c3bf080, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_batch.go:123 +0xa1 fp=0xc048739d30 sp=0xc048739ce0 pc=0x2086cd1
github.com/cockroachdb/cockroach/pkg/sql.(*planNodeToRowSource).Next(0xc0207b2000, 0xc009b8f450, 0x39d9e60, 0xc02c3bec90, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_node_to_row_source.go:178 +0x447 fp=0xc048739e60 sp=0xc048739d30 pc=0x2089b87
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x39d80a0, 0xc02c3bef90, 0x39e38a0, 0xc0207b2000, 0x39c9520, 0xc031fc0700)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:175 +0x35 fp=0xc048739ea0 sp=0xc048739e60 pc=0x1da2205
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc0207b2000, 0x39d80a0, 0xc02c3bef90)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:801 +0x92 fp=0xc048739ee0 sp=0xc048739ea0 pc=0x1dd2df2
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).Run(0xc02ab5c700, 0x39d80a0, 0xc02c3bef90, 0x33d6c08, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:624 +0x1ec fp=0xc048739f48 sp=0xc048739ee0 pc=0x1db24dc
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc000175040, 0xc0276d4180, 0xc019f01e60, 0xc04873a898, 0xc049f1c2c0, 0xc009b8f510, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:252 +0x8ad fp=0xc04873a788 sp=0xc048739f48 pc=0x200944d
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun(0xc000175040, 0x39d80a0, 0xc02c3be4b0, 0xc009b8f510, 0xc0276d4180, 0xc019f01e60, 0x39d9ca0, 0xc02eced680, 0xc049f1c2c0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:839 +0x227 fp=0xc04873aa48 sp=0xc04873a788 pc=0x200cb37
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine(0xc009b8f100, 0x39d80a0, 0xc02c3be4b0, 0xc009b8f450, 0x3, 0x7f90dc2d9940, 0xc02f455760, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1100 +0x283 fp=0xc04873aad8 sp=0xc04873aa48 pc=0x1fb3ec3
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc009b8f100, 0x39d80a0, 0xc02c3be4b0, 0xc009b8f450, 0x7f90dc2d9940, 0xc02f455760, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:945 +0x658 fp=0xc04873abd8 sp=0xc04873aad8 pc=0x1fb2dc8
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc009b8f100, 0x39d80a0, 0xc02c3be4b0, 0x39de6e0, 0xc03aa47030, 0xc04f122bdd, 0xe4, 0x0, 0xc013029720, 0x2, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:457 +0xdd4 fp=0xc04873b3b8 sp=0xc04873abd8 pc=0x1faeeb4
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc009b8f100, 0x39d80a0, 0xc02c3be4b0, 0x39de6e0, 0xc03aa47030, 0xc04f122bdd, 0xe4, 0x0, 0xc013029720, 0x2, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:103 +0x610 fp=0xc04873b660 sp=0xc04873b3b8 pc=0x1fadc60
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc009b8f100, 0x39d7fe0, 0xc011a9e940, 0xc0006952f8, 0x5400, 0x15000, 0xc000695390, 0xc0117fa4f0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1237 +0x140b fp=0xc04873bec8 sp=0xc04873b660 pc=0x1fa3d3b
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc000b2e000, 0x39d7fe0, 0xc011a9e940, 0xc009b8f100, 0x5400, 0x15000, 0xc000695390, 0xc0117fa4f0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:429 +0xce fp=0xc04873bf28 sp=0xc04873bec8 pc=0x1f9f6be
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl.func4(0xc000b2e000, 0xc040479600, 0xc0110b85d4, 0xc0117fa4f0, 0x39d7fe0, 0xc011a9e940, 0xc009b8f100, 0x5400, 0x15000, 0xc000695390, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:332 +0xcf fp=0xc04873bf88 sp=0xc04873bf28 pc=0x22201bf
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc04873bf90 sp=0xc04873bf88 pc=0x75aeb1
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl
/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:315 +0x1040
The goroutine dump shows that everyone and their dog are in runtime.gopark.
The machines here are n1-highcpu-16, which have ~14424 MB of RAM, so judging from the last stats printout we're not even running out of memory:
I190313 06:10:26.345689 356 server/status/runtime.go:464 [n2] runtime stats: 6.0 GiB RSS, 13330 goroutines, 582 MiB/271 MiB/1.4 GiB GO alloc/idle/total, 4.0 GiB/4.8 GiB CGO alloc/total, 194757.3 CGO/sec, 905.2/160.2 %(u/s)time, 0.0 %gc (7x), 91 MiB/116 MiB (r/w)net
panicparse is unfortunately not good at categorizing these; it only ever eats the first line of each stack trace, not sure why. I'll spend a few minutes trying to figure out why it does that; that'll be much easier than hand-combing 13k goroutines.
Ah, needed sed -e 's/ fp=.*//g' (this isn't a panic dump, so it has some extra info that trips up pp).
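For reference, the preprocessing step looks like this (a sketch; the dump filename is an assumption, and pp stands in for the panicparse binary):

```shell
# One sample frame from the runtime dump; the real input is the full
# goroutine dump pulled from the node.
printf 'runtime.gopark(0x33dc140, 0x5acf5c0) fp=0xc00008af80 sp=0xc00008af60 pc=0x72c6db\n' > goroutines.txt

# panicparse expects true panic output, which has no fp=/sp=/pc= fields;
# strip them so it can group identical stacks together.
sed -e 's/ fp=.*//g' goroutines.txt

# then feed the cleaned dump to panicparse, e.g.:
#   sed -e 's/ fp=.*//g' goroutines.txt | pp
```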
The goroutines with notable multiplicity are
Routine
1707: chan receive [8~10 minutes] [Created by server.go:678 (*Server).serveStreams.func1({[] [] false})]
Stack
- proc.go:302 gopark({[{54378816 #90} {824748121496 *} {5901 *} {3 }] [] false})
- proc.go:308 goparkunlock({[{824748121496 *} {5901 } {3 }] [] false})
- chan.go:520 chanrecv({[{824748121408 *} {824947137008 *} {1 } {824947136736 *}] [] false})
- chan.go:402 chanrecv1({[{824748121408 *} {824947137008 *}] [] false})
- replica_rangefeed.go:244 (*Replica).RangeFeed({[{825021269120 *} {825014300896 *} {60784256 #147} {824829156944 *} {53000873 #66} {20 } {60770336 #144} {824642466016 #1468} {0 }] [] false})
- store.go:3043 (*Store).RangeFeed({[{824651448320 #2475} {825014300896 *} {60784256 #147} {824829156944 *} {0 }] [] false})
- stores.go:210 (*Stores).RangeFeed({[{824640703456 #1421} {825014300896 *} {60784256 #147} {824829156944 *} {16 }] [] false})
- node.go:1089 (*Node).RangeFeed({[{824646803456 #1856} {825014300896 *} {60784256 #147} {824829156944 *} {824646803456 #1856} {824726518816 #4322}] [] false})
- api.pb.go:9396 _Internal_RangeFeed_Handler({[{52463840 #56} {824646803456 #1856} {60771008 #145} {824771164000 *} {7383496 } {32 }] [] false})
- context.go:239 NewServerWithInterceptor.func3({[{52463840 #56} {824646803456 #1856} {60771008 #145} {824771164000 *} {824839464896 *} {54355744 #75} {60657056 #127} {824635244736 #490}] [] false})
- server.go:1167 (*Server).processStreamingRPC({[{824643576256 #1623} {60820256 #155} {824727666688 #4588} {824723819776 *} {824644294416 #1676} {90820224 #166} {0 } {0 } {0 }] [] false})
- server.go:1253 (*Server).handleStream({[{824643576256 #1623} {60820256 #155} {824727666688 #4588} {824723819776 *} {0 }] [] false})
- server.go:680 (*Server).serveStreams.func1.1({[{824637933904 #1178} {824643576256 #1623} {60820256 #155} {824727666688 #4588} {824723819776 *}] [] false})
- asm_amd64.s:1333 goexit({[] [] false})
666: IO wait [Created by net.go:136 (*Server).ServeWith({[] [] false})]
Stack
- proc.go:302 gopark({[{54378768 #89} {140260017864928 *} {824634186498 *} {5 }] [] false})
- netpoll.go:366 netpollblock({[{140260017864888 *} {114 } {4096 }] [] false})
- netpoll.go:173 runtime_pollWait({[{140260017864888 *} {114 } {825330931360 *}] [] false})
- fd_poll_runtime.go:85 (*pollDesc).wait({[{824851198232 *} {114 } {18446744073709551360 } {60448256 #104} {91100784 #170}] [] false})
- fd_poll_runtime.go:90 (*pollDesc).waitRead({[{824851198232 *} {824970498048 *} {4096 } {4096 }] [] false})
- fd_unix.go:169 (*FD).Read({[{824851198208 *} {824970498048 *} {4096 } {4096 } {0 } {0 } {0 }] [] false})
- fd_unix.go:202 (*netFD).Read({[{824851198208 *} {824970498048 *} {4096 } {4096 } {0 } {825330931672 *} {8954872 }] [] false})
- net.go:177 (*conn).Read({[{824957158528 *} {824970498048 *} {4096 } {4096 } {0 } {0 } {0 }] [] false})
- cmux.go:218 (*MuxConn).Read({[{824727436640 *} {824970498048 *} {4096 } {4096 } {0 } {0 } {0 }] [] false})
- conn.go:1622 (*readTimeoutConn).Read({[{824638135040 *} {824970498048 *} {4096 } {4096 } {0 } {0 } {0 }] [] false})
- bufio.go:100 (*Reader).fill({[{824747410504 *}] [] false})
- bufio.go:242 (*Reader).ReadByte({[{824747410504 *} {824747411984 *} {3418157275668 *} {95255200 *}] [] false})
- encoding.go:120 (*ReadBuffer).ReadTypedMsg({[{824747412264 *} {60544608 #117} {824747410504 *} {60519136 #115} {95595432 #315} {0 } {0 }] [] false})
- conn.go:346 (*conn).serveImpl({[{824747410432 *} {60653536 #122} {824723358080 *} {825000127392 *} {824645443584 #1847} {21504 } {86016 } {824640623504 #1418} {824635691232 #494} {0 }] [] true})
- conn.go:168 serveConn({[{60653536 #122} {824843083712 *} {60803072 #153} {824727436640 *} {824724590601 *} {4 } {824749649376 *} {60513376 #113} {824777849360 *} {16384 }] [] true})
- server.go:518 (*Server).ServeConn({[{824640623104 #1416} {60653536 #122} {824843083712 *} {60803072 #153} {824727436640 *} {0 } {0 }] [] false})
- server.go:1712 (*Server).Start.func20.1({[{60803072 #153} {824727436640 *}] [] false})
- net.go:139 (*Server).ServeWith.func1({[{824635691232 #494} {60653728 #2} {824650132720 #2223} {824639488760 #1387} {60803072 #153} {824727436640 *} {824650132768 #2224}] [] false})
- asm_amd64.s:1333 goexit({[] [] false})
Routine
1504: select [0~9 minutes] [Created by errgroup.go:54 (*Group).Go({[] [] false})]
Stack
- proc.go:302 gopark({[{54378880 #1} {0 } {6153 } {1 }] [] false})
- select.go:313 selectgo({[{824841472296 *} {824841472224 *} {2 } {0 } {1 *}] [] false})
- context.go:470 rangeFeedClientAdapter.Recv({[{60653536 #122} {824826666560 *} {824957838752 *} {824957838848 *} {16 *} {824969093792 *} {825016820944 *}] [] false})
- <autogenerated>:1 (*rangeFeedClientAdapter).Recv({[{824826616480 *} {824841472632 *} {2 } {0 }] [] false})
- dist_sender_rangefeed.go:241 (*DistSender).singleRangeFeed({[{824634880800 #486} {60653536 #122} {825016820928 *} {1552456829945648143 #106722} {0 } {0 } {0 } {0 } {0 } {0 }] [] true})
- dist_sender_rangefeed.go:146 (*DistSender).partialRangeFeed({[{824634880800 #486} {60653536 #122} {825016820928 *} {1552456829945648143 #106722} {0 } {0 } {0 } {0 } {0 } {0 }] [] true})
- dist_sender_rangefeed.go:67 (*DistSender).RangeFeed.func1.1({[{60653536 #122} {825016820928 *} {60653536 *} {824650837904 *}] [] false})
- ctxgroup.go:170 Group.GoCtx.func1({[{824633720840 *} {54379232 #91}] [] false})
- errgroup.go:57 (*Group).Go.func1({[{825021767776 *} {824835422304 *}] [] false})
- asm_amd64.s:1333 goexit({[] [] false})
Routine
1504: chan receive [8~10 minutes] [Created by context.go:522 internalClientAdapter.RangeFeed({[] [] false})]
Stack
- proc.go:302 gopark({[{54378816 #90} {824723762744 *} {5901 *} {3 }] [] false})
- proc.go:308 goparkunlock({[{824723762744 *} {5901 } {3 }] [] false})
- chan.go:520 chanrecv({[{824723762656 *} {824837369248 *} {1 } {824837368976 *}] [] false})
- chan.go:402 chanrecv1({[{824723762656 *} {824837369248 *}] [] false})
- replica_rangefeed.go:244 (*Replica).RangeFeed({[{824760885248 *} {825013584352 *} {60797312 #148} {824722335904 *} {53000873 #66} {20 } {60770336 #144} {824642466016 #1468} {0 }] [] false})
- store.go:3043 (*Store).RangeFeed({[{824651448320 #2475} {825013584352 *} {60797312 #148} {824722335904 *} {0 }] [] false})
- stores.go:210 (*Stores).RangeFeed({[{824640703456 #1421} {825013584352 *} {60797312 #148} {824722335904 *} {824729458544 *}] [] false})
- node.go:1089 (*Node).RangeFeed({[{824646803456 #1856} {825013584352 *} {60797312 #148} {824722335904 *} {824729458552 *} {22427082 *}] [] false})
- context.go:524 internalClientAdapter.RangeFeed.func1({[{824826462640 *} {60471456 #106} {824646803456 #1856} {825013584352 *} {60653536 #122} {824819689024 *} {824723762464 *} {824723762560 *}] [] false})
- asm_amd64.s:1333 goexit({[] [] false})
3211: select [0~9 minutes] [Created by stopper.go:320 (*Stopper).RunAsyncTask({[] [] false})]
Stack
- proc.go:302 gopark({[{54378880 #1} {0 } {6153 } {1 }] [] false})
- select.go:313 selectgo({[{824989961728 *} {824989961668 *} {3 } {0 } {1 *}] [] false})
- registry.go:184 (*registration).outputLoop({[{824763442272 *} {60653536 #122} {824947025984 *} {824947025984 *} {825019916352 *}] [] false})
- registry.go:201 (*registration).runOutputLoop({[{824763442272 *} {60653536 #122} {824947025984 *}] [] false})
- processor.go:220 (*Processor).Start.func1.1({[{60653536 #122} {824947025728 *}] [] false})
- stopper.go:325 (*Stopper).RunAsyncTask.func1({[{824635691232 #494} {60653536 #122} {824947025728 *} {824770253952 *} {30 } {0 } {0 } {825005284864 *}] [] false})
- asm_amd64.s:1333 goexit({[] [] false})
641: sync.Cond.Wait [Created by conn.go:315 (*conn).serveImpl({[] [] false})]
Stack
- proc.go:302 gopark({[{54378816 #90} {824723358424 *} {6675 *} {3 }] [] false})
- proc.go:308 goparkunlock({[{824723358424 *} {6675 } {3 }] [] false})
- sema.go:510 runtime_notifyListWait({[{824723358416 *} {1660 *}] [] false})
- cond.go:56 (*Cond).Wait({[{824723358400 *}] [] false})
- conn_io.go:429 (*StmtBuf).curCmd({[{824747414288 *} {0 } {0 } {0 } {0 } {0 }] [] false})
- conn_executor.go:1143 (*connExecutor).run({[{824822563712 *} {60653536 #122} {824739690304 *} {824640623352 #1417} {21504 } {86016 } {824640623504 #1418} {825045412464 *} {0 } {0 }] [] false})
- conn_executor.go:429 (*Server).ServeConn({[{824645443584 #1847} {60653536 #122} {824739690304 *} {824822563712 *} {21504 } {86016 } {824640623504 #1418} {825045412464 *} {0 } {0 }] [] false})
- conn.go:332 (*conn).serveImpl.func4({[{824645443584 #1847} {824747412736 *} {825044716048 *} {825045412464 *} {60653536 #122} {824739690304 *} {824822563712 *} {21504 } {86016 } {824640623504 #1418}] [] true})
- asm_amd64.s:1333 goexit({[] [] false})
Anyway, what this looks like is that the moment CDC started doing its thing, things went south, though it's unclear to me exactly how.
There's a GC in progress, and look at that idle time --
goroutine 2 [force gc (idle), 57 minutes]:
runtime.gopark(0x33dc140, 0x5acf5c0, 0x1410, 0x1)
/usr/local/go/src/runtime/proc.go:302 +0xeb fp=0xc00008af80 sp=0xc00008af60 pc=0x72c6db
runtime.goparkunlock(0x5acf5c0, 0x1410, 0x1)
/usr/local/go/src/runtime/proc.go:308 +0x53 fp=0xc00008afb0 sp=0xc00008af80 pc=0x72c783
runtime.forcegchelper()
/usr/local/go/src/runtime/proc.go:251 +0xb3 fp=0xc00008afe0 sp=0xc00008afb0 pc=0x72c553
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc00008afe8 sp=0xc00008afe0 pc=0x75aeb1
created by runtime.init.4
/usr/local/go/src/runtime/proc.go:240 +0x35
goroutine 3 [GC sweep wait]:
runtime.gopark(0x33dc140, 0x5ad60a0, 0x75140c, 0x1)
/usr/local/go/src/runtime/proc.go:302 +0xeb fp=0xc00008b780 sp=0xc00008b760 pc=0x72c6db
runtime.goparkunlock(0x5ad60a0, 0x398140c, 0x1)
/usr/local/go/src/runtime/proc.go:308 +0x53 fp=0xc00008b7b0 sp=0xc00008b780 pc=0x72c783
runtime.bgsweep(0xc0000b4000)
/usr/local/go/src/runtime/mgcsweep.go:71 +0x102 fp=0xc00008b7d8 sp=0xc00008b7b0 pc=0x71e352
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc00008b7e0 sp=0xc00008b7d8 pc=0x75aeb1
created by runtime.gcenable
/usr/local/go/src/runtime/mgc.go:216 +0x58
goroutine 4 [finalizer wait, 57 minutes]:
runtime.gopark(0x33dc140, 0x5b2ab08, 0x140f, 0x1)
/usr/local/go/src/runtime/proc.go:302 +0xeb fp=0xc00008bf28 sp=0xc00008bf08 pc=0x72c6db
runtime.goparkunlock(0x5b2ab08, 0x33d140f, 0x1)
/usr/local/go/src/runtime/proc.go:308 +0x53 fp=0xc00008bf58 sp=0xc00008bf28 pc=0x72c783
runtime.runfinq()
/usr/local/go/src/runtime/mfinal.go:175 +0x99 fp=0xc00008bfe0 sp=0xc00008bf58 pc=0x715c69
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc00008bfe8 sp=0xc00008bfe0 pc=0x75aeb1
created by runtime.createfing
/usr/local/go/src/runtime/mfinal.go:156 +0x61
but all the GC workers are idle (there are like a dozen of those)
goroutine 11 [GC worker (idle)]:
runtime.gopark(0x33dbff0, 0xc000494020, 0x1417, 0x0)
/usr/local/go/src/runtime/proc.go:302 +0xeb fp=0xc00049af60 sp=0xc00049af40 pc=0x72c6db
runtime.gcBgMarkWorker(0xc00007ca00)
/usr/local/go/src/runtime/mgc.go:1772 +0xfc fp=0xc00049afd8 sp=0xc00049af60 pc=0x71955c
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc00049afe0 sp=0xc00049afd8 pc=0x75aeb1
created by runtime.gcBgMarkStartWorkers
/usr/local/go/src/runtime/mgc.go:1720 +0x77
Approximately 20 goroutines are stuck waiting for the Go rocksdb sync loop (none for long). The handful of goroutines in cgo seemed runnable.
The high memory usage is probably a symptom of what I just now discovered and wrote up in https://github.com/cockroachdb/cockroach/issues/35114#issuecomment-472506689. The summary is that we end up in a loop, unable to finish a catchup scan, emitting the same events over and over, and the changefeed can't keep up.
FYI, RangeFeeds currently have 2 goroutines per range server-side and 1 per range client-side (or something like that; it's definitely O(ranges)), so high goroutine counts are expected. I haven't looked specifically at whether that's what's going on here, but it could be expected.
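Back-of-the-envelope, that per-range multiplier alone explains goroutine counts in the thousands (a sketch; the 2-server/1-client constants come from the comment above and may not be exact):

```go
package main

import "fmt"

// rangefeedGoroutines estimates rangefeed goroutine load: roughly two
// server-side plus one client-side goroutine per watched range.
func rangefeedGoroutines(ranges int) int {
	const perRangeServer = 2
	const perRangeClient = 1
	return ranges * (perRangeServer + perRangeClient)
}

func main() {
	// A tpcc-1000 changefeed watching ~3000 ranges would account for ~9000
	// goroutines by itself, on the order of what the dump shows.
	fmt.Println(rangefeedGoroutines(3000))
}
```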
SHA: https://github.com/cockroachdb/cockroach/commits/04ef15974085e14f758b20c552a84052eac9fa2b
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1180753&tab=buildLog
The test failed on master:
cluster.go:1267,cdc.go:625,cdc.go:125,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1180753-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3} returned:
stderr:
stdout:
0 19.5 0.0 0.0 0.0 0.0 newOrder
3m36s 1269 0.0 3.1 0.0 0.0 0.0 0.0 orderStatus
3m36s 1269 0.0 27.4 0.0 0.0 0.0 0.0 payment
3m36s 1269 0.0 3.1 0.0 0.0 0.0 0.0 stockLevel
E190316 07:09:09.115193 1 workload/cli/run.go:420 error in payment: dial tcp 10.142.0.156:26257: connect: connection refused
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
3m37s 1617 0.0 2.9 0.0 0.0 0.0 0.0 delivery
3m37s 1617 0.0 19.4 0.0 0.0 0.0 0.0 newOrder
3m37s 1617 0.0 3.1 0.0 0.0 0.0 0.0 orderStatus
3m37s 1617 0.0 27.3 0.0 0.0 0.0 0.0 payment
3m37s 1617 0.0 3.1 0.0 0.0 0.0 0.0 stockLevel
: signal: killed
test.go:1202: test timed out (11h49m5.997039001s)
cluster.go:1626,cdc.go:213,cdc.go:368,test.go:1214: unexpected node event: 3: dead
test.go:978,asm_amd64.s:523,panic.go:513,log.go:219,cluster.go:926,context.go:90,cluster.go:916,test.go:1159,asm_amd64.s:522,panic.go:397,test.go:774,test.go:760,cluster.go:1626,cdc.go:213,cdc.go:368,test.go:1214: write /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190316-1180753/cdc/tpcc-1000/rangefeed=true/test.log: file already closed
SHA: https://github.com/cockroachdb/cockroach/commits/c2939ec9a4f15b7fb8683a5805deeb241953e7aa
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1182991&tab=buildLog
The test failed on master:
cluster.go:1267,cdc.go:625,cdc.go:125,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1182991-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3} returned:
stderr:
stdout:
0.0 0.0 0.0 0.0 0.0 delivery
15m41s 1408 0.0 0.0 0.0 0.0 0.0 0.0 newOrder
15m41s 1408 0.0 0.5 0.0 0.0 0.0 0.0 orderStatus
15m41s 1408 0.0 0.0 0.0 0.0 0.0 0.0 payment
15m41s 1408 0.0 0.0 0.0 0.0 0.0 0.0 stockLevel
15m42s 1507 0.0 0.0 0.0 0.0 0.0 0.0 delivery
15m42s 1507 0.0 0.0 0.0 0.0 0.0 0.0 newOrder
15m42s 1507 0.0 0.5 0.0 0.0 0.0 0.0 orderStatus
15m42s 1507 0.0 0.0 0.0 0.0 0.0 0.0 payment
15m42s 1507 0.0 0.0 0.0 0.0 0.0 0.0 stockLevel
E190318 07:49:40.680374 1 workload/cli/run.go:420 error in payment: dial tcp 10.142.0.73:26257: connect: connection refused
: signal: killed
cluster.go:1626,cdc.go:213,cdc.go:368,test.go:1214: dial tcp 35.231.94.136:26257: connect: connection refused
Hey, what happened two up (node 3 dead)? No node logs are present, neither fetched nor included in the debug zip. The test.log indicates an attempt to get them wasn't even made. The test ran for 12h, so I assume the cluster expired or something like that. Wish that were clearer (i.e. I hope I get around to improving that UX someday).
The last test failure has logs, but the node seems to have just gotten killed, much earlier in. It probably ran out of memory, as these tests tend to these days.
The last failure has all kinds of crazy stuff: TONS of splits in the logs, node liveness failures, waiting on a latch for 1 minute, Go memory usage growth, and more than 1000 occurrences of publish (1 leases): desc=[{order_line 60 1}]. It does look like the memory usage was ultimately what did it in.
SHA: https://github.com/cockroachdb/cockroach/commits/f95d45653df6be5587fb9887de241f50b6932000
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1185396&tab=buildLog
The test failed on master:
cluster.go:1267,cdc.go:625,cdc.go:125,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1185396-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3} returned:
stderr:
stdout:
l
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
25s 0 0.0 19.7 0.0 0.0 0.0 0.0 delivery
25s 0 100.1 61.0 6710.9 6979.3 6979.3 6979.3 newOrder
25s 0 0.0 21.1 0.0 0.0 0.0 0.0 orderStatus
25s 0 42.0 200.0 21474.8 22548.6 22548.6 22548.6 payment
25s 0 0.0 19.9 0.0 0.0 0.0 0.0 stockLevel
26s 0 0.0 18.9 0.0 0.0 0.0 0.0 delivery
26s 0 268.7 69.0 7784.6 8053.1 8053.1 8053.1 newOrder
26s 0 0.0 20.3 0.0 0.0 0.0 0.0 orderStatus
26s 0 24.0 193.2 22548.6 23622.3 23622.3 23622.3 payment
26s 0 0.0 19.1 0.0 0.0 0.0 0.0 stockLevel
: signal: killed
cluster.go:1626,cdc.go:213,cdc.go:368,test.go:1214: unexpected status: failed
SHA: https://github.com/cockroachdb/cockroach/commits/25398c010b2af75b11fed189680ea6b9645f0cf5
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1199659&tab=buildLog
The test failed on master:
cluster.go:1193,cdc.go:66,cdc.go:376,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1199659-cdc-tpcc-1000-rangefeed-true:1-3 --encrypt returned:
stderr:
stdout:
LE_RPC_COMPRESSION=false ./cockroach start --insecure --store=path=/mnt/data1/cockroach --log-dir=${HOME}/logs --background --cache=25% --max-sql-memory=25% --port=26257 --http-port=26258 --locality=cloud=gce,region=us-east1,zone=us-east1-b --enterprise-encryption=path=/mnt/data1/cockroach,key=/mnt/data1/cockroach/aes-128.key,old-key=plain >> ${HOME}/logs/cockroach.stdout 2>> ${HOME}/logs/cockroach.stderr || (x=$?; cat ${HOME}/logs/cockroach.stderr; exit $x)
Connection to 34.73.99.17 closed by remote host.
github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func7
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:397
github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1320
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1333:
I190326 06:27:40.520369 1 cluster_synced.go:1402 command failed
: exit status 1
The last one failed in roachprod start.
SHA: https://github.com/cockroachdb/cockroach/commits/a1e6e9decc9dec15a32bbb6d30efc67ca45a532a
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1204585&tab=buildLog
The test failed on release-19.1:
cluster.go:1267,cdc.go:633,cdc.go:133,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1204585-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3} returned:
stderr:
stdout:
l
2m16s 0 17.0 20.2 92.3 151.0 151.0 151.0 delivery
2m16s 0 237.7 196.9 56.6 151.0 176.2 209.7 newOrder
2m16s 0 23.0 21.2 12.1 22.0 54.5 54.5 orderStatus
2m16s 0 249.7 214.1 27.3 159.4 192.9 201.3 payment
2m16s 0 22.0 21.2 130.0 218.1 243.3 243.3 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
2m17s 0 14.0 20.1 96.5 130.0 134.2 134.2 delivery
2m17s 0 206.0 197.0 56.6 121.6 142.6 151.0 newOrder
2m17s 0 26.0 21.2 11.5 39.8 48.2 48.2 orderStatus
2m17s 0 213.0 214.0 30.4 100.7 117.4 121.6 payment
2m17s 0 18.0 21.2 125.8 285.2 285.2 285.2 stockLevel
: signal: killed
cluster.go:1626,cdc.go:221,cdc.go:376,test.go:1216: unexpected status: failed
SHA: https://github.com/cockroachdb/cockroach/commits/3aadd20bbf0940ef65f8b2cdcda498401ba5d9c6
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1206925&tab=buildLog
The test failed on release-19.1:
cluster.go:1293,cdc.go:633,cdc.go:133,cluster.go:1631,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1206925-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3} returned:
stderr:
stdout:
l
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
33s 0 29.0 23.7 83.9 142.6 151.0 151.0 delivery
33s 0 182.0 163.4 48.2 125.8 142.6 159.4 newOrder
33s 0 20.0 23.2 8.4 35.7 39.8 39.8 orderStatus
33s 0 247.0 238.5 30.4 104.9 121.6 125.8 payment
33s 0 32.0 25.2 104.9 159.4 159.4 159.4 stockLevel
34s 0 19.0 23.6 83.9 142.6 151.0 151.0 delivery
34s 0 150.0 163.0 52.4 109.1 121.6 130.0 newOrder
34s 0 22.0 23.1 9.4 19.9 27.3 27.3 orderStatus
34s 0 245.9 238.7 31.5 88.1 109.1 142.6 payment
34s 0 32.0 25.4 113.2 167.8 318.8 318.8 stockLevel
: signal: killed
cluster.go:1652,cdc.go:221,cdc.go:376,test.go:1223: unexpected status: failed
The failure is from an error that's not in the whitelist. We haven't backported the switch to a blacklist to release-19.1 yet: #36132
SHA: https://github.com/cockroachdb/cockroach/commits/668162cc99e4f3198b663b1abfa51858eeb3ccb8
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212251&tab=buildLog
The test failed on master:
cluster.go:1071,cdc.go:65,cdc.go:376,test.go:1223: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod put teamcity-1212251-cdc-tpcc-1000-rangefeed-true:4 /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/workload ./workload returned:
stderr:
stdout:
teamcity-1212251-cdc-tpcc-1000-rangefeed-true: putting (dist) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/workload ./workload
........................................................................................................................
4: ~ scp -r -C -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa -i /root/.ssh/google_compute_engine /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/workload root@35.231.237.7:./workload
Connection closed by 35.231.237.7 port 22
lost connection
: exit status 1
I190401 06:26:06.922832 1 cluster_synced.go:927 put /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/workload failed
: exit status 1
SHA: https://github.com/cockroachdb/cockroach/commits/a039a93a5cc6eb3f395ceb6f7dc8030985dccc29
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212269&tab=buildLog
The test failed on release-19.1:
cdc.go:750,cdc.go:223,cdc.go:376,test.go:1223: initial scan did not complete
initial scan did not complete
Well, that's a bit concerning.
SHA: https://github.com/cockroachdb/cockroach/commits/2851c7d56ee4966109691b5c48c73ec8d4cc9847
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1215354&tab=buildLog
The test failed on master:
cdc.go:859,cdc.go:224,cdc.go:473,test.go:1226: initial scan did not complete
Finally got a chance to write up my triage of that initial scan failure. Looks like at least one span got stuck. Log from the end of the test:
ccl/changefeedccl/changefeed_processors.go:567 [n1] job 439196333105479681 span /Table/57/1/477/4/-969{-/0} is behind by 2h41m15.2891246s
One odd thing is that the first "behind by" message for this span is
I190401 07:08:59.781789 54191 ccl/changefeedccl/changefeed_processors.go:567 [n1] job 439196333105479681 span /Table/57/1/477/4/-969{-/0} is behind by 40m55.69528427s
but the changefeed was started with a cursor set to "2019-04-01 07:08:03.037043359". There's not a ton we can do based on this without knowing what got stuck. I discussed some additional logging to add with @tbg this morning in our 1:1. PR coming when I get the cycles.
Dunno if this is related, but there are a few "slow heartbeat took" and "failed node liveness heartbeat" logs from when the changefeed starts up (probably the TBI-free initial scans). Maybe we need to adjust the default limit for how many we allow at once.
Also, we see tons of health alerts for goroutines, but these are sadly expected with rangefeed-based changefeeds. Probably something to tune here:
W190401 07:14:10.240815 458 server/node.go:880 [n1,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:sys.goroutines Value:10092 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Looked at the failure from today. On this one, some of the watched ranges never make it to a non-zero resolved timestamp. Quite interestingly, the range in the message changes over time. The place it comes from, spanFrontier, sorts first by time, then by start key, so it looks like things are unblocking very slowly (or the ranges are splitting/merging, but that doesn't seem to be the case from the logs).
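To illustrate that ordering (an illustrative sketch only; the real spanFrontier in changefeedccl is more involved, and these types are made up):

```go
package main

import (
	"fmt"
	"sort"
)

// frontierEntry mimics one tracked span: its start key and the latest
// resolved wall time seen for it.
type frontierEntry struct {
	startKey string
	resolved int64 // nanos
}

// leastAdvanced returns the entry the "behind by" message would name:
// lowest resolved timestamp first, ties broken by start key.
func leastAdvanced(entries []frontierEntry) frontierEntry {
	sort.Slice(entries, func(i, j int) bool {
		if entries[i].resolved != entries[j].resolved {
			return entries[i].resolved < entries[j].resolved
		}
		return entries[i].startKey < entries[j].startKey
	})
	return entries[0]
}

func main() {
	e := leastAdvanced([]frontierEntry{
		{"/Table/57/1/500", 200},
		{"/Table/57/1/477", 100},
		{"/Table/57/1/400", 100}, // same timestamp, earlier key wins the tie
	})
	// As one span's timestamp finally advances, a different (now-least) span
	// shows up in the next message, matching the observed churn.
	fmt.Println(e.startKey)
}
```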
SHA: https://github.com/cockroachdb/cockroach/commits/1cbf3680129e47bd310640bf32b665662f30faa9
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1217781&tab=buildLog
The test failed on release-19.1:
cdc.go:867,cdc.go:225,cdc.go:481,test.go:1228: initial scan did not complete
SHA: https://github.com/cockroachdb/cockroach/commits/c6df752eefe4609b8a5bbada0955f79a2cfb790e
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1217763&tab=buildLog
The test failed on master:
cdc.go:867,cdc.go:225,cdc.go:481,test.go:1228: initial scan did not complete
Well, it seems like this is easy to repro. I wonder what changed.
I wrote a new gauge to track how far behind the most behind closed timestamp is and ran the cdc/tpcc-1000 test. The changefeed is doing okay (end-to-end steady latency 16.065162494s; max steady latency so far 23.242592094s), but the gauge is saying that at least one replica on each node is stuck. So presumably it's not one of the ranges being watched by the changefeed. cc @tbg is this expected?
Note: I tested this with a local single-node cluster and it was flat. I don't have a screenshot of this, but it also went down when I adjusted the closed timestamp interval down.
In-progress PR for the gauge, which needs some cleanup: https://github.com/cockroachdb/cockroach/compare/master...danhhz:cdc_slowrange_log?expand=1
Hmm. I think the root problem here will have to do with the MLAI map. If we don't have data for a range (i.e. we haven't received any MLAI for it which we need to match, and perhaps now the range is idle and isn't sending new ones), then we'll get stuck like that. Typically replicas call `.Request` to get out of a situation like that, but to be honest I'm struggling a bit to remember how they get into it in the first place. Any activity on a range leads to an update for that range being sent to all connected peers, so early in the life of the cluster it's conceivable that we're not sending a few updates because the peer is not yet connected. If the range is dormant after that, no new updates will be sent, and so followers having missed the initial one would remain in that state until something calls `.Request` (which only an incoming request or lease change, etc., will). The code is here:
I sent a PR https://github.com/cockroachdb/cockroach/pull/36523 to expose a bit more of the closed timestamp in the range status, which you may want to peruse. You can python-grab the lowest ActiveClosedTimestamp and then you know what range to look at.
In the meantime, perhaps @ajwerner or @nvanbenschoten remember when you're expected to need to call `.Request`. It seems hard to guarantee that you never have to, but I'm not aware of common reasons to have to once the ct clients are fully connected between all nodes.
Ran another tpcc-1000 with your PR as well as my gauge and saw the following. Dunno what to make of it. I'm 95% sure that when I did this yesterday, it was a solid line for every node. This is still during IMPORT, whereas yesterday I was looking during the CHANGEFEED, so that could be the difference. In fact, I vaguely recall that closedts resets during splits or rebalances or something, which certainly occur during IMPORT.
That's even weirder if you ask me. This looks like there's a replica most of the time that hasn't made progress "forever", but during the dips that replica isn't there. I'd expect that once we drop, the linear line has to start from that point onwards. Maybe this is about rebalancing? When a new replica is created, for a moment the active `maxClosed` may be the lease start timestamp (which is basically not moving unless there's a lease transfer) (but why? Shouldn't there be an MLAI for that replica in almost all cases?). Are you running the test on >3 nodes, i.e. is there even any rebalancing? (I'd start on three nodes to narrow this down.)
You could print the RangeStatus for the "most behind range" printf-style to get a better idea.
This is a 3 node cluster. The flopping didn't survive a restart to nodes with more logging, sadly. Now I'm seeing the original behavior. Looks like the most behind ranges are missing the `newest_closed_timestamp` and indeed inheriting the lease start (`1554392854243628195` for both below).
I190404 21:35:27.069208 297 storage/store.go:4198 [n1,s1] WIP minMaxClosedTS 1554392854.243628195,0 5h47m52.825401835s {"state":{"raft_applied_index":24,"lease_applied_index":8,"desc":{"range_id":256,"start_key":"xImpj/cFdA==","end_key":"xImpj/cFdIg=","replicas":[{"node_id":1,"store_id":1,"replica_id":1},{"node_id":3,"store_id":3,"replica_id":2},{"node_id":2,"store_id":2,"replica_id":3}],"next_replica_id":4,"generation":1},"lease":{"start":{"wall_time":1554392854243628195},"replica":{"node_id":3,"store_id":3,"replica_id":2},"proposed_ts":{"wall_time":1554392854243632815},"epoch":1,"sequence":2},"truncated_state":{"index":10,"term":5},"gc_threshold":{},"stats":{"contains_estimates":false,"last_update_nanos":1554392872454974468,"intent_age":0,"gc_bytes_age":0,"live_bytes":0,"live_count":0,"key_bytes":0,"key_count":0,"val_bytes":0,"val_count":0,"intent_bytes":0,"intent_count":0,"sys_bytes":338,"sys_count":5},"txn_span_gc_threshold":{},"using_applied_state_key":true},"last_index":24,"range_max_bytes":67108864,"newest_closed_timestamp":{"closed_timestamp":{}},"active_closed_timestamp":{"wall_time":1554392854243628195}}
I'd like to clarify something. In a healthy cluster, should this new metric I added be approximately the closed timestamp interval? Aka should every replica on every node be getting timestamps closed under normal conditions? Or is that what you were talking about with the `.Request` stuff? If so, does that need to be called once on each replica to kick stuff off, or regularly?
SHA: https://github.com/cockroachdb/cockroach/commits/2eb3f60e6f88a4f19ec7472f010d20287488f71d
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1227248&tab=buildLog
The test failed on release-19.1:
cdc.go:869,cdc.go:225,cdc.go:481,test.go:1237: initial scan did not complete
1) "initial scan did not complete" here and above is actually a bug in the test. The initial scan did complete, but the roachtest only counts it as such once the resolved timestamp moves a second time.
2) I checked our new metric in the debug.zip. It says the range with the min resolved timestamp is indeed way back around when the test started. Sadly, this might be a false alarm because the range might not be one of the ones watched by the changefeed. However...
3) The range status stuff @tbg added does indicate that, at the point the debug.zip was taken, at least some of the ranges had closed timestamps way behind because the entry was missing from the node's closed timestamp storage. Range 1002, which is in table 58 and so definitely being watched by the changefeed:
"newest_closed_timestamp": {
"closed_timestamp": {}
},
"active_closed_timestamp": {
"wall_time": 1554704872573832927
}
{ "span": { "start_key": "/Table/58/2/963/10/\"ڻ\\r\\xb0NPD\\\"\\x97^o\\x9f\\x84\\xfc\\x17\\x9f\"/PrefixEnd", "end_key": "/Table/58/2/963/10/\"\\xda\\xca,po\\xd7@ҕW\\xb4\\xc8\\xf0\\xf6;\\x82\"/0" }, "raft_state": { "replica_id": 1, "hard_state": { "term": 10, "vote": 2, "commit": 21 }, "lead": 2, "state": "StateFollower", "applied": 21, "progress": null }, "state": { "state": { "raft_applied_index": 21, "lease_applied_index": 5, "desc": { "range_id": 1002, "start_key": "wor3A8OSEtq7DbBOUEQil15vn4T8F58AAg==", "end_key": "wor3A8OSEtrKLHBv10DSlVe0yPD2O4IAAYg=", "replicas": [ { "node_id": 1, "store_id": 1, "replica_id": 1 }, { "node_id": 2, "store_id": 2, "replica_id": 2 }, { "node_id": 3, "store_id": 3, "replica_id": 3 } ], "next_replica_id": 4, "generation": 1 }, "lease": { "start": { "wall_time": 1554704531959801761 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 2 }, "proposed_ts": { "wall_time": 1554713910279390153 }, "epoch": 2, "sequence": 3 }, "truncated_state": { "index": 10, "term": 5 }, "gc_threshold": {}, "stats": { "contains_estimates": false, "last_update_nanos": 1554704930825095362, "intent_age": 0, "gc_bytes_age": 0, "live_bytes": 0, "live_count": 0, "key_bytes": 0, "key_count": 0, "val_bytes": 0, "val_count": 0, "intent_bytes": 0, "intent_count": 0, "sys_bytes": 366, "sys_count": 4 }, "txn_span_gc_threshold": {}, "using_applied_state_key": true }, "last_index": 21, "raft_log_size": 2685, "raft_log_size_trusted": true, "range_max_bytes": 67108864, "newest_closed_timestamp": { "closed_timestamp": {} }, "active_closed_timestamp": { "wall_time": 1554704872573832927 } }, "source_node_id": 1, "source_store_id": 1, "lease_history": [ { "start": { "wall_time": 1554704531959801761 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 2 }, "proposed_ts": { "wall_time": 1554704531959804376 }, "epoch": 1, "sequence": 2 }, { "start": { "wall_time": 1554704531959801761 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 2 }, "proposed_ts": { 
"wall_time": 1554713910279390153 }, "epoch": 2, "sequence": 3 } ], "problems": {}, "stats": { "writes_per_second": 0.0006061913601019598 }, "latches_local": {}, "latches_global": {}, "lease_status": { "lease": { "start": { "wall_time": 1554704531959801761 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 2 }, "proposed_ts": { "wall_time": 1554713910279390153 }, "epoch": 2, "sequence": 3 }, "timestamp": { "wall_time": 1554713913181054479 }, "state": 1, "liveness": { "node_id": 2, "epoch": 2, "expiration": { "wall_time": 1554713922563714397, "logical": 0 } } }, "quiescent": true }
The logic regarding which epoch to use when publishing updates feels sketchy (https://github.com/cockroachdb/cockroach/blob/6ab548361d310481fc724ab9be70c165a0a4487d/pkg/storage/closedts/provider/provider.go#L171). It makes sense that the code does not want to publish a closed timestamp that straddles epochs (hence the logic to publish with the lastEpoch as opposed to the current epoch) but it seems problematic if, as in this case, an epoch change leads to a new lease which leads to a quiescent range emitting just one MLAI and then never again. If that whole thing straddles two calls to Close then the emitted MLAI will still be pinned to the old epoch and never observed under the new one.
Maybe we need to publish the entire state of the MLAI map on epoch changes?
cc @tbg
I’ll propose three solutions:
1) track epochs with MLAIs
Right now we’re guessing that the current values belong to the epoch that preceded the current call to close. We have access to the epoch whenever we publish an MLAI so it wouldn’t be difficult to do. The negatives would be the overhead of potentially tracking multiple maps in the mpt and possibly needing to send multiple updates per call to Close.
2) Publish a complete update with information from all ranges when epochs change
This is heavy and seems like a pretty big refactor, but could work. We’d need to ensure that these updates don’t get included in the previous epoch. I don’t have a precise vision of how I’d expose this, but as a sketch it could work. It risks being chatty if epochs start flying by, but so too does 1).
3) Detect when a closed timestamp is falling behind and explicitly request it.
This doesn’t feel too crazy to implement and seems like it may work, but is unsatisfying. This solution papers over problems with lost updates in the closed timestamp subsystem without solving them or forcing anybody to understand them. It may be the simplest path forward with the least invasive changes.
1) feels right to me because we only ever read MLAI relative to an epoch, but 3) seems okay as a quick fix. Maybe we should do 3) and then supplement it with 1) and add metrics/telemetry?
SHA: https://github.com/cockroachdb/cockroach/commits/57274fc2cd56e5e57957db59acbbcf4f35c118bc
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1229018&tab=buildLog
The test failed on master:
cluster.go:1329,cdc.go:748,cdc.go:135,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1229018-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3} returned:
stderr:
stdout:
: signal: killed
cluster.go:1688,cdc.go:223,cdc.go:481,test.go:1237: unexpected node event: 2: dead
cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1688,cdc.go:223,cdc.go:481,test.go:1237: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1229018-cdc-tpcc-1000-rangefeed-true --oneshot --ignore-empty-nodes: exit status 1 4: skipped
2: dead
1: 3943
3: 4058
Error: 2: dead
So what's the exact scenario here? Is it something like this:
`.Request` is needed at each peer.
First of all, if there isn't any activity on the range, no MLAI will be emitted no matter what we do, and that will require an explicit call to `.Request` (or acquiring a lease, mod the timing problem above) anyway: the followers will continue to use the old epoch (from the lease), but that old epoch will be rotated out of the closedts storage in due time. I think for RangeFeed we explicitly call `redirectOnOrAcquireLease` to trigger "things". If we called `.Request` immediately after that, wouldn't we avoid the issue above without any further changes? edit: no, the MLAI could just land in the wrong epoch as discussed above.
If we want to make it less likely that calls to `Request` are necessary, associating epochs to each MLAI seems reasonable (so that the lease change would be associated to epoch 12), but do we even need all that? If we see MLAIs collected (over a closed timestamp `Close` period) with more than one epoch, won't we automatically know that those for the lower epoch are not going to be used? It seems that what we want to do in this code is to ask the proposal to pass the epoch of the lease into the returned closure. Internally, we keep track of the max epoch seen and, if we see a higher epoch, we throw out the stuff we have so far (also, ignore anything new not matching the max epoch after). Then, when `Close` is called, it returns an epoch that is associated to all entries in the MLAI map, and this is used instead in this code:
We still have to verify that the epoch is live at the closed timestamp.
> First of all, if there isn't any activity on the range, no MLAI will be emitted no matter what we do, and that will require an explicit call to .Request (or acquiring a lease, mod the timing problem above) anyway: the followers will continue to use the old epoch (from the lease) but that old epoch will be rotated out of the closedts storage in due time. I think for RangeFeed we explicitly call redirectOnOrAcquireLease to trigger "things". If we called .Request immediately after that, wouldn't we avoid the issue above without any further changes? edit: no, the MLAI could just land in the wrong epoch as discussed above.
For the scenario in question, we have a RangeFeed established, but then a change in node liveness epoch leads to a new lease, which does lead to an MLAI being produced, but it ends up in the wrong epoch given the lastEpoch logic.
> is to ask the proposal to pass the epoch of the lease into the returned closure. Internally, we keep track of the max epoch seen and, if we see a higher epoch, we throw out the stuff we have so far (also, ignore anything new not matching the max epoch after). Then, when Close is called, it returns an epoch that is associated to all entries in the MLAI map, and this is used instead in this code:
This is a good idea! Much simpler than keeping all of the MLAIs corresponding to each Epoch. Do you see a reason not to make that change? I'd be happy to type it.
We could, in addition, defensively add code to explicitly `Request` when we detect in RangeFeed that we're failing to ratchet the resolved timestamp. That call could happen either here or underneath inside the rangefeed.Processor.
> We could, in addition, defensively add code to explicitly Request when we detect in RangeFeed that we're failing to ratchet the resolved timestamp. That call could happen either here or underneath inside the rangefeed.Processor.
This is probably the only change that we should make for 19.1 (assuming it can fix things, which it should if it runs at some cadence at least). Let's try that out first, just to wrap up 19.1.
> This is a good idea! Much simpler than keeping all of the MLAIs corresponding to each Epoch. Do you see a reason not to make that change? I'd be happy to type it.
Absolutely, please do! That change should be good unless there's some snag we're missing (in which case you'll find out when you try to do it, hopefully). Let's also think about scenarios in which this isn't enough unless the caller also calls `.Request`. If the leaseholder changes, the new leaseholder calls `EmitMLAI`, so we should be good (assuming the epoch we're adding to that is the right one, but seems like we should be able to manage). Then the only other scenario is the one in which everything is quiescent and there just isn't a lease for any live node's epoch. I guess this could still happen to RangeFeed (if it didn't call `Request` periodically on failures to close out): it calls `redirectOnOrAcquireLease`, but immediately after that the epoch for that lease times out, and the range is idle so nothing happens. We end up with a lease from a past epoch and nobody asking to get a new lease. Calling `.Request` will fix that by re-emitting the MLAI with its updated epoch.
Fingers crossed this was closed by #36684
SHA: https://github.com/cockroachdb/cockroach/commits/46d1ab4c06edfd37d875114972c55b44105acd83
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1147903&tab=buildLog