cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: cdc/tpcc-1000/rangefeed=true failed #35142

Closed: cockroach-teamcity closed this issue 5 years ago

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/46d1ab4c06edfd37d875114972c55b44105acd83

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1147903&tab=buildLog

The test failed on master:
    cluster.go:1226,cdc.go:612,cdc.go:124,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1147903-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
        stderr:

        stdout:
        l
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
            1m1s        0           23.0           20.9    130.0    671.1    771.8    771.8 delivery
            1m1s        0          221.2          169.0     83.9    520.1    604.0    771.8 newOrder
            1m1s        0           20.0           21.6     13.1     48.2     60.8     60.8 orderStatus
            1m1s        0          177.1          213.6     35.7    218.1    260.0    268.4 payment
            1m1s        0           20.0           21.5    125.8    209.7    234.9    234.9 stockLevel
            1m2s        0           20.0           20.9    134.2    226.5    251.7    251.7 delivery
            1m2s        0          238.8          170.1     83.9    167.8    209.7    260.0 newOrder
            1m2s        0           17.0           21.5     12.6     22.0     26.2     26.2 orderStatus
            1m2s        0          217.8          213.7     54.5    151.0    184.5    201.3 payment
            1m2s        0           29.0           21.6    176.2    218.1    243.3    243.3 stockLevel
        : signal: killed
    cluster.go:1585,cdc.go:212,cdc.go:367,test.go:1211: unexpected status: failed
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/d0a93d286ee42ceb94f986b6a06a1afd52cf914e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1149020&tab=buildLog

The test failed on master:
    cluster.go:1226,cdc.go:612,cdc.go:124,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1149020-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
        stderr:

        stdout:
        Error: read tcp 10.142.0.61:56362->10.142.0.130:26257: read: connection reset by peer
        Error:  exit status 1
        : exit status 1
    cluster.go:1585,cdc.go:212,cdc.go:367,test.go:1211: Goexit() was called
tbg commented 5 years ago

^-- something bad is going on in this last one -- goroutine count on n1 explodes:

[n1] runtime stats: 5.6 GiB RSS, 8539 goroutines

Errors just before that in the logs:

190223 08:11:09.333526 116 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {teamcity-1149020-cdc-tpcc-1000-rangefeed-true-0001:26257 0 }. Err :connection error: desc = "transport: failed to write client preface: io: read/write on closed pipe". Reconnecting...

This happens multiple times and is odd because this node itself is -0001:26257.

Then this message is noteworthy because of the insane behind count:

I190223 08:11:15.731353 50034 ccl/changefeedccl/changefeed_processors.go:562 [n1] job 428733347495510017 span /Table/53/1/{0/1/-3001/1/0-1/10/-1031/8} is behind by 430808h11m15.72977394s

The goroutine count returns to normal with this message:

I190223 08:12:07.865814 50034 ccl/changefeedccl/changefeed_processors.go:562 [n1] job 428733347495510017 span /Table/53/1/{1/10/-1031/8/0-3/10/-1952/3} is behind by 430808h12m7.865763101s
I190223 08:12:15.587174 50034 ccl/changefeedccl/changefeed_stmt.go:444 [n1] CHANGEFEED job 428733347495510017 returning with error: cannot update progress on canceled job (id 428733347495510017)

Might be coincidence, but perhaps not.

In the test runner log itself, the failure is

08:10:34 cluster.go:253: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1149020-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3} 
changefeed: 08:10:34 cdc.go:187: started changefeed at (1550909433231766446) 2019-02-23 08:10:33.231766446 +0000 UTC
2: 3483
1: 4776
3: 4766
Error: read tcp 10.142.0.61:56362->10.142.0.130:26257: read: connection reset by peer
Error:  exit status 1

I don't know who prints the Error: lines; it's weird that there's no stack trace. Is this from the workload?

Btw, none of the nodes seem to have died or been oom-killed.

@danhhz any ideas?

ps #33318 would have helped here as well.

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/71681f60031c101f17339236e2ba75f7a684d1a1

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1155867&tab=buildLog

The test failed on master:
    cluster.go:1226,cdc.go:612,cdc.go:124,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1155867-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
        stderr:

        stdout:
               0            1.0           18.4   7784.6   7784.6   7784.6   7784.6 stockLevel
             27s        0           17.0           19.2   7247.8   8321.5   8321.5   8321.5 delivery
             27s        0          501.4          128.2   8589.9   9126.8   9126.8   9126.8 newOrder
             27s        0           55.9           19.8   6174.0   7784.6   8321.5   8321.5 orderStatus
             27s        0          222.7          160.0  23622.3  24696.1  24696.1  24696.1 payment
             27s        0           43.9           19.4   6442.5   8321.5   8589.9   8589.9 stockLevel
             28s        0           69.1           21.0   4026.5   8321.5   8321.5   9126.8 delivery
             28s        0          336.5          135.6   9663.7  10200.5  10200.5  10200.5 newOrder
             28s        0           52.1           20.9   2952.8   5368.7   6174.0   6174.0 orderStatus
             28s        0          309.5          165.3   7516.2  24696.1  25769.8  25769.8 payment
             28s        0           57.1           20.7   3355.4   6442.5   7247.8   7247.8 stockLevel
        : signal: killed
    cluster.go:1585,cdc.go:212,cdc.go:367,test.go:1211: unexpected status: failed
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/032c4980720abc1bdd71e4428e4111e6e6383297

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1158877&tab=buildLog

The test failed on master:
    cluster.go:1226,cdc.go:612,cdc.go:124,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1158877-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
        stderr:

        stdout:
               0           63.8           16.0   2281.7   4831.8  13421.8  15032.4 stockLevel
             43s        0           15.0           17.2    209.7    402.7    419.4    419.4 delivery
             43s        0          332.7          105.3  20401.1  24696.1  25769.8  25769.8 newOrder
             43s        0           17.0           15.8     18.9     71.3    121.6    121.6 orderStatus
             43s        0          568.4          146.0   4563.4  19327.4  40802.2  40802.2 payment
             43s        0           39.0           16.5    335.5  11274.3  16106.1  16106.1 stockLevel
             44s        0           22.1           17.3    503.3   1342.2   1476.4   1476.4 delivery
             44s        0          270.9          109.1   3489.7  25769.8  25769.8  25769.8 newOrder
             44s        0           23.1           16.0     16.3     62.9     75.5     75.5 orderStatus
             44s        0          320.0          149.9   2818.6  23622.3  40802.2  40802.2 payment
             44s        0           20.1           16.6    209.7  17179.9  17179.9  17179.9 stockLevel
        : signal: killed
    cluster.go:1585,cdc.go:212,cdc.go:367,test.go:1211: unexpected status: failed
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/e6f0a720a98c60ed0545e7cbb4b2f224b463cae9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1165130&tab=buildLog

The test failed on master:
    cdc.go:733,cdc.go:214,cdc.go:367,test.go:1214: initial scan did not complete
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/38bb1e7905b89f911bd74be4f5830217ffb7b343

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1168752&tab=buildLog

The test failed on master:
    cdc.go:733,cdc.go:214,cdc.go:367,test.go:1214: initial scan did not complete
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/5f7de3c35348e847dc0d4ab1ba8d76574e1706c2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1169980&tab=buildLog

The test failed on master:
    cluster.go:1244,cdc.go:612,cdc.go:124,cluster.go:1582,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1169980-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
        stderr:

        stdout:
               0           19.0           21.2    125.8    192.9    192.9    192.9 stockLevel
          15m10s        0           23.9           21.1    113.2    260.0    260.0    260.0 delivery
          15m10s        0          209.5          211.2     60.8    209.7    268.4    268.4 newOrder
          15m10s        0           27.9           21.3     13.6     32.5     44.0     44.0 orderStatus
          15m10s        0          212.5          212.5     35.7    201.3    234.9    268.4 payment
          15m10s        0           21.9           21.2    100.7    176.2    184.5    184.5 stockLevel
          15m11s        0           27.1           21.1     92.3    130.0    159.4    159.4 delivery
          15m11s        0          176.4          211.1     54.5     83.9    113.2    167.8 newOrder
          15m11s        0           23.1           21.3     13.1     27.3     37.7     37.7 orderStatus
          15m11s        0          197.5          212.5     29.4     62.9    113.2    285.2 payment
          15m11s        0           16.0           21.2    104.9    142.6    201.3    201.3 stockLevel
        : signal: killed
    cluster.go:1603,cdc.go:212,cdc.go:367,test.go:1214: unexpected status: failed
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/a512e390f7f2f2629f3f811bab5866c46e3e5713

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1174765&tab=buildLog

The test failed on provisional_201903122203_v19.1.0-beta.20190318:
    cluster.go:1246,cdc.go:625,cdc.go:125,cluster.go:1584,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1174765-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
        stderr:

        stdout:
                158.4      0.0      0.0      0.0      0.0 newOrder
           12m8s      717            0.0           16.2      0.0      0.0      0.0      0.0 orderStatus
           12m8s      717            0.0          161.4      0.0      0.0      0.0      0.0 payment
           12m8s      717            0.0           16.2      0.0      0.0      0.0      0.0 stockLevel
        E190313 06:13:23.769766 1 workload/cli/run.go:419  error in orderStatus: dial tcp 10.128.0.31:26257: connect: connection refused
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
           12m9s      730            0.0           16.1      0.0      0.0      0.0      0.0 delivery
           12m9s      730            0.0          158.2      0.0      0.0      0.0      0.0 newOrder
           12m9s      730            0.0           16.2      0.0      0.0      0.0      0.0 orderStatus
           12m9s      730            0.0          161.2      0.0      0.0      0.0      0.0 payment
           12m9s      730            0.0           16.2      0.0      0.0      0.0      0.0 stockLevel
        : signal: killed
    cluster.go:1605,cdc.go:213,cdc.go:368,test.go:1214: unexpected node event: 2: dead
tbg commented 5 years ago

^- hey look at this. Also, 13k goroutines. Going to look at the breakdown shortly

W190313 06:10:20.813840 358 server/node.go:880  [n2,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:sys.goroutines Value:13173 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
I190313 06:10:26.345689 356 server/status/runtime.go:464  [n2] runtime stats: 6.0 GiB RSS, 13330 goroutines, 582 MiB/271 MiB/1.4 GiB GO alloc/idle/total, 4.0 GiB/4.8 GiB CGO alloc/total, 194757.3 CGO/sec, 905.2/160.2 %(u/s)time, 0.0 %gc (7x), 91 MiB/116 MiB (r/w)net
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8f78e8c pc=0x7f90dc4b5a4a]

runtime stack:
runtime.throw(0x32d2607, 0x2a)
    /usr/local/go/src/runtime/panic.go:608 +0x72 fp=0x7f8c2a3fe7e8 sp=0x7f8c2a3fe7b8 pc=0x72a972
runtime.sigpanic()
    /usr/local/go/src/runtime/signal_unix.go:374 +0x2f2 fp=0x7f8c2a3fe838 sp=0x7f8c2a3fe7e8 pc=0x740762

goroutine 66830 [syscall]:
non-Go function
    pc=0x7f90dc4b5a4a
non-Go function
    pc=0x2703f72
non-Go function
    pc=0x2959a3f
non-Go function
    pc=0x2951bd4
non-Go function
    pc=0x2701798
non-Go function
    pc=0x2706399
non-Go function
    pc=0x2559612
non-Go function
    pc=0x75a62f
runtime.cgocall(0x25595d0, 0xc048731e70, 0x2cfba80)
    /usr/local/go/src/runtime/cgocall.go:128 +0x5e fp=0xc048731e18 sp=0xc048731de0 pc=0x700f1e
github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBIterSeek(0x7f90bfe42080, 0xc03a0f3840, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    _cgo_gotypes.go:752 +0x74 fp=0xc048731e70 sp=0xc048731e18 pc=0x134d0c4
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Seek.func2(0x7f90bfe42080, 0xc03a0f3840, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:2125 +0xb4 fp=0xc048731f38 sp=0xc048731e70 pc=0x13653e4
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Seek(0xc0466a8268, 0xc03a0f3840, 0xa, 0x14, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:2125 +0x161 fp=0xc048732190 sp=0xc048731f38 pc=0x1359a21
github.com/cockroachdb/cockroach/pkg/storage/engine.(*batchIterator).Seek(0xc0466a8268, 0xc03a0f3840, 0xa, 0x14, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1427 +0x5b fp=0xc0487321d0 sp=0xc048732190 pc=0x135599b
github.com/cockroachdb/cockroach/pkg/storage/engine.mvccGetMetadata(0x3a177c0, 0xc0466a8268, 0xc03a0f3840, 0xa, 0x14, 0x0, 0x0, 0xc02201ed80, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/mvcc.go:747 +0x7c fp=0xc048732318 sp=0xc0487321d0 pc=0x133a25c
github.com/cockroachdb/cockroach/pkg/storage/engine.mvccPutInternal(0x39d80a0, 0xc035793410, 0x7f90dd2ff100, 0xc026826240, 0x3a177c0, 0xc0466a8268, 0xc0380f3580, 0xc03a0f3840, 0xa, 0x14, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/mvcc.go:1285 +0xe6 fp=0xc0487329a0 sp=0xc048732318 pc=0x133d696
github.com/cockroachdb/cockroach/pkg/storage/engine.mvccPutUsingIter(0x39d80a0, 0xc035793410, 0x7f90dd2ff100, 0xc026826240, 0x3a177c0, 0xc0466a8268, 0xc0380f3580, 0xc03a0f3840, 0xa, 0x14, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/mvcc.go:1104 +0x164 fp=0xc048732a80 sp=0xc0487329a0 pc=0x133c8a4
github.com/cockroachdb/cockroach/pkg/storage/engine.MVCCDelete(0x39d80a0, 0xc035793410, 0x7f90dc29b878, 0xc026826240, 0xc0380f3580, 0xc03a0f3840, 0xa, 0x14, 0x158b6fcf00e5b32e, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/mvcc.go:1074 +0x1c1 fp=0xc048732b50 sp=0xc048732a80 pc=0x133c6e1
github.com/cockroachdb/cockroach/pkg/storage/batcheval.Delete(0x39d80a0, 0xc035793410, 0x7f90dc29b878, 0xc026826240, 0x3a1f500, 0xc0053bfb00, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/batcheval/cmd_delete.go:36 +0x123 fp=0xc048732c28 sp=0xc048732b50 pc=0x14f2fc3
github.com/cockroachdb/cockroach/pkg/storage.evaluateCommand(0x39d80a0, 0xc035793410, 0xc0266b6f50, 0x8, 0xa, 0x7f90dc29b878, 0xc026826240, 0x3a1f500, 0xc0053bfb00, 0xc0380f3580, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_evaluate.go:393 +0x280 fp=0xc048733320 sp=0xc048732c28 pc=0x195e060
github.com/cockroachdb/cockroach/pkg/storage.evaluateBatch(0x39d80a0, 0xc035793410, 0xc0266b6f50, 0x8, 0x7f90dc29b878, 0xc026826240, 0x3a1f500, 0xc0053bfb00, 0xc0380f3580, 0x158b6fcf00e5b32e, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_evaluate.go:225 +0x31d fp=0xc048733dd0 sp=0xc048733320 pc=0x195d04d
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).evaluateWriteBatchWithLocalRetries(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0xc0266b6f50, 0x8, 0x3a1f500, 0xc0053bfb00, 0xc0380f3580, 0x158b6fcf00e5b32e, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_write.go:447 +0x224 fp=0xc0487341c8 sp=0xc048733dd0 pc=0x1992804
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).evaluateWriteBatch(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0xc0266b6f50, 0x8, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_write.go:385 +0x215 fp=0xc0487350a8 sp=0xc0487341c8 pc=0x19912d5
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).evaluateProposal(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0xc0266b6f50, 0x8, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go:846 +0x174 fp=0xc0487358a0 sp=0xc0487350a8 pc=0x196a494
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).requestToProposal(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0xc0266b6f50, 0x8, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go:943 +0xec fp=0xc0487359b0 sp=0xc0487358a0 pc=0x196afec
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).propose(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0x158b6d62bddb4ad1, 0x0, 0x0, 0x200000002, 0x3, 0x0, 0xc01394f7b0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:147 +0x4ad fp=0xc048736128 sp=0xc0487359b0 pc=0x196ca8d
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).tryExecuteWriteBatch(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, 0xc017500600, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_write.go:171 +0x68b fp=0xc048736860 sp=0xc048736128 pc=0x198fd5b
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).executeWriteBatch(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, 0xc017500600, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_write.go:51 +0x7a fp=0xc048736938 sp=0xc048736860 pc=0x198f59a
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).sendWithRangeID(0xc0053bfb00, 0x39d80a0, 0xc035793410, 0x1a5, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:514 +0x6a6 fp=0xc048736ab0 sp=0xc048736938 pc=0x1947b06
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).Send(0xc0053bfb00, 0x39d80a0, 0xc0357933b0, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, 0xc017500600, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:462 +0x83 fp=0xc048736b70 sp=0xc048736ab0 pc=0x1947423
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Send(0xc0010e8000, 0x39d80a0, 0xc0357933b0, 0x158b6fcf00e5b32e, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, 0xc017500600, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2896 +0x70c fp=0xc0487370a8 sp=0xc048736b70 pc=0x19a789c
github.com/cockroachdb/cockroach/pkg/storage.(*Stores).Send(0xc0006a8be0, 0x39d80a0, 0xc035793350, 0x0, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, 0xc017500600, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/stores.go:185 +0xcb fp=0xc048737198 sp=0xc0487370a8 pc=0x19c20bb
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x39d80a0, 0xc035793350, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:981 +0x1b0 fp=0xc0487372a8 sp=0xc048737198 pc=0x24b4df0
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc0001e10e0, 0x39d80a0, 0xc035793350, 0x3281312, 0x10, 0xc048737338, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:304 +0xee fp=0xc0487372e0 sp=0xc0487372a8 pc=0x1084d5e
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc000c7a000, 0x39d80a0, 0xc035793350, 0xc024fb1580, 0xc035793350, 0x38, 0x20300d)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:968 +0x16a fp=0xc048737368 sp=0xc0487372e0 pc=0x248e2fa
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc000c7a000, 0x39d80a0, 0xc035793350, 0xc024fb1580, 0x0, 0xc048737438, 0x7e4f2c)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:1009 +0x9c fp=0xc0487373e8 sp=0xc048737368 pc=0x248e40c
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.Batch(0x39ab8a0, 0xc000c7a000, 0x39d80a0, 0xc0357932f0, 0xc024fb1580, 0x0, 0x0, 0x0, 0xc035793140, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:456 +0x4f fp=0xc048737430 sp=0xc0487373e8 pc=0x1521fff
github.com/cockroachdb/cockroach/pkg/rpc.(*internalClientAdapter).Batch(0xc000220ab0, 0x39d80a0, 0xc0357932f0, 0xc024fb1580, 0x0, 0x0, 0x0, 0x158b6fcf00e5b32e, 0x0, 0x0)
    <autogenerated>:1 +0x99 fp=0xc048737498 sp=0xc048737430 pc=0x152da59
github.com/cockroachdb/cockroach/pkg/kv.(*grpcTransport).sendBatch(0xc0357932c0, 0x39d80a0, 0xc0357932f0, 0x2, 0x39b6fe0, 0xc000220ab0, 0x0, 0x0, 0x200000002, 0x3, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:239 +0x126 fp=0xc048737550 sp=0xc048737498 pc=0x157bb76
github.com/cockroachdb/cockroach/pkg/kv.(*grpcTransport).SendNext.func1()
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:207 +0xa1 fp=0xc048737638 sp=0xc048737550 pc=0x158fa01
github.com/cockroachdb/cockroach/pkg/kv.withMarshalingDebugging(0x39d80a0, 0xc0357932f0, 0x0, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, 0xc017500600, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:188 +0xab fp=0xc048737670 sp=0xc048737638 pc=0x157b70b
github.com/cockroachdb/cockroach/pkg/kv.(*grpcTransport).SendNext(0xc0357932c0, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x200000002, 0x3, 0x1a5, 0x0, 0xc017500600, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:206 +0x211 fp=0xc048737850 sp=0xc048737670 pc=0x157b971
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendToReplicas(0xc00011b320, 0x39d80a0, 0xc035793140, 0xc00011b370, 0x1a5, 0xc02b9875e0, 0x3, 0x3, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1363 +0x2d3 fp=0xc048737bf8 sp=0xc048737850 pc=0x15719d3
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendRPC(0xc00011b320, 0x39d80a0, 0xc035793140, 0x1a5, 0xc02b9875e0, 0x3, 0x3, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:416 +0x244 fp=0xc048737d18 sp=0xc048737bf8 pc=0x156bcb4
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendSingleRange(0xc00011b320, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500600, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:496 +0x221 fp=0xc048737e98 sp=0xc048737d18 pc=0x156c231
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc00011b320, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500600, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1139 +0x322 fp=0xc048738448 sp=0xc048737e98 pc=0x156fb22
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc00011b320, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500600, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:962 +0x8b3 fp=0xc048738810 sp=0xc048738448 pc=0x156e953
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).Send(0xc00011b320, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:710 +0x48b fp=0xc048738af8 sp=0xc048738810 pc=0x156d7eb
github.com/cockroachdb/cockroach/pkg/kv.(*txnLockGatekeeper).SendLocked(0xc038e6f190, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:234 +0xe8 fp=0xc048738bb0 sp=0xc048738af8 pc=0x157c9e8
github.com/cockroachdb/cockroach/pkg/kv.(*txnMetricRecorder).SendLocked(0xc038e6f158, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_metric_recorder.go:51 +0xa2 fp=0xc048738c68 sp=0xc048738bb0 pc=0x1587b22
github.com/cockroachdb/cockroach/pkg/kv.(*txnCommitter).SendLocked(0xc038e6f148, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_committer.go:44 +0x219 fp=0xc048738d28 sp=0xc048738c68 pc=0x1583229
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc038e6f0c8, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:160 +0x83 fp=0xc048738e08 sp=0xc048738d28 pc=0x158a433
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).SendLocked(0xc038e6f0c8, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:101 +0xf9 fp=0xc048738ef8 sp=0xc048738e08 pc=0x1589f09
github.com/cockroachdb/cockroach/pkg/kv.(*txnPipeliner).SendLocked(0xc038e6f040, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_pipeliner.go:182 +0xf9 fp=0xc048739010 sp=0xc048738ef8 pc=0x1587fe9
github.com/cockroachdb/cockroach/pkg/kv.(*txnIntentCollector).SendLocked(0xc038e6f000, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_intent_collector.go:101 +0x43d fp=0xc0487391a0 sp=0xc048739010 pc=0x15860dd
github.com/cockroachdb/cockroach/pkg/kv.(*txnSeqNumAllocator).SendLocked(0xc038e6efe8, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_seq_num_allocator.go:92 +0x23b fp=0xc0487392d0 sp=0xc0487391a0 pc=0x1589ccb
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeater).SendLocked(0xc038e6ef38, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc017500500, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeater.go:224 +0x27b fp=0xc048739438 sp=0xc0487392d0 pc=0x1583b9b
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).Send(0xc038e6ed80, 0x39d80a0, 0xc035793140, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:762 +0x591 fp=0xc048739788 sp=0xc048739438 pc=0x157f791
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).sendUsingSender(0xc0005ae300, 0x39d80a0, 0xc02c3bef90, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:622 +0x119 fp=0xc048739858 sp=0xc048739788 pc=0x116e2f9
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Send(0xc019f01e60, 0x39d80a0, 0xc02c3bef90, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:791 +0x13c fp=0xc0487399b8 sp=0xc048739858 pc=0x117854c
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Send-fm(0x39d80a0, 0xc02c3bef90, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:485 +0x76 fp=0xc048739a70 sp=0xc0487399b8 pc=0x117c266
github.com/cockroachdb/cockroach/pkg/internal/client.sendAndFill(0x39d80a0, 0xc02c3bef90, 0xc048739bd8, 0xc035c10a00, 0xc025ea7b00, 0xc02c199640)
    /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:547 +0xeb fp=0xc048739ba8 sp=0xc048739a70 pc=0x116dd1b
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Run(0xc019f01e60, 0x39d80a0, 0xc02c3bef90, 0xc035c10a00, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:485 +0xcf fp=0xc048739bf8 sp=0xc048739ba8 pc=0x1176a7f
github.com/cockroachdb/cockroach/pkg/sql.(*tableWriterBase).finalize(0xc027700a28, 0x39d80a0, 0xc02c3bef90, 0xc0443d7400, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/tablewriter.go:164 +0xed fp=0xc048739c38 sp=0xc048739bf8 pc=0x20de34d
github.com/cockroachdb/cockroach/pkg/sql.(*tableUpdater).finalize(0xc027700a28, 0x39d80a0, 0xc02c3bef90, 0xc035a85600, 0xc009b8f450, 0xc037bec900, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/tablewriter_update.go:68 +0x48 fp=0xc048739c78 sp=0xc048739c38 pc=0x20e10d8
github.com/cockroachdb/cockroach/pkg/sql.(*updateNode).BatchedNext(0xc027700a00, 0x39d80a0, 0xc02c3bef90, 0xc035a85600, 0xc009b8f450, 0xc02c3bf080, 0x0, 0x39d80a0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/update.go:572 +0x23e fp=0xc048739ce0 sp=0xc048739c78 pc=0x20f322e
github.com/cockroachdb/cockroach/pkg/sql.(*serializeNode).Next(0xc02c3bec90, 0x39d80a0, 0xc02c3bef90, 0xc035a85600, 0xc009b8f450, 0x0, 0xc02c3bf080, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_batch.go:123 +0xa1 fp=0xc048739d30 sp=0xc048739ce0 pc=0x2086cd1
github.com/cockroachdb/cockroach/pkg/sql.(*planNodeToRowSource).Next(0xc0207b2000, 0xc009b8f450, 0x39d9e60, 0xc02c3bec90, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_node_to_row_source.go:178 +0x447 fp=0xc048739e60 sp=0xc048739d30 pc=0x2089b87
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x39d80a0, 0xc02c3bef90, 0x39e38a0, 0xc0207b2000, 0x39c9520, 0xc031fc0700)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:175 +0x35 fp=0xc048739ea0 sp=0xc048739e60 pc=0x1da2205
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc0207b2000, 0x39d80a0, 0xc02c3bef90)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:801 +0x92 fp=0xc048739ee0 sp=0xc048739ea0 pc=0x1dd2df2
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).Run(0xc02ab5c700, 0x39d80a0, 0xc02c3bef90, 0x33d6c08, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:624 +0x1ec fp=0xc048739f48 sp=0xc048739ee0 pc=0x1db24dc
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc000175040, 0xc0276d4180, 0xc019f01e60, 0xc04873a898, 0xc049f1c2c0, 0xc009b8f510, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:252 +0x8ad fp=0xc04873a788 sp=0xc048739f48 pc=0x200944d
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun(0xc000175040, 0x39d80a0, 0xc02c3be4b0, 0xc009b8f510, 0xc0276d4180, 0xc019f01e60, 0x39d9ca0, 0xc02eced680, 0xc049f1c2c0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:839 +0x227 fp=0xc04873aa48 sp=0xc04873a788 pc=0x200cb37
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine(0xc009b8f100, 0x39d80a0, 0xc02c3be4b0, 0xc009b8f450, 0x3, 0x7f90dc2d9940, 0xc02f455760, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1100 +0x283 fp=0xc04873aad8 sp=0xc04873aa48 pc=0x1fb3ec3
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc009b8f100, 0x39d80a0, 0xc02c3be4b0, 0xc009b8f450, 0x7f90dc2d9940, 0xc02f455760, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:945 +0x658 fp=0xc04873abd8 sp=0xc04873aad8 pc=0x1fb2dc8
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc009b8f100, 0x39d80a0, 0xc02c3be4b0, 0x39de6e0, 0xc03aa47030, 0xc04f122bdd, 0xe4, 0x0, 0xc013029720, 0x2, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:457 +0xdd4 fp=0xc04873b3b8 sp=0xc04873abd8 pc=0x1faeeb4
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc009b8f100, 0x39d80a0, 0xc02c3be4b0, 0x39de6e0, 0xc03aa47030, 0xc04f122bdd, 0xe4, 0x0, 0xc013029720, 0x2, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:103 +0x610 fp=0xc04873b660 sp=0xc04873b3b8 pc=0x1fadc60
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc009b8f100, 0x39d7fe0, 0xc011a9e940, 0xc0006952f8, 0x5400, 0x15000, 0xc000695390, 0xc0117fa4f0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1237 +0x140b fp=0xc04873bec8 sp=0xc04873b660 pc=0x1fa3d3b
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc000b2e000, 0x39d7fe0, 0xc011a9e940, 0xc009b8f100, 0x5400, 0x15000, 0xc000695390, 0xc0117fa4f0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:429 +0xce fp=0xc04873bf28 sp=0xc04873bec8 pc=0x1f9f6be
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl.func4(0xc000b2e000, 0xc040479600, 0xc0110b85d4, 0xc0117fa4f0, 0x39d7fe0, 0xc011a9e940, 0xc009b8f100, 0x5400, 0x15000, 0xc000695390, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:332 +0xcf fp=0xc04873bf88 sp=0xc04873bf28 pc=0x22201bf
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc04873bf90 sp=0xc04873bf88 pc=0x75aeb1
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:315 +0x1040
tbg commented 5 years ago

The goroutine dump shows that everyone and their dog is in runtime.gopark. The machines here are n1-highcpu-16, which have ~14424 MB of RAM, so judging from the last stats printout we're not even running out of memory:

I190313 06:10:26.345689 356 server/status/runtime.go:464 [n2] runtime stats: 6.0 GiB RSS, 13330 goroutines, 582 MiB/271 MiB/1.4 GiB GO alloc/idle/total, 4.0 GiB/4.8 GiB CGO alloc/total, 194757.3 CGO/sec, 905.2/160.2 %(u/s)time, 0.0 %gc (7x), 91 MiB/116 MiB (r/w)net

panicparse is unfortunately not good at categorizing these; it only ever eats the first line of each stack trace, and I'm not sure why. I'll spend a few minutes trying to figure out why it does that, which will be much easier than hand-combing 13k goroutines.

tbg commented 5 years ago

Ah, needed to run sed -e 's/ fp=.*//g' first (this isn't a panic dump, so it has some extra info that trips up pp).
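For reference, a minimal sketch of that cleanup step (the dump filename and the pp binary name here are assumptions, adjust to taste):

```shell
# This isn't a panic dump, so each frame carries trailing fp=/sp=/pc=
# register fields that confuse panicparse; strip everything from " fp="
# onward before piping the dump through it.
sed -e 's/ fp=.*//g' goroutine-dump.txt | pp
```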

The goroutines with notable multiplicity are

Routine
1707: chan receive [8~10 minutes] [Created by server.go:678 (*Server).serveStreams.func1({[] [] false})]
Stack
- proc.go:302 gopark({[{54378816 #90} {824748121496 *} {5901 *} {3 }] [] false})
- proc.go:308 goparkunlock({[{824748121496 *} {5901 } {3 }] [] false})
- chan.go:520 chanrecv({[{824748121408 *} {824947137008 *} {1 } {824947136736 *}] [] false})
- chan.go:402 chanrecv1({[{824748121408 *} {824947137008 *}] [] false})
- replica_rangefeed.go:244 (*Replica).RangeFeed({[{825021269120 *} {825014300896 *} {60784256 #147} {824829156944 *} {53000873 #66} {20 } {60770336 #144} {824642466016 #1468} {0 }] [] false})
- store.go:3043 (*Store).RangeFeed({[{824651448320 #2475} {825014300896 *} {60784256 #147} {824829156944 *} {0 }] [] false})
- stores.go:210 (*Stores).RangeFeed({[{824640703456 #1421} {825014300896 *} {60784256 #147} {824829156944 *} {16 }] [] false})
- node.go:1089 (*Node).RangeFeed({[{824646803456 #1856} {825014300896 *} {60784256 #147} {824829156944 *} {824646803456 #1856} {824726518816 #4322}] [] false})
- api.pb.go:9396 _Internal_RangeFeed_Handler({[{52463840 #56} {824646803456 #1856} {60771008 #145} {824771164000 *} {7383496 } {32 }] [] false})
- context.go:239 NewServerWithInterceptor.func3({[{52463840 #56} {824646803456 #1856} {60771008 #145} {824771164000 *} {824839464896 *} {54355744 #75} {60657056 #127} {824635244736 #490}] [] false})
- server.go:1167 (*Server).processStreamingRPC({[{824643576256 #1623} {60820256 #155} {824727666688 #4588} {824723819776 *} {824644294416 #1676} {90820224 #166} {0 } {0 } {0 }] [] false})
- server.go:1253 (*Server).handleStream({[{824643576256 #1623} {60820256 #155} {824727666688 #4588} {824723819776 *} {0 }] [] false})
- server.go:680 (*Server).serveStreams.func1.1({[{824637933904 #1178} {824643576256 #1623} {60820256 #155} {824727666688 #4588} {824723819776 *}] [] false})
- asm_amd64.s:1333 goexit({[] [] false})
666: IO wait [Created by net.go:136 (*Server).ServeWith({[] [] false})]
Stack
- proc.go:302 gopark({[{54378768 #89} {140260017864928 *} {824634186498 *} {5 }] [] false})
- netpoll.go:366 netpollblock({[{140260017864888 *} {114 } {4096 }] [] false})
- netpoll.go:173 runtime_pollWait({[{140260017864888 *} {114 } {825330931360 *}] [] false})
- fd_poll_runtime.go:85 (*pollDesc).wait({[{824851198232 *} {114 } {18446744073709551360 } {60448256 #104} {91100784 #170}] [] false})
- fd_poll_runtime.go:90 (*pollDesc).waitRead({[{824851198232 *} {824970498048 *} {4096 } {4096 }] [] false})
- fd_unix.go:169 (*FD).Read({[{824851198208 *} {824970498048 *} {4096 } {4096 } {0 } {0 } {0 }] [] false})
- fd_unix.go:202 (*netFD).Read({[{824851198208 *} {824970498048 *} {4096 } {4096 } {0 } {825330931672 *} {8954872 }] [] false})
- net.go:177 (*conn).Read({[{824957158528 *} {824970498048 *} {4096 } {4096 } {0 } {0 } {0 }] [] false})
- cmux.go:218 (*MuxConn).Read({[{824727436640 *} {824970498048 *} {4096 } {4096 } {0 } {0 } {0 }] [] false})
- conn.go:1622 (*readTimeoutConn).Read({[{824638135040 *} {824970498048 *} {4096 } {4096 } {0 } {0 } {0 }] [] false})
- bufio.go:100 (*Reader).fill({[{824747410504 *}] [] false})
- bufio.go:242 (*Reader).ReadByte({[{824747410504 *} {824747411984 *} {3418157275668 *} {95255200 *}] [] false})
- encoding.go:120 (*ReadBuffer).ReadTypedMsg({[{824747412264 *} {60544608 #117} {824747410504 *} {60519136 #115} {95595432 #315} {0 } {0 }] [] false})
- conn.go:346 (*conn).serveImpl({[{824747410432 *} {60653536 #122} {824723358080 *} {825000127392 *} {824645443584 #1847} {21504 } {86016 } {824640623504 #1418} {824635691232 #494} {0 }] [] true})
- conn.go:168 serveConn({[{60653536 #122} {824843083712 *} {60803072 #153} {824727436640 *} {824724590601 *} {4 } {824749649376 *} {60513376 #113} {824777849360 *} {16384 }] [] true})
- server.go:518 (*Server).ServeConn({[{824640623104 #1416} {60653536 #122} {824843083712 *} {60803072 #153} {824727436640 *} {0 } {0 }] [] false})
- server.go:1712 (*Server).Start.func20.1({[{60803072 #153} {824727436640 *}] [] false})
- net.go:139 (*Server).ServeWith.func1({[{824635691232 #494} {60653728 #2} {824650132720 #2223} {824639488760 #1387} {60803072 #153} {824727436640 *} {824650132768 #2224}] [] false})
- asm_amd64.s:1333 goexit({[] [] false})
Routine
1504: select [0~9 minutes] [Created by errgroup.go:54 (*Group).Go({[] [] false})]
Stack
- proc.go:302 gopark({[{54378880 #1} {0 } {6153 } {1 }] [] false})
- select.go:313 selectgo({[{824841472296 *} {824841472224 *} {2 } {0 } {1 *}] [] false})
- context.go:470 rangeFeedClientAdapter.Recv({[{60653536 #122} {824826666560 *} {824957838752 *} {824957838848 *} {16 *} {824969093792 *} {825016820944 *}] [] false})
- <autogenerated>:1 (*rangeFeedClientAdapter).Recv({[{824826616480 *} {824841472632 *} {2 } {0 }] [] false})
- dist_sender_rangefeed.go:241 (*DistSender).singleRangeFeed({[{824634880800 #486} {60653536 #122} {825016820928 *} {1552456829945648143 #106722} {0 } {0 } {0 } {0 } {0 } {0 }] [] true})
- dist_sender_rangefeed.go:146 (*DistSender).partialRangeFeed({[{824634880800 #486} {60653536 #122} {825016820928 *} {1552456829945648143 #106722} {0 } {0 } {0 } {0 } {0 } {0 }] [] true})
- dist_sender_rangefeed.go:67 (*DistSender).RangeFeed.func1.1({[{60653536 #122} {825016820928 *} {60653536 *} {824650837904 *}] [] false})
- ctxgroup.go:170 Group.GoCtx.func1({[{824633720840 *} {54379232 #91}] [] false})
- errgroup.go:57 (*Group).Go.func1({[{825021767776 *} {824835422304 *}] [] false})
- asm_amd64.s:1333 goexit({[] [] false})

Routine
1504: chan receive [8~10 minutes] [Created by context.go:522 internalClientAdapter.RangeFeed({[] [] false})]
Stack
- proc.go:302 gopark({[{54378816 #90} {824723762744 *} {5901 *} {3 }] [] false})
- proc.go:308 goparkunlock({[{824723762744 *} {5901 } {3 }] [] false})
- chan.go:520 chanrecv({[{824723762656 *} {824837369248 *} {1 } {824837368976 *}] [] false})
- chan.go:402 chanrecv1({[{824723762656 *} {824837369248 *}] [] false})
- replica_rangefeed.go:244 (*Replica).RangeFeed({[{824760885248 *} {825013584352 *} {60797312 #148} {824722335904 *} {53000873 #66} {20 } {60770336 #144} {824642466016 #1468} {0 }] [] false})
- store.go:3043 (*Store).RangeFeed({[{824651448320 #2475} {825013584352 *} {60797312 #148} {824722335904 *} {0 }] [] false})
- stores.go:210 (*Stores).RangeFeed({[{824640703456 #1421} {825013584352 *} {60797312 #148} {824722335904 *} {824729458544 *}] [] false})
- node.go:1089 (*Node).RangeFeed({[{824646803456 #1856} {825013584352 *} {60797312 #148} {824722335904 *} {824729458552 *} {22427082 *}] [] false})
- context.go:524 internalClientAdapter.RangeFeed.func1({[{824826462640 *} {60471456 #106} {824646803456 #1856} {825013584352 *} {60653536 #122} {824819689024 *} {824723762464 *} {824723762560 *}] [] false})
- asm_amd64.s:1333 goexit({[] [] false})
tbg commented 5 years ago
3211: select [0~9 minutes] [Created by stopper.go:320 (*Stopper).RunAsyncTask({[] [] false})]
Stack
- proc.go:302 gopark({[{54378880 #1} {0 } {6153 } {1 }] [] false})
- select.go:313 selectgo({[{824989961728 *} {824989961668 *} {3 } {0 } {1 *}] [] false})
- registry.go:184 (*registration).outputLoop({[{824763442272 *} {60653536 #122} {824947025984 *} {824947025984 *} {825019916352 *}] [] false})
- registry.go:201 (*registration).runOutputLoop({[{824763442272 *} {60653536 #122} {824947025984 *}] [] false})
- processor.go:220 (*Processor).Start.func1.1({[{60653536 #122} {824947025728 *}] [] false})
- stopper.go:325 (*Stopper).RunAsyncTask.func1({[{824635691232 #494} {60653536 #122} {824947025728 *} {824770253952 *} {30 } {0 } {0 } {825005284864 *}] [] false})
- asm_amd64.s:1333 goexit({[] [] false})
641: sync.Cond.Wait [Created by conn.go:315 (*conn).serveImpl({[] [] false})]
Stack
- proc.go:302 gopark({[{54378816 #90} {824723358424 *} {6675 *} {3 }] [] false})
- proc.go:308 goparkunlock({[{824723358424 *} {6675 } {3 }] [] false})
- sema.go:510 runtime_notifyListWait({[{824723358416 *} {1660 *}] [] false})
- cond.go:56 (*Cond).Wait({[{824723358400 *}] [] false})
- conn_io.go:429 (*StmtBuf).curCmd({[{824747414288 *} {0 } {0 } {0 } {0 } {0 }] [] false})
- conn_executor.go:1143 (*connExecutor).run({[{824822563712 *} {60653536 #122} {824739690304 *} {824640623352 #1417} {21504 } {86016 } {824640623504 #1418} {825045412464 *} {0 } {0 }] [] false})
- conn_executor.go:429 (*Server).ServeConn({[{824645443584 #1847} {60653536 #122} {824739690304 *} {824822563712 *} {21504 } {86016 } {824640623504 #1418} {825045412464 *} {0 } {0 }] [] false})
- conn.go:332 (*conn).serveImpl.func4({[{824645443584 #1847} {824747412736 *} {825044716048 *} {825045412464 *} {60653536 #122} {824739690304 *} {824822563712 *} {21504 } {86016 } {824640623504 #1418}] [] true})
- asm_amd64.s:1333 goexit({[] [] false})

Anyway, what this looks like is that things went south the moment CDC started doing its thing, though it's unclear to me exactly how they went south.

There's a GC in progress, and look at that idle time:

goroutine 2 [force gc (idle), 57 minutes]:
runtime.gopark(0x33dc140, 0x5acf5c0, 0x1410, 0x1)
        /usr/local/go/src/runtime/proc.go:302 +0xeb fp=0xc00008af80 sp=0xc00008af60 pc=0x72c6db
runtime.goparkunlock(0x5acf5c0, 0x1410, 0x1)
        /usr/local/go/src/runtime/proc.go:308 +0x53 fp=0xc00008afb0 sp=0xc00008af80 pc=0x72c783
runtime.forcegchelper()
        /usr/local/go/src/runtime/proc.go:251 +0xb3 fp=0xc00008afe0 sp=0xc00008afb0 pc=0x72c553
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc00008afe8 sp=0xc00008afe0 pc=0x75aeb1
created by runtime.init.4
        /usr/local/go/src/runtime/proc.go:240 +0x35

goroutine 3 [GC sweep wait]:
runtime.gopark(0x33dc140, 0x5ad60a0, 0x75140c, 0x1)
        /usr/local/go/src/runtime/proc.go:302 +0xeb fp=0xc00008b780 sp=0xc00008b760 pc=0x72c6db
runtime.goparkunlock(0x5ad60a0, 0x398140c, 0x1)
        /usr/local/go/src/runtime/proc.go:308 +0x53 fp=0xc00008b7b0 sp=0xc00008b780 pc=0x72c783
runtime.bgsweep(0xc0000b4000)
        /usr/local/go/src/runtime/mgcsweep.go:71 +0x102 fp=0xc00008b7d8 sp=0xc00008b7b0 pc=0x71e352
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc00008b7e0 sp=0xc00008b7d8 pc=0x75aeb1
created by runtime.gcenable
        /usr/local/go/src/runtime/mgc.go:216 +0x58

goroutine 4 [finalizer wait, 57 minutes]:
runtime.gopark(0x33dc140, 0x5b2ab08, 0x140f, 0x1)
        /usr/local/go/src/runtime/proc.go:302 +0xeb fp=0xc00008bf28 sp=0xc00008bf08 pc=0x72c6db
runtime.goparkunlock(0x5b2ab08, 0x33d140f, 0x1)
        /usr/local/go/src/runtime/proc.go:308 +0x53 fp=0xc00008bf58 sp=0xc00008bf28 pc=0x72c783
runtime.runfinq()
        /usr/local/go/src/runtime/mfinal.go:175 +0x99 fp=0xc00008bfe0 sp=0xc00008bf58 pc=0x715c69
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc00008bfe8 sp=0xc00008bfe0 pc=0x75aeb1
created by runtime.createfing
        /usr/local/go/src/runtime/mfinal.go:156 +0x61

but all the GC workers are idle (there are about a dozen of those):

goroutine 11 [GC worker (idle)]:
runtime.gopark(0x33dbff0, 0xc000494020, 0x1417, 0x0)
        /usr/local/go/src/runtime/proc.go:302 +0xeb fp=0xc00049af60 sp=0xc00049af40 pc=0x72c6db
runtime.gcBgMarkWorker(0xc00007ca00)
        /usr/local/go/src/runtime/mgc.go:1772 +0xfc fp=0xc00049afd8 sp=0xc00049af60 pc=0x71955c
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc00049afe0 sp=0xc00049afd8 pc=0x75aeb1
created by runtime.gcBgMarkStartWorkers
        /usr/local/go/src/runtime/mgc.go:1720 +0x77

Approximately 20 goroutines are stuck waiting for the Go RocksDB sync loop (none for long). The handful of goroutines in cgo seemed runnable.

danhhz commented 5 years ago

The high memory usage is probably a symptom of what I just now discovered and wrote up in https://github.com/cockroachdb/cockroach/issues/35114#issuecomment-472506689. The summary is that we end up in a loop, unable to finish a catchup scan, emitting the same events over and over, and the changefeed can't keep up.

FYI, RangeFeeds currently use 2 goroutines per range on the server side and 1 per range on the client side (or something like that; it's definitely O(ranges)), so a high goroutine count is expected. I haven't looked specifically at whether that's what's going on here, but it could be expected.
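As a back-of-the-envelope check (the range count below is an assumed illustrative figure, not measured from this cluster), O(ranges) goroutine counts are consistent with the 13330 goroutines in the runtime stats line above:

```shell
# ~2 server-side + ~1 client-side RangeFeed goroutines per range, per the
# comment above; 4000 ranges is a hypothetical figure for a tpcc-1000 cluster.
ranges=4000
per_range=3
echo $(( ranges * per_range ))   # prints 12000, the same order of magnitude as the 13330 observed
```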

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/04ef15974085e14f758b20c552a84052eac9fa2b

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1180753&tab=buildLog

The test failed on master:
    cluster.go:1267,cdc.go:625,cdc.go:125,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1180753-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
        stderr:

        stdout:
        0           19.5      0.0      0.0      0.0      0.0 newOrder
           3m36s     1269            0.0            3.1      0.0      0.0      0.0      0.0 orderStatus
           3m36s     1269            0.0           27.4      0.0      0.0      0.0      0.0 payment
           3m36s     1269            0.0            3.1      0.0      0.0      0.0      0.0 stockLevel
        E190316 07:09:09.115193 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.156:26257: connect: connection refused
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
           3m37s     1617            0.0            2.9      0.0      0.0      0.0      0.0 delivery
           3m37s     1617            0.0           19.4      0.0      0.0      0.0      0.0 newOrder
           3m37s     1617            0.0            3.1      0.0      0.0      0.0      0.0 orderStatus
           3m37s     1617            0.0           27.3      0.0      0.0      0.0      0.0 payment
           3m37s     1617            0.0            3.1      0.0      0.0      0.0      0.0 stockLevel
        : signal: killed
    test.go:1202: test timed out (11h49m5.997039001s)
    cluster.go:1626,cdc.go:213,cdc.go:368,test.go:1214: unexpected node event: 3: dead
    test.go:978,asm_amd64.s:523,panic.go:513,log.go:219,cluster.go:926,context.go:90,cluster.go:916,test.go:1159,asm_amd64.s:522,panic.go:397,test.go:774,test.go:760,cluster.go:1626,cdc.go:213,cdc.go:368,test.go:1214: write /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190316-1180753/cdc/tpcc-1000/rangefeed=true/test.log: file already closed
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/c2939ec9a4f15b7fb8683a5805deeb241953e7aa

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1182991&tab=buildLog

The test failed on master:
    cluster.go:1267,cdc.go:625,cdc.go:125,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1182991-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
        stderr:

        stdout:
             0.0      0.0      0.0      0.0      0.0 delivery
          15m41s     1408            0.0            0.0      0.0      0.0      0.0      0.0 newOrder
          15m41s     1408            0.0            0.5      0.0      0.0      0.0      0.0 orderStatus
          15m41s     1408            0.0            0.0      0.0      0.0      0.0      0.0 payment
          15m41s     1408            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
          15m42s     1507            0.0            0.0      0.0      0.0      0.0      0.0 delivery
          15m42s     1507            0.0            0.0      0.0      0.0      0.0      0.0 newOrder
          15m42s     1507            0.0            0.5      0.0      0.0      0.0      0.0 orderStatus
          15m42s     1507            0.0            0.0      0.0      0.0      0.0      0.0 payment
          15m42s     1507            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
        E190318 07:49:40.680374 1 workload/cli/run.go:420  error in payment: dial tcp 10.142.0.73:26257: connect: connection refused
        : signal: killed
    cluster.go:1626,cdc.go:213,cdc.go:368,test.go:1214: dial tcp 35.231.94.136:26257: connect: connection refused
tbg commented 5 years ago

Hey, what happened two comments up (node 3 dead)? No node logs are present, neither fetched nor included in the debug zip. The test.log indicates an attempt to get them wasn't even made. The test ran for 12h, so I assume the cluster expired or something like that. I wish that were clearer (i.e. I hope I get around to improving that UX someday).

https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_WorkloadNightly/1180753:id/cdc/tpcc-1000/rangefeed%3Dtrue/test.log

tbg commented 5 years ago

The last test failure has logs, but the node seems to have just gotten killed, much earlier in. It probably ran out of memory, as these tests tend to these days.

danhhz commented 5 years ago

The last failure has all kinds of crazy stuff: TONS of splits in the logs, node liveness failures, waiting on a latch for 1 minute, Go memory usage growth, and more than 1000 occurrences of publish (1 leases): desc=[{order_line 60 1}]. It does look like the memory usage was ultimately what did it in.

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/f95d45653df6be5587fb9887de241f50b6932000

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1185396&tab=buildLog

The test failed on master:
    cluster.go:1267,cdc.go:625,cdc.go:125,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1185396-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
        stderr:

        stdout:
        l
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
             25s        0            0.0           19.7      0.0      0.0      0.0      0.0 delivery
             25s        0          100.1           61.0   6710.9   6979.3   6979.3   6979.3 newOrder
             25s        0            0.0           21.1      0.0      0.0      0.0      0.0 orderStatus
             25s        0           42.0          200.0  21474.8  22548.6  22548.6  22548.6 payment
             25s        0            0.0           19.9      0.0      0.0      0.0      0.0 stockLevel
             26s        0            0.0           18.9      0.0      0.0      0.0      0.0 delivery
             26s        0          268.7           69.0   7784.6   8053.1   8053.1   8053.1 newOrder
             26s        0            0.0           20.3      0.0      0.0      0.0      0.0 orderStatus
             26s        0           24.0          193.2  22548.6  23622.3  23622.3  23622.3 payment
             26s        0            0.0           19.1      0.0      0.0      0.0      0.0 stockLevel
        : signal: killed
    cluster.go:1626,cdc.go:213,cdc.go:368,test.go:1214: unexpected status: failed
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/25398c010b2af75b11fed189680ea6b9645f0cf5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1199659&tab=buildLog

The test failed on master:
    cluster.go:1193,cdc.go:66,cdc.go:376,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1199659-cdc-tpcc-1000-rangefeed-true:1-3 --encrypt returned:
        stderr:

        stdout:
        LE_RPC_COMPRESSION=false ./cockroach start --insecure --store=path=/mnt/data1/cockroach --log-dir=${HOME}/logs --background --cache=25% --max-sql-memory=25% --port=26257 --http-port=26258 --locality=cloud=gce,region=us-east1,zone=us-east1-b --enterprise-encryption=path=/mnt/data1/cockroach,key=/mnt/data1/cockroach/aes-128.key,old-key=plain >> ${HOME}/logs/cockroach.stdout 2>> ${HOME}/logs/cockroach.stderr || (x=$?; cat ${HOME}/logs/cockroach.stderr; exit $x)
        Connection to 34.73.99.17 closed by remote host.

        github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func7
            /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:397
        github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
            /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1320
        runtime.goexit
            /usr/local/go/src/runtime/asm_amd64.s:1333: 
        I190326 06:27:40.520369 1 cluster_synced.go:1402  command failed
        : exit status 1
danhhz commented 5 years ago

The last one failed in roachprod start.

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/a1e6e9decc9dec15a32bbb6d30efc67ca45a532a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1204585&tab=buildLog

The test failed on release-19.1:
    cluster.go:1267,cdc.go:633,cdc.go:133,cluster.go:1605,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1204585-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
        stderr:

        stdout:
        l
           2m16s        0           17.0           20.2     92.3    151.0    151.0    151.0 delivery
           2m16s        0          237.7          196.9     56.6    151.0    176.2    209.7 newOrder
           2m16s        0           23.0           21.2     12.1     22.0     54.5     54.5 orderStatus
           2m16s        0          249.7          214.1     27.3    159.4    192.9    201.3 payment
           2m16s        0           22.0           21.2    130.0    218.1    243.3    243.3 stockLevel
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
           2m17s        0           14.0           20.1     96.5    130.0    134.2    134.2 delivery
           2m17s        0          206.0          197.0     56.6    121.6    142.6    151.0 newOrder
           2m17s        0           26.0           21.2     11.5     39.8     48.2     48.2 orderStatus
           2m17s        0          213.0          214.0     30.4    100.7    117.4    121.6 payment
           2m17s        0           18.0           21.2    125.8    285.2    285.2    285.2 stockLevel
        : signal: killed
    cluster.go:1626,cdc.go:221,cdc.go:376,test.go:1216: unexpected status: failed
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/3aadd20bbf0940ef65f8b2cdcda498401ba5d9c6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1206925&tab=buildLog

The test failed on release-19.1:
    cluster.go:1293,cdc.go:633,cdc.go:133,cluster.go:1631,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1206925-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
        stderr:

        stdout:
        l
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
             33s        0           29.0           23.7     83.9    142.6    151.0    151.0 delivery
             33s        0          182.0          163.4     48.2    125.8    142.6    159.4 newOrder
             33s        0           20.0           23.2      8.4     35.7     39.8     39.8 orderStatus
             33s        0          247.0          238.5     30.4    104.9    121.6    125.8 payment
             33s        0           32.0           25.2    104.9    159.4    159.4    159.4 stockLevel
             34s        0           19.0           23.6     83.9    142.6    151.0    151.0 delivery
             34s        0          150.0          163.0     52.4    109.1    121.6    130.0 newOrder
             34s        0           22.0           23.1      9.4     19.9     27.3     27.3 orderStatus
             34s        0          245.9          238.7     31.5     88.1    109.1    142.6 payment
             34s        0           32.0           25.4    113.2    167.8    318.8    318.8 stockLevel
        : signal: killed
    cluster.go:1652,cdc.go:221,cdc.go:376,test.go:1223: unexpected status: failed
danhhz commented 5 years ago

Failure is from an error that's not in the whitelist. We haven't backported the switch to a blacklist to release-19.1 yet (#36132).

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/668162cc99e4f3198b663b1abfa51858eeb3ccb8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212251&tab=buildLog

The test failed on master:
    cluster.go:1071,cdc.go:65,cdc.go:376,test.go:1223: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod put teamcity-1212251-cdc-tpcc-1000-rangefeed-true:4 /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/workload ./workload returned:
        stderr:

        stdout:
        teamcity-1212251-cdc-tpcc-1000-rangefeed-true: putting (dist) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/workload ./workload
        ........................................................................................................................
           4: ~ scp -r -C -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa -i /root/.ssh/google_compute_engine /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/workload root@35.231.237.7:./workload
        Connection closed by 35.231.237.7 port 22
        lost connection
        : exit status 1
        I190401 06:26:06.922832 1 cluster_synced.go:927  put /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/workload failed
        : exit status 1
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/a039a93a5cc6eb3f395ceb6f7dc8030985dccc29

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212269&tab=buildLog

The test failed on release-19.1:
    cdc.go:750,cdc.go:223,cdc.go:376,test.go:1223: initial scan did not complete
danhhz commented 5 years ago

"initial scan did not complete". Well, that's a bit concerning.

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/2851c7d56ee4966109691b5c48c73ec8d4cc9847

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1215354&tab=buildLog

The test failed on master:
    cdc.go:859,cdc.go:224,cdc.go:473,test.go:1226: initial scan did not complete
danhhz commented 5 years ago

Finally got a chance to write up my triage of that initial scan failure. Looks like at least one span got stuck. Log from the end of the test:

ccl/changefeedccl/changefeed_processors.go:567  [n1] job 439196333105479681 span /Table/57/1/477/4/-969{-/0} is behind by 2h41m15.2891246s

One odd thing is that the first "behind by" message for this span is

I190401 07:08:59.781789 54191 ccl/changefeedccl/changefeed_processors.go:567  [n1] job 439196333105479681 span /Table/57/1/477/4/-969{-/0} is behind by 40m55.69528427s

but the changefeed was started with a cursor set to "2019-04-01 07:08:03.037043359". There's not a ton we can do based on this without knowing what got stuck. I discussed some additional logging to add with @tbg this morning in our 1:1. PR coming when I get the cycles.

Dunno if this is related, but there are a few "slow heartbeat took" and "failed node liveness heartbeat" logs from when the changefeed starts up (probably the TBI-free initial scans). Maybe we need to adjust the default limit for how many we allow at once.

Also we see tons of health alerts for goroutines, but these are sadly expected with rangefeed-based changefeeds. Probably something to tune here

W190401 07:14:10.240815 458 server/node.go:880  [n1,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:sys.goroutines Value:10092 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
danhhz commented 5 years ago

Looked at the failure from today. On this one, some of the watched ranges never make it to a non-zero resolved timestamp. Quite interestingly, the range in the message changes over time. The place it comes from, spanFrontier, sorts first by time, then by start key, so it looks like things are unblocking very slowly (or the ranges are splitting/merging, but that doesn't seem to be the case from the logs).
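To illustrate that ordering, here's a minimal Go sketch of a span frontier. This is not the real changefeedccl implementation; the types and names are simplified stand-ins. The point is that entries sort first by resolved timestamp, then by start key, so the head entry is the span named in the "is behind by" message, and that span changes as individual spans slowly unblock.

```go
package main

import (
	"fmt"
	"sort"
)

// frontierEntry is a simplified stand-in for a tracked span; the real code
// uses roachpb.Span and hlc.Timestamp.
type frontierEntry struct {
	startKey string
	endKey   string
	resolved int64 // wall-clock nanos
}

type spanFrontier struct {
	entries []frontierEntry
}

// forward bumps a span's resolved timestamp, never moving it backwards.
func (f *spanFrontier) forward(startKey string, ts int64) {
	for i := range f.entries {
		if f.entries[i].startKey == startKey && ts > f.entries[i].resolved {
			f.entries[i].resolved = ts
		}
	}
}

// mostBehind returns the entry that sorts first: lowest resolved timestamp,
// ties broken by start key. This is why the range named in the "behind by"
// log message can change over time as individual spans unblock.
func (f *spanFrontier) mostBehind() frontierEntry {
	sorted := append([]frontierEntry(nil), f.entries...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].resolved != sorted[j].resolved {
			return sorted[i].resolved < sorted[j].resolved
		}
		return sorted[i].startKey < sorted[j].startKey
	})
	return sorted[0]
}

func main() {
	f := &spanFrontier{entries: []frontierEntry{
		{"/Table/57/1/100", "/Table/57/1/200", 10},
		{"/Table/57/1/200", "/Table/57/1/300", 10},
		{"/Table/57/1/300", "/Table/57/1/400", 50},
	}}
	fmt.Println(f.mostBehind().startKey) // lowest ts, ties by start key
	f.forward("/Table/57/1/100", 60)     // first span unblocks
	fmt.Println(f.mostBehind().startKey) // the message now names another span
}
```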

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/1cbf3680129e47bd310640bf32b665662f30faa9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1217781&tab=buildLog

The test failed on release-19.1:
    cdc.go:867,cdc.go:225,cdc.go:481,test.go:1228: initial scan did not complete
cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/c6df752eefe4609b8a5bbada0955f79a2cfb790e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1217763&tab=buildLog

The test failed on master:
    cdc.go:867,cdc.go:225,cdc.go:481,test.go:1228: initial scan did not complete
danhhz commented 5 years ago

Well, it seems like this is easy to repro. I wonder what changed.

danhhz commented 5 years ago

I wrote a new gauge to track how far behind the most behind closed timestamp is and ran the cdc/tpcc-1000 test. The changefeed is doing okay (end-to-end steady latency 16.065162494s; max steady latency so far 23.242592094s), but the gauge is saying that at least one replica on each node is stuck. So presumably it's not one of the ranges being watched by the changefeed. cc @tbg is this expected?

Screen Shot 2019-04-03 at 4 59 00 PM

Note I tested this with a local single node cluster and it was flat. I don't have a screenshot of this, but it also went down when I adjusted the closed timestamp interval down.

image (1)

In progress PR for the gauge which needs some cleanup https://github.com/cockroachdb/cockroach/compare/master...danhhz:cdc_slowrange_log?expand=1

tbg commented 5 years ago

Hmm. I think the root problem here will have to do with the MLAI map. If we don't have data for a range (i.e. we haven't received any MLAI for it which we need to match, and perhaps now the range is idle and isn't sending new ones), then we'll get stuck like that. Typically replicas call .Request to get out of a situation like that, but to be honest I'm struggling a bit to remember how they get into it in the first place.

Any activity on a range leads to an update for that range being sent to all connected peers, so early in the life of the cluster it's conceivable that we're not sending a few updates because the peer is not yet connected. If the range is dormant after that, no new updates will be sent, and so followers that missed the initial one would remain in that state until something calls .Request (which only an incoming request, lease change, etc. will trigger). The code is here:

https://github.com/cockroachdb/cockroach/blob/c9683a39ab25588c7d6210f80622ada15172da8d/pkg/storage/replica_follower_read.go#L58-L60

https://github.com/cockroachdb/cockroach/blob/26f18270400d2a5b049ba4994f791040ed90fc75/pkg/storage/closedts/transport/clients.go#L66-L81
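Roughly, the lookup those two snippets implement can be sketched like this (the names here are invented, not the real API): the follower consults a per-epoch store of closed timestamps and MLAIs for the leaseholder node, and an entry is only usable if one exists for the lease's epoch and the follower's range, and the follower has applied up to the recorded MLAI. If the entry was never received (e.g. the range went idle before the peer connected), the follower stays stuck until something calls Request.

```go
package main

import "fmt"

// entry holds what one node published for one epoch: a closed timestamp and
// the min lease applied indexes (MLAIs) per range it covers.
type entry struct {
	closedTS int64
	mlai     map[int64]int64 // rangeID -> min lease applied index
}

// storage is a sketch of a follower's closed timestamp storage for a single
// leaseholder node, keyed by liveness epoch.
type storage struct {
	byEpoch map[int64]entry
}

// maxClosed returns the usable closed timestamp for rangeID under leaseEpoch,
// given the follower's current lease applied index, or false if the follower
// would need to call Request to unstick itself.
func (s *storage) maxClosed(leaseEpoch, rangeID, appliedLAI int64) (int64, bool) {
	e, ok := s.byEpoch[leaseEpoch]
	if !ok {
		return 0, false // no data at all for this epoch
	}
	mlai, ok := e.mlai[rangeID]
	if !ok {
		return 0, false // no MLAI for this range: the update was never received
	}
	if appliedLAI < mlai {
		return 0, false // not yet caught up to the closed timestamp
	}
	return e.closedTS, true
}

func main() {
	s := &storage{byEpoch: map[int64]entry{
		10: {closedTS: 100, mlai: map[int64]int64{1002: 5}},
	}}
	if ts, ok := s.maxClosed(10, 1002, 5); ok {
		fmt.Println("usable closed timestamp:", ts)
	}
	// The lease moved to epoch 11 but no MLAI was published under it: stuck.
	if _, ok := s.maxClosed(11, 1002, 5); !ok {
		fmt.Println("stuck: an explicit Request is needed")
	}
}
```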

I sent a PR https://github.com/cockroachdb/cockroach/pull/36523 to expose a bit more of the closed timestamp in the range status, which you may want to peruse. You can python-grab the lowest ActiveClosedTimestamp and then you know what range to look at.

In the meantime, perhaps @ajwerner or @nvanbenschoten remember when you're expected to need to call .Request. It seems hard to guarantee that you never have to, but I'm not aware of common reasons to have to once the ct clients are fully connected between all nodes.

danhhz commented 5 years ago

Ran another tpcc-1000 with your PR as well as my gauge and saw the following. Dunno what to make of it. I'm 95% sure that when I did this yesterday, it was a solid line for every node. This is still during IMPORT, whereas yesterday I was looking during the CHANGEFEED, so that could be the difference. In fact, I vaguely recall that closedts resets during splits or rebalances or something, which certainly occur during IMPORT.

Screen Shot 2019-04-04 at 9 07 15 AM
tbg commented 5 years ago

That's even weirder if you ask me. This looks like there's a replica most of the time that hasn't made progress "forever", but during the dips that replica isn't there. I'd expect that once we drop, the linear line has to start from that point onwards. Maybe this is about rebalancing? When a new replica is created, for a moment the active maxClosed may be the lease start timestamp, which is basically not moving unless there's a lease transfer (but why? Shouldn't there be an MLAI for that replica in almost all cases?). Are you running the test on more than 3 nodes, i.e. is there even any rebalancing? (I'd start on three nodes to narrow this down.)

You could print the RangeStatus for the "most behind range" printf-style to get a better idea.

danhhz commented 5 years ago

This is a 3 node cluster. The flopping didn't survive a restart of the nodes with more logging, sadly. Now I'm seeing the original behavior. Looks like the most-behind ranges are missing newest_closed_timestamp and are indeed inheriting the lease start (1554392854243628195 for both below).

I190404 21:35:27.069208 297 storage/store.go:4198  [n1,s1] WIP minMaxClosedTS 1554392854.243628195,0 5h47m52.825401835s {"state":{"raft_applied_index":24,"lease_applied_index":8,"desc":{"range_id":256,"start_key":"xImpj/cFdA==","end_key":"xImpj/cFdIg=","replicas":[{"node_id":1,"store_id":1,"replica_id":1},{"node_id":3,"store_id":3,"replica_id":2},{"node_id":2,"store_id":2,"replica_id":3}],"next_replica_id":4,"generation":1},"lease":{"start":{"wall_time":1554392854243628195},"replica":{"node_id":3,"store_id":3,"replica_id":2},"proposed_ts":{"wall_time":1554392854243632815},"epoch":1,"sequence":2},"truncated_state":{"index":10,"term":5},"gc_threshold":{},"stats":{"contains_estimates":false,"last_update_nanos":1554392872454974468,"intent_age":0,"gc_bytes_age":0,"live_bytes":0,"live_count":0,"key_bytes":0,"key_count":0,"val_bytes":0,"val_count":0,"intent_bytes":0,"intent_count":0,"sys_bytes":338,"sys_count":5},"txn_span_gc_threshold":{},"using_applied_state_key":true},"last_index":24,"range_max_bytes":67108864,"newest_closed_timestamp":{"closed_timestamp":{}},"active_closed_timestamp":{"wall_time":1554392854243628195}}

I'd like to clarify something. In a healthy cluster, should this new metric I added be approx the closed timestamp interval? Aka should every replica on every node be getting timestamps closed under normal conditions? Or is that what you were talking about with the .Request stuff? If so, does that need to be called once on each replica to kick stuff off or regularly?
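For context, the arithmetic behind a gauge like the one described above is simple (the metric plumbing is omitted and the names are invented): report how far `now` is ahead of the minimum active closed timestamp across the node's replicas. In a healthy cluster that value should hover near the closed timestamp interval; a replica whose closed timestamp is stuck makes the line grow without bound.

```go
package main

import "fmt"

// mostBehindNanos is a sketch of the gauge: given each replica's active
// closed timestamp (in wall-clock nanos), return how far behind "now" the
// most-behind one is. A stuck replica pins this to an ever-growing value.
func mostBehindNanos(now int64, maxClosedPerReplica []int64) int64 {
	if len(maxClosedPerReplica) == 0 {
		return 0
	}
	minClosed := maxClosedPerReplica[0]
	for _, c := range maxClosedPerReplica[1:] {
		if c < minClosed {
			minClosed = c
		}
	}
	return now - minClosed
}

func main() {
	// Two replicas closing out timestamps normally, one stuck near its
	// lease start: the gauge reports the stuck one.
	fmt.Println(mostBehindNanos(1000, []int64{970, 960, 100})) // 900
}
```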

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/2eb3f60e6f88a4f19ec7472f010d20287488f71d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1227248&tab=buildLog

The test failed on release-19.1:
    cdc.go:869,cdc.go:225,cdc.go:481,test.go:1237: initial scan did not complete
danhhz commented 5 years ago

1) "initial scan did not complete" here and above is actually a bug in the test. The initial scan did complete, but the roachtest only counts it as such once the resolved timestamp moves a second time.

2) I checked our new metric in the debug.zip. It says the range with the min resolved timestamp is indeed way back around when the test started. Sadly, this might be a false alarm, because the range might not be one of the ones watched by the changefeed. However...

3) The range status stuff @tbg added does indicate that at the point the debug.zip was taken, at least some of the ranges had closed timestamps way behind because the entry was missing from the node's closed timestamp storage. Range 1002, which is in table 58 and so definitely being watched by the changefeed:

    "newest_closed_timestamp": {
      "closed_timestamp": {}
    },
    "active_closed_timestamp": {
      "wall_time": 1554704872573832927
    }
Full info for that range:
{
  "span": {
    "start_key": "/Table/58/2/963/10/\"ڻ\\r\\xb0NPD\\\"\\x97^o\\x9f\\x84\\xfc\\x17\\x9f\"/PrefixEnd",
    "end_key": "/Table/58/2/963/10/\"\\xda\\xca,po\\xd7@ҕW\\xb4\\xc8\\xf0\\xf6;\\x82\"/0"
  },
  "raft_state": {
    "replica_id": 1,
    "hard_state": {
      "term": 10,
      "vote": 2,
      "commit": 21
    },
    "lead": 2,
    "state": "StateFollower",
    "applied": 21,
    "progress": null
  },
  "state": {
    "state": {
      "raft_applied_index": 21,
      "lease_applied_index": 5,
      "desc": {
        "range_id": 1002,
        "start_key": "wor3A8OSEtq7DbBOUEQil15vn4T8F58AAg==",
        "end_key": "wor3A8OSEtrKLHBv10DSlVe0yPD2O4IAAYg=",
        "replicas": [
          {
            "node_id": 1,
            "store_id": 1,
            "replica_id": 1
          },
          {
            "node_id": 2,
            "store_id": 2,
            "replica_id": 2
          },
          {
            "node_id": 3,
            "store_id": 3,
            "replica_id": 3
          }
        ],
        "next_replica_id": 4,
        "generation": 1
      },
      "lease": {
        "start": {
          "wall_time": 1554704531959801761
        },
        "replica": {
          "node_id": 2,
          "store_id": 2,
          "replica_id": 2
        },
        "proposed_ts": {
          "wall_time": 1554713910279390153
        },
        "epoch": 2,
        "sequence": 3
      },
      "truncated_state": {
        "index": 10,
        "term": 5
      },
      "gc_threshold": {},
      "stats": {
        "contains_estimates": false,
        "last_update_nanos": 1554704930825095362,
        "intent_age": 0,
        "gc_bytes_age": 0,
        "live_bytes": 0,
        "live_count": 0,
        "key_bytes": 0,
        "key_count": 0,
        "val_bytes": 0,
        "val_count": 0,
        "intent_bytes": 0,
        "intent_count": 0,
        "sys_bytes": 366,
        "sys_count": 4
      },
      "txn_span_gc_threshold": {},
      "using_applied_state_key": true
    },
    "last_index": 21,
    "raft_log_size": 2685,
    "raft_log_size_trusted": true,
    "range_max_bytes": 67108864,
    "newest_closed_timestamp": {
      "closed_timestamp": {}
    },
    "active_closed_timestamp": {
      "wall_time": 1554704872573832927
    }
  },
  "source_node_id": 1,
  "source_store_id": 1,
  "lease_history": [
    {
      "start": {
        "wall_time": 1554704531959801761
      },
      "replica": {
        "node_id": 2,
        "store_id": 2,
        "replica_id": 2
      },
      "proposed_ts": {
        "wall_time": 1554704531959804376
      },
      "epoch": 1,
      "sequence": 2
    },
    {
      "start": {
        "wall_time": 1554704531959801761
      },
      "replica": {
        "node_id": 2,
        "store_id": 2,
        "replica_id": 2
      },
      "proposed_ts": {
        "wall_time": 1554713910279390153
      },
      "epoch": 2,
      "sequence": 3
    }
  ],
  "problems": {},
  "stats": {
    "writes_per_second": 0.0006061913601019598
  },
  "latches_local": {},
  "latches_global": {},
  "lease_status": {
    "lease": {
      "start": {
        "wall_time": 1554704531959801761
      },
      "replica": {
        "node_id": 2,
        "store_id": 2,
        "replica_id": 2
      },
      "proposed_ts": {
        "wall_time": 1554713910279390153
      },
      "epoch": 2,
      "sequence": 3
    },
    "timestamp": {
      "wall_time": 1554713913181054479
    },
    "state": 1,
    "liveness": {
      "node_id": 2,
      "epoch": 2,
      "expiration": {
        "wall_time": 1554713922563714397,
        "logical": 0
      }
    }
  },
  "quiescent": true
}
ajwerner commented 5 years ago

The logic regarding which epoch to use when publishing updates feels sketchy (https://github.com/cockroachdb/cockroach/blob/6ab548361d310481fc724ab9be70c165a0a4487d/pkg/storage/closedts/provider/provider.go#L171). It makes sense that the code does not want to publish a closed timestamp that straddles epochs (hence the logic to publish with the lastEpoch as opposed to the current epoch), but it seems problematic if, as in this case, an epoch change leads to a new lease, which leads to a quiescent range emitting just one MLAI and then never again. If that whole thing straddles two calls to Close, then the emitted MLAI will still be pinned to the old epoch and never observed under the new one.

Maybe we need to publish the entire state of the MLAI map on epoch changes?

cc @tbg

ajwerner commented 5 years ago

I’ll propose three solutions:

1) track epochs with MLAIs

Right now we’re guessing that the current values belong to the epoch that preceded the current call to Close. We have access to the epoch whenever we publish an MLAI, so it wouldn’t be difficult to do. The negatives would be the overhead of potentially tracking multiple maps in the minprop tracker and possibly needing to send multiple updates per call to Close.

2) Publish a complete update with information from all ranges when epochs change

This is heavy and seems like a pretty big refactor, but it could work. We’d need to ensure that these updates don’t get included in the previous epoch. I don’t have a precise vision of how I’d expose this, but as a sketch it could work. It risks being chatty if epochs start flying by, but so does 1).

3) Detect when a closed timestamp is falling behind and explicitly request it.

This doesn’t feel too crazy to implement and seems like it may work, but it is unsatisfying. This solution papers over problems with lost updates in the closed timestamp subsystem without solving them or forcing anybody to understand them. This may be the simplest path forward with the least invasive changes.

1) feels right to me because we only ever read MLAIs relative to an epoch, but 3) seems okay as a quick fix. Maybe we should do 3) and then supplement it with 1) and add metrics/telemetry?

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/57274fc2cd56e5e57957db59acbbcf4f35c118bc

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=cdc/tpcc-1000/rangefeed=true PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1229018&tab=buildLog

The test failed on master:
    cluster.go:1329,cdc.go:748,cdc.go:135,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1229018-cdc-tpcc-1000-rangefeed-true:4 -- ./workload run tpcc --warehouses=1000 --duration=120m --tolerate-errors {pgurl:1-3}  returned:
        stderr:

        stdout:
        : signal: killed
    cluster.go:1688,cdc.go:223,cdc.go:481,test.go:1237: unexpected node event: 2: dead
    cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1688,cdc.go:223,cdc.go:481,test.go:1237: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1229018-cdc-tpcc-1000-rangefeed-true --oneshot --ignore-empty-nodes: exit status 1 4: skipped
        2: dead
        1: 3943
        3: 4058
        Error:  2: dead
tbg commented 5 years ago

So what's the exact scenario here? Is it something like this:

  1. range 1 is quiescent, leaseholder is at liveness epoch 10, all nodes know the last MLAI for epoch 10
  2. lastEpoch=10 in https://github.com/cockroachdb/cockroach/blob/7037b544db927a2693fd94c2effc007a8d0dc2bf/pkg/storage/closedts/provider/provider.go#L166
  3. leaseholder loses liveness, acquires new lease at epoch 13 (or 11, doesn't matter) and the quiescent range grabs a new lease
  4. the provider's closer loop runs again, this time the epoch that comes back is 11, and it contains an MLAI for the lease request
  5. the MLAI is emitted under epoch 10, so it's useless to the receivers, who will need to match the new epoch, and since the range is quiescent, a call to .Request is needed at each peer.

First of all, if there isn't any activity on the range, no MLAI will be emitted no matter what we do, and that will require an explicit call to .Request (or acquiring a lease, mod the timing problem above) anyway: the followers will continue to use the old epoch (from the lease) but that old epoch will be rotated out of the closedts storage in due time. I think for RangeFeed we explicitly call redirectOnOrAcquireLease to trigger "things". If we called .Request immediately after that, wouldn't we avoid the issue above without any further changes? edit: no, the MLAI could just land in the wrong epoch as discussed above.

If we want to make it less likely that calls to Request are necessary, associating epochs to each MLAI seems reasonable (so that the lease change would be associated to epoch 12), but do we even need all that? If we see MLAIs collected (over a closed timestamp Close period) with more than one epoch, won't we automatically know that those for the lower epoch are not going to be used? It seems that what we want to do in this code

https://github.com/cockroachdb/cockroach/blob/69add28cd4c2b4cf3473845e9f3636a99fa653b5/pkg/storage/closedts/minprop/tracker.go#L189-L202

is to ask the proposal to pass the epoch of the lease into the returned closure. Internally, we keep track of the max epoch seen and, if we see a higher epoch, we throw out the stuff we have so far (also, ignore anything new not matching the max epoch after). Then, when Close is called, it returns an epoch that is associated to all entries in the MLAI map, and this is used instead in this code:

https://github.com/cockroachdb/cockroach/blob/6ab548361d310481fc724ab9be70c165a0a4487d/pkg/storage/closedts/provider/provider.go#L162-L174

We still have to verify that the epoch is live at the closed timestamp.
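A rough sketch of that proposal, with invented names (the real tracker lives in pkg/storage/closedts/minprop): proposals report their lease epoch; seeing a higher epoch discards everything collected so far in the interval; lower epochs are ignored; and Close returns the epoch that all entries in the returned MLAI map belong to.

```go
package main

import "fmt"

// tracker is a sketch of the epoch-aware minprop tracker: it only keeps
// MLAIs for the highest lease epoch observed during the current interval.
type tracker struct {
	epoch int64
	mlai  map[int64]int64 // rangeID -> max lease applied index this interval
}

// record notes a proposal's lease applied index under its lease epoch.
func (t *tracker) record(epoch, rangeID, lai int64) {
	switch {
	case epoch > t.epoch:
		// A newer epoch supersedes everything collected so far.
		t.epoch = epoch
		t.mlai = map[int64]int64{rangeID: lai}
	case epoch == t.epoch:
		if lai > t.mlai[rangeID] {
			t.mlai[rangeID] = lai
		}
		// epoch < t.epoch: ignore, a newer epoch has superseded it.
	}
}

// close returns the epoch associated with all entries in the MLAI map, so
// receivers never have to guess which epoch an MLAI belongs to.
func (t *tracker) close() (int64, map[int64]int64) {
	epoch, mlai := t.epoch, t.mlai
	t.mlai = map[int64]int64{}
	return epoch, mlai
}

func main() {
	tr := &tracker{mlai: map[int64]int64{}}
	tr.record(10, 1002, 5)
	tr.record(11, 1002, 1) // a lease under a new epoch supersedes epoch 10
	tr.record(10, 256, 7)  // straggler from the old epoch is dropped
	epoch, mlai := tr.close()
	fmt.Println(epoch, mlai[1002], mlai[256]) // 11 1 0
}
```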

ajwerner commented 5 years ago

First of all, if there isn't any activity on the range, no MLAI will be emitted no matter what we do, and that will require an explicit call to .Request (or acquiring a lease, mod the timing problem above) anyway: the followers will continue to use the old epoch (from the lease) but that old epoch will be rotated out of the closedts storage in due time. I think for RangeFeed we explicitly call redirectOnOrAcquireLease to trigger "things". If we called .Request immediately after that, wouldn't we avoid the issue above without any further changes? edit: no, the MLAI could just land in the wrong epoch as discussed above.

For the scenario in question, we have a RangeFeed established, but then a change in node liveness epoch leads to a new lease, which does lead to an MLAI being produced, but it ends up in the wrong epoch given the lastEpoch logic.

is to ask the proposal to pass the epoch of the lease into the returned closure. Internally, we keep track of the max epoch seen and, if we see a higher epoch, we throw out the stuff we have so far (also, ignore anything new not matching the max epoch after). Then, when Close is called, it returns an epoch that is associated to all entries in the MLAI map, and this is used instead in this code:

This is a good idea! Much simpler than keeping all of the MLAIs corresponding to each epoch. Do you see a reason not to make that change? I'd be happy to type it.

We could, in addition, defensively add code to explicitly Request when we detect in RangeFeed that we're failing to ratchet the resolved timestamp. That call could happen either here or underneath inside the rangefeed.Processor.
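As a sketch of that defensive check (names invented; in the real code the nudge would go through the closed timestamp clients' Request path for the stuck range): if the resolved timestamp hasn't ratcheted within some threshold, explicitly request an update so the MLAI is re-emitted under the current epoch.

```go
package main

import "fmt"

// maybeRequest is a sketch of the watchdog: if the resolved timestamp has
// not advanced within thresholdNanos, invoke the supplied request callback
// (standing in for closedts Request) and report that it did so.
func maybeRequest(nowNanos, lastResolvedNanos, thresholdNanos int64, request func()) bool {
	if nowNanos-lastResolvedNanos > thresholdNanos {
		request() // re-emits the range's MLAI under the current epoch
		return true
	}
	return false
}

func main() {
	requested := 0
	nudge := func() { requested++ }
	maybeRequest(100, 90, 50, nudge) // resolved timestamp is fresh: no-op
	maybeRequest(200, 90, 50, nudge) // stalled past the threshold: Request
	fmt.Println(requested) // 1
}
```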

tbg commented 5 years ago

We could, in addition, defensively add code to explicitly Request when we detect in RangeFeed that we're failing to ratchet the resolved timestamp. That call could happen either here or underneath inside the rangefeed.Processor.

This is probably the only change that we should make for 19.1 (assuming it can fix things, which it should if it runs at some cadence at least). Let's try that out first, just to wrap up 19.1

This is a good idea! Much simpler than keeping all of the MLAIs corresponding to each Epoch. Do you see a reason not to make that change? I'd be happy to type it.

Absolutely, please do! That change should be good unless there's some snag we're missing (in which case you'll find out when you try to do it, hopefully). Let's also think about scenarios in which this isn't enough unless the caller also calls .Request. If the leaseholder changes, the new leaseholder calls EmitMLAI so we should be good (assuming the epoch we're adding to that is the right one, but seems like we should be able to manage). Then the only other scenario is the one in which everything is quiescent and there just isn't a lease for any live nodes' epoch. I guess this could still happen to RangeFeed (if it didn't call Request periodically on failures to close out): it calls redirectOnOrAcquireLease, but immediately after that the epoch for that lease times out, and the range is idle so nothing happens. We end up with a lease from a past epoch and nobody asking to get a new lease. Calling .Request will fix that by re-emitting the MLAI with its updated epoch.

danhhz commented 5 years ago

Fingers crossed this was closed by #36684