cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.2k stars 3.82k forks source link

roachtest: cdc/bank failed #63282

Closed cockroach-teamcity closed 3 years ago

cockroach-teamcity commented 3 years ago

roachtest.cdc/bank failed with artifacts on master @ 704a9fa4ffd144a2bc977bc1a6853a853b17e70e:

          |   505.0s        0           10.0           10.0      3.4      5.0      5.0      5.0 transfer
          |   506.0s        0           10.0           10.0      4.5      5.5      5.5      5.5 transfer
          |   507.0s        0           10.0           10.0      4.2      6.8      6.8      6.8 transfer
          |   508.0s        0           10.0           10.0      3.9      4.2      4.2      4.2 transfer
          |   509.0s        0           10.0           10.0      5.0      7.3      7.3      7.3 transfer
          |   510.0s        0           10.0           10.0      4.2      8.9      8.9      8.9 transfer
          |   511.0s        0           10.0           10.0      3.9      6.0      6.0      6.0 transfer
          |   512.0s        0           10.0           10.0      4.2      5.2      5.2      5.2 transfer
          |   513.0s        0           10.0           10.0      4.7      5.8      5.8      5.8 transfer
          |   514.0s        0           10.0           10.0      4.5     15.7     15.7     15.7 transfer
          |   515.0s        0           10.0           10.0      4.5      5.0      5.0      5.0 transfer
        Wraps: (10) exit status 20
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

    cluster.go:2271,cdc.go:1223,panic.go:969,test.go:298,test.go:273,cluster.go:2705,cdc.go:367,cdc.go:696,test_runner.go:793: output in run_061045.267_n4_rm: cluster.RunL: context canceled
        (1) attached stack trace
          -- stack trace:
          | main.(*cluster).RunE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2349
          | [...repeated from below...]
        Wraps: (2) output in run_061045.267_n4_rm
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*cluster).RunL
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2355
          | main.(*cluster).RunE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2336
          | main.(*cluster).Run
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2269
          | main.kafkaManager.stop
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cdc.go:1223
          | runtime.gopanic
          |     /usr/local/go/src/runtime/panic.go:969
          | main.(*test).fatalfInner
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test.go:298
          | main.(*test).Fatal
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test.go:273
          | main.(*monitor).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2705
          | main.runCDCBank
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cdc.go:367
          | main.registerCDC.func8
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cdc.go:696
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:793
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1374
        Wraps: (4) cluster.RunL
        Wraps: (5) context canceled
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString
Reproduce

To reproduce, try: ```bash #!/usr/bin/env bash #!/bin/bash set -euxo pipefail # NB: invoke this script with "caffeinate" on OSX and/or linux to # prevent runs failing due to standby. sha=$(git rev-parse HEAD) if [ ! -f roachtest.$sha ]; then ./build/builder.sh mkrelease amd64-linux-gnu bin/{roach{prod,test},workload} mv -f bin.docker_amd64/roachprod roachprod.$sha mv -f bin.docker_amd64/workload workload.$sha mv -f bin.docker_amd64/roachtest roachtest.$sha fi if [ ! -f cockroach.$sha ]; then ./build/builder.sh mkrelease amd64-linux-gnu mv cockroach-linux-2.6.32-gnu-amd64 cockroach.$sha fi # NB: consider adding --debug if it is useful to let the clusters # for failed tests survive. ./roachtest.$sha run "cdc/bank" \ --port 8080 --count 10 --cpu-quota 500 \ --roachprod roachprod.${sha} --workload workload.${sha} \ --cockroach ./cockroach.$sha \ --artifacts artifacts.$sha | tee roachtest-stress.${sha} ```

/cc @cockroachdb/cdc

Internal log

``` no author provided ```

See this test on roachdash

Improve this report!

stevendanna commented 3 years ago

This was at least in part clock uncertainty:

Error: pq: remote wall time is too far ahead (904.294361ms) to be trustworthy
Error: COMMAND_PROBLEM: exit status 1
(1) COMMAND_PROBLEM
Wraps: (2) Node 4. Command with error:
  | ```
  | ./workload run bank {pgurl:1} --max-rate=10
  | ```
Wraps: (3) exit status 1
Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
run_060208.084_n4_workload_run_bank: 06:10:45 cluster.go:2337: > result: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2859839-1617861430-04-n4cpu4:4 -- ./workload run bank {pgurl:1} --max-rate=10 returned: exit status 20

However, I think this potentially revealed some behavior we likely want to understand. Namely, when node 2 panic'd because of clock sync, the schemafeed also died on node 1:

teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677  CHANGEFEED job ‹×› returning with error: fetching changes for ‹×›: ‹×›
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +(1) tags: [n1,received-error]
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +Wraps: (2) tags: [n‹×›,sent-error=‹×›]
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +Wraps: (3)
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  | (opaque error wrapper)
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  | reportable 0:
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  |
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed.(*SchemaFeed).fetchDescriptorVersions
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed/schema_feed.go:569
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed.(*SchemaFeed).updateTableHistory
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed/schema_feed.go:297
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed.(*SchemaFeed).pollTableHistory
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed/schema_feed.go:280
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed.(*SchemaFeed).Run
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed/schema_feed.go:242
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  | github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:166
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  | golang.org/x/sync/errgroup.(*Group).Go.func1
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  | runtime.goexit
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +  |    /usr/local/go/src/runtime/asm_amd64.s:1374
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +Wraps: (4) fetching changes for ‹×›
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +Wraps: (5) ‹×›
teamcity-2859839-1617861430-04-n4cpu4-0001> W210408 06:10:45.183551 7949 ccl/changefeedccl/changefeed_stmt.go:609 ⋮ [n1,job=‹×›] 677 +Error types: (1) *contexttags.withContext (2) *contexttags.withContext (3) *errbase.opaqueWrapper (4) *errutil.withPrefix (5) *errors.errorString
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 ⋮ [n1] 678  CHANGEFEED job ‹×›: stepping through state reverting with error: fetching changes for ‹×›: ‹×›
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +(1) tags: [n1,received-error]
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 ⋮ [n1] 678 +Wraps: (2) tags: [n‹×›,sent-error=‹×›]
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +Wraps: (3)
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  | (opaque error wrapper)
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  | reportable 0:
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  |
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed.(*SchemaFeed).fetchDescriptorVersions
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed/schema_feed.go:569
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed.(*SchemaFeed).updateTableHistory
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed/schema_feed.go:297
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed.(*SchemaFeed).pollTableHistory
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed/schema_feed.go:280
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed.(*SchemaFeed).Run
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/schemafeed/schema_feed.go:242
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  | github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:166
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  | golang.org/x/sync/errgroup.(*Group).Go.func1
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  | runtime.goexit
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +  |    /usr/local/go/src/runtime/asm_amd64.s:1374
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 ⋮ [n1] 678 +Wraps: (4) fetching changes for ‹×›
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 ⋮ [n1] 678 +Wraps: (5) ‹×›
teamcity-2859839-1617861430-04-n4cpu4-0001> I210408 06:10:45.183767 7949 jobs/registry.go:1175 â‹® [n1] 678 +Error types: (1) *contexttags.withContext (2) *contexttags.withContext (3) *errbase.opaqueWrapper (4) *errutil.withPrefix (5) *errors.errorString
teamcity-2859839-1617861430-04-n4cpu4-0001> E210408 06:10:45.185192 7949 jobs/adopt.go:260 ⋮ [n1] 679  job ‹×›: adoption completed with error job ‹×›: could not mark as reverting: fetching changes for ‹×›: ‹×›: ‹×›: ‹×›

This appears to be a non-retriable error. But, ideally changefeeds, like the rest of the cluster 3 node cluster, should be able to survive a ~two~ one node failure.

tbg commented 3 years ago

Do you mean one node failure? If two out of three nodes died, I'd expect the job to fail, or at least never finish.

stevendanna commented 3 years ago

Do you mean one node failure?

I did! Thanks :D.

stevendanna commented 3 years ago

I opened https://github.com/cockroachdb/cockroach/issues/63317 to cover the other issue that was found here. Given what I believe the cause is, this is existing behaviour, I don't think it needs to be a release blocker but I'll consult with the rest of the team.

shermanCRL commented 3 years ago

Addressed in https://github.com/cockroachdb/cockroach/issues/63317