cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.07k stars 3.8k forks source link

roachtest: tpccbench/nodes=3/cpu=4 failed [discovered lock by different transaction than existing lock] #63592

Closed cockroach-teamcity closed 3 years ago

cockroach-teamcity commented 3 years ago

(roachtest).tpccbench/nodes=3/cpu=4 failed on release-20.2@e6faf24869a5b040ec79edd6bc87c6a0668cb5e6:

test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=4/run_1
    tpcc.go:873,tpcc.go:576,test_runner.go:755: monitor failure: unexpected node event: 3: dead
        (1) attached stack trace
          -- stack trace:
          | main.(*monitor).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2643
          | main.runTPCCBench.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:850
          | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
          | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
          | main.runTPCCBench
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:759
          | main.registerTPCCBenchSpec.func1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:576
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:755
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1374
        Wraps: (2) monitor failure
        Wraps: (3) unexpected node event: 3: dead
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

    cluster.go:1658,context.go:135,cluster.go:1647,test_runner.go:836: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2878627-1618340825-16-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
        3: dead
        1: 23207
        2: 20530
        Error: UNCLASSIFIED_PROBLEM: 3: dead
        (1) UNCLASSIFIED_PROBLEM
        Wraps: (2) attached stack trace
          -- stack trace:
          | main.glob..func14
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
          | main.wrap.func1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
          | github.com/spf13/cobra.(*Command).execute
          |     /home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
          | github.com/spf13/cobra.(*Command).ExecuteC
          |     /home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
          | github.com/spf13/cobra.(*Command).Execute
          |     /home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
          | main.main
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:204
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1374
        Wraps: (3) 3: dead
        Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
More

Artifacts: [/tpccbench/nodes=3/cpu=4](https://teamcity.cockroachdb.com/viewLog.html?buildId=2878627&tab=artifacts#/tpccbench/nodes=3/cpu=4) Related: - #63296 roachtest: tpccbench/nodes=3/cpu=4 failed [overload] [C-test-failure](https://api.github.com/repos/cockroachdb/cockroach/labels/C-test-failure) [O-roachtest](https://api.github.com/repos/cockroachdb/cockroach/labels/O-roachtest) [O-robot](https://api.github.com/repos/cockroachdb/cockroach/labels/O-robot) [branch-release-21.1](https://api.github.com/repos/cockroachdb/cockroach/labels/branch-release-21.1) [release-blocker](https://api.github.com/repos/cockroachdb/cockroach/labels/release-blocker) [See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Atpccbench%2Fnodes%3D3%2Fcpu%3D4.%2A&sort=title&restgroup=false&display=lastcommented+project) powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)

nvanbenschoten commented 3 years ago

This looks like a real failure. It has the same symptoms as what we saw in https://github.com/cockroachdb/cockroach/issues/53540, though there is a fairly wide set of things that could go wrong to cause this assertion to fire.

F210413 21:52:12.822800 1382465 kv/kvserver/concurrency/concurrency_manager.go:263 ⋮ [n3,s3,r1386/2:‹/Table/5{8/1/341/6-9}›] discovered lock by different transaction than existing lock
goroutine 1382465 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x9728501, 0xc00ac00330, 0xc0c0000004, 0x8)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x9727640, 0x4, 0x1675896ac7e568b8, 0x151841, 0x896b2d1, 0x2e, 0x107, 0xc003954240, 0x3b, 0xc011554870, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:307 +0x1079
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x6345d80, 0xc041b0b260, 0x4, 0x2, 0x521e98c, 0x2, 0xc018e3c2f0, 0x1, 0x1)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:52 +0x1ac
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x6345d80, 0xc041b0b260, 0x1, 0x4, 0x521e98c, 0x2, 0xc018e3c2f0, 0x1, 0x1)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:165
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).HandleWriterIntentError(0xc0263ec280, 0x6345d80, 0xc041b0b260, 0xc00948dab0, 0xe, 0xc00b069600, 0xc00b069600, 0xc00b069601)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:263 +0x1cb
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleWriteIntentError(0xc00ac00000, 0x6345d80, 0xc041b0b260, 0xc009592360, 0xc00948dab0, 0x1675896a8f0df3dc, 0x0, 0x0, 0x300000003, 0x2, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:453 +0xaa
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc00ac00000, 0x6345d80, 0xc041b0b260, 0xc009592360, 0x5829dc8, 0x0, 0xc029b18050)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:374 +0x6ec
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc00ac00000, 0x6345d80, 0xc041b0b230, 0x56a, 0xc009592360, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:100 +0x5c2
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(0xc00ac00000, 0x6345d80, 0xc041b0b230, 0x167589691852e487, 0x3, 0x300000003, 0x2, 0x0, 0x56a, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:37 +0x93
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(0xc000ea0700, 0x6345d80, 0xc041b0b1d0, 0x167589691852e487, 0x3, 0x300000003, 0x2, 0x0, 0x56a, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:194 +0x5af
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(0xc000b5f500, 0x6345d80, 0xc041b0b1d0, 0x0, 0x0, 0x300000003, 0x2, 0x0, 0x56a, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:177 +0x108
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x6345d80, 0xc041b0b1a0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:898 +0x220
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc0002353b0, 0x6345d80, 0xc041b0b1a0, 0x524685b, 0x10, 0xc018e3d600, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:326 +0x140
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc000a38b00, 0x6345d80, 0xc041b0b1a0, 0xc0095922d0, 0xc041b0b1a0, 0xc00004e000, 0x4ccd5a0)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:886 +0x197
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc000a38b00, 0x6345d80, 0xc041b0b170, 0xc0095922d0, 0xc000a38b00, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:924 +0x9f
github.com/cockroachdb/cockroach/pkg/roachpb._Internal_Batch_Handler.func1(0x6345d80, 0xc041b0b170, 0x51ed5c0, 0xc0095922d0, 0x51ed5c0, 0xc0095922d0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:10692 +0x86
github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc.OpenTracingServerInterceptor.func1(0x6345d80, 0xc041b0b170, 0x51ed5c0, 0xc0095922d0, 0xc021e0ca20, 0xc021e0ca40, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc/server.go:44 +0x9ef
google.golang.org/grpc.getChainUnaryHandler.func1(0x6345d80, 0xc041b0b170, 0x51ed5c0, 0xc0095922d0, 0xc018e3da80, 0xf84028, 0x4f16320, 0xc008e5a980)
    /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:921 +0xe7
github.com/cockroachdb/cockroach/pkg/rpc.NewServer.func1(0x6345d80, 0xc041b0b170, 0x51ed5c0, 0xc0095922d0, 0xc021e0ca20, 0xc008e5a980, 0xc008e5a980, 0x20, 0x4c89340, 0xc03b079d01)
    /go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:199 +0xa8
google.golang.org/grpc.chainUnaryServerInterceptors.func1(0x6345d80, 0xc041b0b170, 0x51ed5c0, 0xc0095922d0, 0xc021e0ca20, 0xc021e0ca40, 0xc03bd37b58, 0x978678, 0x4f32d80, 0xc041b0b170)
    /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:907 +0xca
github.com/cockroachdb/cockroach/pkg/roachpb._Internal_Batch_Handler(0x519f9a0, 0xc000a38b00, 0x6345d80, 0xc041b0b170, 0xc03b079da0, 0xc001243980, 0x6345d80, 0xc041b0b170, 0xc01dc7af00, 0x121)
    /go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:10694 +0x14b
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0000e0b60, 0x63ca3c0, 0xc002c0b680, 0xc014af6b00, 0xc00130a300, 0x8d706e0, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1082 +0x4fd
google.golang.org/grpc.(*Server).handleStream(0xc0000e0b60, 0x63ca3c0, 0xc002c0b680, 0xc014af6b00, 0x0)
    /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1405 +0xd25
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc0024f0190, 0xc0000e0b60, 0x63ca3c0, 0xc002c0b680, 0xc014af6b00)
    /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:746 +0xbb
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:744 +0xa1
tbg commented 3 years ago

Erik is going to spin up a repro machine a la #63533 to see what shakes out.

nvanbenschoten commented 3 years ago

I ran a series of 20 of these overnight. 10 of them passed and 10 of them hit the nightly clock sync issue we've been seeing. So I'm going to kick off more tests as well.

nvanbenschoten commented 3 years ago

The three potentially concerning changes between v20.2.7 and this SHA are:

erikgrinaker commented 3 years ago

I got sidetracked by a few other things and haven't had a chance to kick this off yet, so feel free to have a look. I'm a bit busy tonight, but will try to get it kicked off before bedtime.

erikgrinaker commented 3 years ago

Ran 20 overnight as well, with core dumps enabled, but saw no failures. Will kick off another 100.

erikgrinaker commented 3 years ago

Laptop lost network connection, but ran 30 overnight with nothing but clock failures. Doing another 200 in batches of 50 in andrei-jepsen.

nvanbenschoten commented 3 years ago

I also ran 20 more successful runs over the weekend. This is turning out to be quite difficult to reproduce.

dt commented 3 years ago

did we decide this morning that given the difficulty in reproducing and other urgent 20.2 fixes, we were going to remove the release-blocker status from this one?

nvanbenschoten commented 3 years ago

Yes, we decided to proceed with the release despite this issue, given the severity of some of the other fixes. I'm hesitant to remove the release-blocker label, though, because I think we would still want this to block any future patch release and we plan to continue investigating.

erikgrinaker commented 3 years ago

Did another 200, no failures. Kicking off another 1000 in batches of 100.

Do we have any ideas of what might be causing this, so that we can try to provoke a failure?

erikgrinaker commented 3 years ago

I got through 850 runs before clock drift took out the clusters, no signs of this error (or any other non-clock failures). I'm not sure how fruitful this approach will be -- should we just keep going?

nvanbenschoten commented 3 years ago

In the logs of the failure, we see that one of the nodes was badly overloaded while running with --active-warehouses=745. We saw in other investigations recently that it can be helpful to pin the warehouse count to the number we saw the original issue on. This is what @tbg did in https://github.com/tbg/cockroach/commit/92a8ad8be37f2c8bd8fd31bb65c7fb34cec4236c - we should try doing to same here (including the WAL archiving and capturing a core dump on fatal), at least for one last pass.

Do we have any ideas of what might be causing this, so that we can try to provoke a failure?

It's not immediately clear. This lock table state is shadowing the value of intents in the persistent state machine. The fatal means that something got out of whack, but it's not clear where. Last time we hit this, the bug ended up being in Pebble https://github.com/cockroachdb/cockroach/issues/53540. The time before that, it was a bug with the lock table's state transitions: https://github.com/cockroachdb/cockroach/issues/50996.

What do you think about updating this fatal message to include more information about the two transactions to ensure that we can better debug this if we hit it again here or in the wild? If we do so, we'll want to make sure that that enough information is marked as safe so that it doesn't get redacted out of customer logs or sentry reports.

erikgrinaker commented 3 years ago

We saw in other investigations recently that it can be helpful to pin the warehouse count to the number we saw the original issue on. This is what @tbg did in tbg@92a8ad8 - we should try doing to same here (including the WAL archiving and capturing a core dump on fatal), at least for one last pass.

Good idea, I've started another 1000 runs with these changes.

What do you think about updating this fatal message to include more information about the two transactions to ensure that we can better debug this if we hit it again here or in the wild?

Submitted #64130.

erikgrinaker commented 3 years ago

Did another 1000 runs, no failures. While this seems like a potentially serious bug, I'm not sure how to investigate it without a reproduction or more info, so I suggest we add details to the error message in #64130 and wait for it to crop up again at some point.

tbg commented 3 years ago

Sounds good to me. I think we should close this issue then, too. Let's refer to #63592 (i.e. this issue) in the message added in #64130 if we don't already, so that we can pick up the thread again if it ever fires again.

erikgrinaker commented 3 years ago

Let's refer to #63592 (i.e. this issue) in the message added in #64130

Good idea, done.

jordanlewis commented 3 years ago

I don't think this could be explained by #64228 if this test wasn't running interleaved tables, unless of course there are other places where we mutate keys returned by batches from KV. I think this is somewhat unlikely, but we should probably find a way to test this assumption holistically.