cockroach-teamcity — closed 3 years ago
This looks like a real failure. It has the same symptoms as what we saw in https://github.com/cockroachdb/cockroach/issues/53540, though there is a fairly wide set of things that could go wrong to cause this assertion to fire.
F210413 21:52:12.822800 1382465 kv/kvserver/concurrency/concurrency_manager.go:263 ⋮ [n3,s3,r1386/2:‹/Table/5{8/1/341/6-9}›] discovered lock by different transaction than existing lock
goroutine 1382465 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x9728501, 0xc00ac00330, 0xc0c0000004, 0x8)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x9727640, 0x4, 0x1675896ac7e568b8, 0x151841, 0x896b2d1, 0x2e, 0x107, 0xc003954240, 0x3b, 0xc011554870, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:307 +0x1079
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x6345d80, 0xc041b0b260, 0x4, 0x2, 0x521e98c, 0x2, 0xc018e3c2f0, 0x1, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:52 +0x1ac
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x6345d80, 0xc041b0b260, 0x1, 0x4, 0x521e98c, 0x2, 0xc018e3c2f0, 0x1, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:165
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).HandleWriterIntentError(0xc0263ec280, 0x6345d80, 0xc041b0b260, 0xc00948dab0, 0xe, 0xc00b069600, 0xc00b069600, 0xc00b069601)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:263 +0x1cb
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleWriteIntentError(0xc00ac00000, 0x6345d80, 0xc041b0b260, 0xc009592360, 0xc00948dab0, 0x1675896a8f0df3dc, 0x0, 0x0, 0x300000003, 0x2, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:453 +0xaa
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc00ac00000, 0x6345d80, 0xc041b0b260, 0xc009592360, 0x5829dc8, 0x0, 0xc029b18050)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:374 +0x6ec
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc00ac00000, 0x6345d80, 0xc041b0b230, 0x56a, 0xc009592360, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:100 +0x5c2
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(0xc00ac00000, 0x6345d80, 0xc041b0b230, 0x167589691852e487, 0x3, 0x300000003, 0x2, 0x0, 0x56a, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:37 +0x93
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(0xc000ea0700, 0x6345d80, 0xc041b0b1d0, 0x167589691852e487, 0x3, 0x300000003, 0x2, 0x0, 0x56a, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:194 +0x5af
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(0xc000b5f500, 0x6345d80, 0xc041b0b1d0, 0x0, 0x0, 0x300000003, 0x2, 0x0, 0x56a, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:177 +0x108
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x6345d80, 0xc041b0b1a0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:898 +0x220
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc0002353b0, 0x6345d80, 0xc041b0b1a0, 0x524685b, 0x10, 0xc018e3d600, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:326 +0x140
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc000a38b00, 0x6345d80, 0xc041b0b1a0, 0xc0095922d0, 0xc041b0b1a0, 0xc00004e000, 0x4ccd5a0)
/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:886 +0x197
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc000a38b00, 0x6345d80, 0xc041b0b170, 0xc0095922d0, 0xc000a38b00, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:924 +0x9f
github.com/cockroachdb/cockroach/pkg/roachpb._Internal_Batch_Handler.func1(0x6345d80, 0xc041b0b170, 0x51ed5c0, 0xc0095922d0, 0x51ed5c0, 0xc0095922d0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:10692 +0x86
github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc.OpenTracingServerInterceptor.func1(0x6345d80, 0xc041b0b170, 0x51ed5c0, 0xc0095922d0, 0xc021e0ca20, 0xc021e0ca40, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc/server.go:44 +0x9ef
google.golang.org/grpc.getChainUnaryHandler.func1(0x6345d80, 0xc041b0b170, 0x51ed5c0, 0xc0095922d0, 0xc018e3da80, 0xf84028, 0x4f16320, 0xc008e5a980)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:921 +0xe7
github.com/cockroachdb/cockroach/pkg/rpc.NewServer.func1(0x6345d80, 0xc041b0b170, 0x51ed5c0, 0xc0095922d0, 0xc021e0ca20, 0xc008e5a980, 0xc008e5a980, 0x20, 0x4c89340, 0xc03b079d01)
/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:199 +0xa8
google.golang.org/grpc.chainUnaryServerInterceptors.func1(0x6345d80, 0xc041b0b170, 0x51ed5c0, 0xc0095922d0, 0xc021e0ca20, 0xc021e0ca40, 0xc03bd37b58, 0x978678, 0x4f32d80, 0xc041b0b170)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:907 +0xca
github.com/cockroachdb/cockroach/pkg/roachpb._Internal_Batch_Handler(0x519f9a0, 0xc000a38b00, 0x6345d80, 0xc041b0b170, 0xc03b079da0, 0xc001243980, 0x6345d80, 0xc041b0b170, 0xc01dc7af00, 0x121)
/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:10694 +0x14b
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0000e0b60, 0x63ca3c0, 0xc002c0b680, 0xc014af6b00, 0xc00130a300, 0x8d706e0, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1082 +0x4fd
google.golang.org/grpc.(*Server).handleStream(0xc0000e0b60, 0x63ca3c0, 0xc002c0b680, 0xc014af6b00, 0x0)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1405 +0xd25
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc0024f0190, 0xc0000e0b60, 0x63ca3c0, 0xc002c0b680, 0xc014af6b00)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:746 +0xbb
created by google.golang.org/grpc.(*Server).serveStreams.func1
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:744 +0xa1
Erik is going to spin up a repro machine a la #63533 to see what shakes out.
I ran a series of 20 of these overnight. 10 of them passed and 10 of them hit the nightly clock sync issue we've been seeing. So I'm going to kick off more tests as well.
The three potentially concerning changes between v20.2.7 and this SHA are:
- `mergingIter.SeekPrefixGE`: https://github.com/cockroachdb/cockroach/pull/62907

I got sidetracked by a few other things and haven't had a chance to kick this off yet, so feel free to have a look. I'm a bit busy tonight, but will try to get it kicked off before bedtime.
Ran 20 overnight as well, with core dumps enabled, but saw no failures. Will kick off another 100.
Laptop lost network connection, but ran 30 overnight with nothing but clock failures. Doing another 200 in batches of 50 in `andrei-jepsen`.
I also ran 20 more successful runs over the weekend. This is turning out to be quite difficult to reproduce.
Did we decide this morning that, given the difficulty in reproducing and the other urgent 20.2 fixes, we were going to remove the release-blocker status from this one?
Yes, we decided to proceed with the release despite this issue, given the severity of some of the other fixes. I'm hesitant to remove the release-blocker label, though, because I think we would still want this to block any future patch release and we plan to continue investigating.
Did another 200, no failures. Kicking off another 1000 in batches of 100.
Do we have any ideas of what might be causing this, so that we can try to provoke a failure?
I got through 850 runs before clock drift took out the clusters, no signs of this error (or any other non-clock failures). I'm not sure how fruitful this approach will be -- should we just keep going?
In the logs of the failure, we see that one of the nodes was badly overloaded while running with `--active-warehouses=745`. We saw in other investigations recently that it can be helpful to pin the warehouse count to the number we saw the original issue on. This is what @tbg did in https://github.com/tbg/cockroach/commit/92a8ad8be37f2c8bd8fd31bb65c7fb34cec4236c - we should try doing the same here (including the WAL archiving and capturing a core dump on fatal), at least for one last pass.
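For reference, the core-dump part of that setup can be reproduced with standard Go runtime knobs. This is a minimal, hypothetical sketch, not the exact changes from tbg@92a8ad8 (which also archived Pebble WALs); the `cockroach start` invocation is cluster-specific and shown only as a comment:

```shell
# Make a Go fatal error produce a core dump:
# GOTRACEBACK=crash makes the Go runtime abort (SIGABRT) instead of just
# printing stacks and exiting, and an unlimited core size lets the kernel
# actually write the dump.
ulimit -c unlimited
export GOTRACEBACK=crash

# Then start the node in this same shell, e.g. (flags omitted):
# ./cockroach start --store=/mnt/data1 ...
```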
> Do we have any ideas of what might be causing this, so that we can try to provoke a failure?
It's not immediately clear. This lock table state is shadowing the value of intents in the persistent state machine. The fatal means that something got out of whack, but it's not clear where. Last time we hit this, the bug ended up being in Pebble https://github.com/cockroachdb/cockroach/issues/53540. The time before that, it was a bug with the lock table's state transitions: https://github.com/cockroachdb/cockroach/issues/50996.
What do you think about updating this fatal message to include more information about the two transactions, to ensure that we can better debug this if we hit it again here or in the wild? If we do so, we'll want to make sure that enough information is marked as safe so that it doesn't get redacted out of customer logs or sentry reports.
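To make the failure mode and the proposed message concrete, here is a minimal, hypothetical Go sketch of the shape of such a check: the lock table tracks one holder per key, and discovering an intent from a different transaction than the recorded holder trips the assertion. The `lockState` type and the `‹...›` markers are stand-ins for the real lock table and redaction machinery, not the actual cockroach code; note the key (user data) is marked for redaction while the transaction IDs are left as safe.

```go
package main

import "fmt"

// txnID stands in for a real transaction UUID.
type txnID string

// lockState is a toy stand-in for the lock table's per-key state.
type lockState struct {
	key    string
	holder txnID
}

// checkDiscoveredLock mimics the invariant that fired: a discovered intent
// on a key must belong to the same transaction as the lock already tracked
// for that key. On mismatch it returns an error carrying both transactions,
// with the key wrapped in ‹...› to mimic redaction markers.
func checkDiscoveredLock(ls *lockState, discovered txnID) error {
	if ls.holder != discovered {
		return fmt.Errorf(
			"discovered lock by different transaction than existing lock: "+
				"key=‹%s› existing=%s discovered=%s",
			ls.key, ls.holder, discovered)
	}
	return nil
}

func main() {
	ls := &lockState{key: "/Table/58/1/341/6", holder: "txn-a"}
	fmt.Println(checkDiscoveredLock(ls, "txn-a") == nil) // same txn: ok
	fmt.Println(checkDiscoveredLock(ls, "txn-b"))        // mismatch: detailed error
}
```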
> We saw in other investigations recently that it can be helpful to pin the warehouse count to the number we saw the original issue on. This is what @tbg did in tbg@92a8ad8 - we should try doing the same here (including the WAL archiving and capturing a core dump on fatal), at least for one last pass.
Good idea, I've started another 1000 runs with these changes.
> What do you think about updating this fatal message to include more information about the two transactions to ensure that we can better debug this if we hit it again here or in the wild?
Submitted #64130.
Did another 1000 runs, no failures. While this seems like a potentially serious bug, I'm not sure how to investigate it without a reproduction or more info, so I suggest we add details to the error message in #64130 and wait for it to crop up again at some point.
Sounds good to me. I think we should close this issue then, too. Let's refer to #63592 (i.e. this issue) in the message added in #64130 if we don't already, so that we can pick up the thread again if it ever fires again.
> Let's refer to #63592 (i.e. this issue) in the message added in #64130
Good idea, done.
I don't think this could be explained by #64228 if this test wasn't running interleaved tables, unless of course there are other places where we mutate keys returned by batches from KV. I think this is somewhat unlikely, but we should probably find a way to test this assumption holistically.
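The class of bug being ruled out here is mutating a byte slice that still aliases memory owned by the storage layer. A minimal, hypothetical illustration (the `fakeBatchGet` helper is invented for this sketch, not a real KV API):

```go
package main

import "fmt"

// fakeBatchGet stands in for a KV batch returning a key whose backing array
// is still owned by the storage engine's internal buffer: it returns a view,
// not a copy.
func fakeBatchGet(buf []byte) []byte {
	return buf[:4]
}

func main() {
	engineBuf := []byte("key1key2") // pretend this is storage-owned memory
	k := fakeBatchGet(engineBuf)

	// Mutating the returned key silently corrupts the engine's buffer too.
	k[3] = 'X'
	fmt.Println(string(engineBuf)) // keyXkey2

	// The safe pattern: copy before mutating.
	engineBuf = []byte("key1key2")
	k = append([]byte(nil), fakeBatchGet(engineBuf)...)
	k[3] = 'X'
	fmt.Println(string(engineBuf)) // key1key2
}
```

Testing this assumption "holistically", as suggested above, would mean verifying that no caller mutates slices handed back by batch responses without copying first.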
(roachtest).tpccbench/nodes=3/cpu=4 failed on release-20.2@e6faf24869a5b040ec79edd6bc87c6a0668cb5e6:
Artifacts: [/tpccbench/nodes=3/cpu=4](https://teamcity.cockroachdb.com/viewLog.html?buildId=2878627&tab=artifacts#/tpccbench/nodes=3/cpu=4)

Related:
- #63296 roachtest: tpccbench/nodes=3/cpu=4 failed [overload]

Labels: [C-test-failure](https://api.github.com/repos/cockroachdb/cockroach/labels/C-test-failure) [O-roachtest](https://api.github.com/repos/cockroachdb/cockroach/labels/O-roachtest) [O-robot](https://api.github.com/repos/cockroachdb/cockroach/labels/O-robot) [branch-release-21.1](https://api.github.com/repos/cockroachdb/cockroach/labels/branch-release-21.1) [release-blocker](https://api.github.com/repos/cockroachdb/cockroach/labels/release-blocker)

[See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Atpccbench%2Fnodes%3D3%2Fcpu%3D4.%2A&sort=title&restgroup=false&display=lastcommented+project) | powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)