cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kvserver: v22.2.5: while applying snapshot: while applying snapshot: while ingesting: link × ×: file exists #97341

Closed cockroach-teamcity closed 1 year ago

cockroach-teamcity commented 1 year ago

This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry link: https://cockroach-labs.sentry.io/issues/3946779534/?referrer=webhooks_plugin

Panic message:

store_raft.go:466: log.Fatal: while applying snapshot: while applying snapshot: while ingesting [× × × × × ×]: link × ×: file exists (1) attached stack trace -- stack trace: | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Store).processRaftSnapshotRequest.func1 | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:466 | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Store).withReplicaForRequest | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:344 | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Store).processRaftSnapshotRequest | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:403 | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Store).receiveSnapshot | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_snapshot.go:1081 | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Store).HandleSnapshot.func1 | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:213 | github.com/cockroachdb/cockroach/pkg/util/stop.(Stopper).RunTaskWithErr | github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:341 | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Store).HandleSnapshot | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:210 | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(RaftTransport).RaftSnapshot | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go:379 | github.com/cockroachdb/cockroach/pkg/kv/kvserver._MultiRaft_RaftSnapshot_Handler | github.com/cockroachdb/cockroach/pkg/kv/kvserver/bazel-out/k8-opt/bin/pkg/kv/kvserver/kvserver_goproto/github.com/cockroachdb/cockroach/pkg/kv/kvserver/storage_services.pb.go:270 | github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.StreamServerInterceptor.func1 | github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:163 | google.golang.org/grpc.chainStreamInterceptors.func1.1 | 
google.golang.org/grpc/external/org_golang_google_grpc/server.go:1408 | github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func4 | github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:272 | google.golang.org/grpc.chainStreamInterceptors.func1.1 | google.golang.org/grpc/external/org_golang_google_grpc/server.go:1411 | github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func2.1 | github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:241 | github.com/cockroachdb/cockroach/pkg/util/stop.(Stopper).RunTaskWithErr | github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:341 | github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func2 | github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:240 | google.golang.org/grpc.chainStreamInterceptors.func1.1 | google.golang.org/grpc/external/org_golang_google_grpc/server.go:1411 | google.golang.org/grpc.chainStreamInterceptors.func1 | google.golang.org/grpc/external/org_golang_google_grpc/server.go:1413 | google.golang.org/grpc.(Server).processStreamingRPC | google.golang.org/grpc/external/org_golang_google_grpc/server.go:1549 | google.golang.org/grpc.(Server).handleStream | google.golang.org/grpc/external/org_golang_google_grpc/server.go:1624 | google.golang.org/grpc.(Server).serveStreams.func1.2 | google.golang.org/grpc/external/org_golang_google_grpc/server.go:922 | runtime.goexit | src/runtime/asm_amd64.s:1594 Wraps: (2) secondary error attachment | while applying snapshot: while ingesting [× × × × × ×]: link × ×: file exists | (1) attached stack trace | -- stack trace: | | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Replica).handleRaftReadyRaftMuLocked | | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:794 | | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Store).processRaftSnapshotRequest.func1 | | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:465 | | 
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Store).withReplicaForRequest | | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:344 | | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Store).processRaftSnapshotRequest | | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:403 | | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Store).receiveSnapshot | | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_snapshot.go:1081 | | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Store).HandleSnapshot.func1 | | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:213 | | github.com/cockroachdb/cockroach/pkg/util/stop.(Stopper).RunTaskWithErr | | github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:341 | | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Store).HandleSnapshot | | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:210 | | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(RaftTransport).RaftSnapshot | | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go:379 | | github.com/cockroachdb/cockroach/pkg/kv/kvserver._MultiRaft_RaftSnapshot_Handler | | github.com/cockroachdb/cockroach/pkg/kv/kvserver/bazel-out/k8-opt/bin/pkg/kv/kvserver/kvserver_goproto/github.com/cockroachdb/cockroach/pkg/kv/kvserver/storage_services.pb.go:270 | | github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.StreamServerInterceptor.func1 | | github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:163 | | google.golang.org/grpc.chainStreamInterceptors.func1.1 | | google.golang.org/grpc/external/org_golang_google_grpc/server.go:1408 | | github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func4 | | github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:272 | | google.golang.org/grpc.chainStreamInterceptors.func1.1 | | 
google.golang.org/grpc/external/org_golang_google_grpc/server.go:1411 | | github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func2.1 | | github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:241 | | github.com/cockroachdb/cockroach/pkg/util/stop.(Stopper).RunTaskWithErr | | github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:341 | | github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func2 | | github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:240 | | google.golang.org/grpc.chainStreamInterceptors.func1.1 | | google.golang.org/grpc/external/org_golang_google_grpc/server.go:1411 | | google.golang.org/grpc.chainStreamInterceptors.func1 | | google.golang.org/grpc/external/org_golang_google_grpc/server.go:1413 | | google.golang.org/grpc.(Server).processStreamingRPC | | google.golang.org/grpc/external/org_golang_google_grpc/server.go:1549 | | google.golang.org/grpc.(Server).handleStream | | google.golang.org/grpc/external/org_golang_google_grpc/server.go:1624 | | google.golang.org/grpc.(Server).serveStreams.func1.2 | | google.golang.org/grpc/external/org_golang_google_grpc/server.go:922 | Wraps: (2) while applying snapshot | Wraps: (3) attached stack trace | -- stack trace: | | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Replica).applySnapshot | | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raftstorage.go:966 | | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Replica).handleRaftReadyRaftMuLocked | | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:792 | | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Store).processRaftSnapshotRequest.func1 | | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:465 | | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).withReplicaForRequest | | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:344 | | gith...

Stacktrace (expand for inline code snippets):
https://github.com/cockroachdb/cockroach/blob/0c6903954dc9cd6c38c78ce5192cfd9a8183c110/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go#L465-L467 in pkg/kv/kvserver.(*Store).processRaftSnapshotRequest.func1
https://github.com/cockroachdb/cockroach/blob/0c6903954dc9cd6c38c78ce5192cfd9a8183c110/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go#L343-L345 in pkg/kv/kvserver.(*Store).withReplicaForRequest
https://github.com/cockroachdb/cockroach/blob/0c6903954dc9cd6c38c78ce5192cfd9a8183c110/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go#L402-L404 in pkg/kv/kvserver.(*Store).processRaftSnapshotRequest
https://github.com/cockroachdb/cockroach/blob/0c6903954dc9cd6c38c78ce5192cfd9a8183c110/pkg/kv/kvserver/pkg/kv/kvserver/store_snapshot.go#L1080-L1082 in pkg/kv/kvserver.(*Store).receiveSnapshot
https://github.com/cockroachdb/cockroach/blob/0c6903954dc9cd6c38c78ce5192cfd9a8183c110/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go#L212-L214 in pkg/kv/kvserver.(*Store).HandleSnapshot.func1
https://github.com/cockroachdb/cockroach/blob/0c6903954dc9cd6c38c78ce5192cfd9a8183c110/pkg/util/stop/stopper.go#L340-L342 in pkg/util/stop.(*Stopper).RunTaskWithErr
https://github.com/cockroachdb/cockroach/blob/0c6903954dc9cd6c38c78ce5192cfd9a8183c110/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go#L209-L211 in pkg/kv/kvserver.(*Store).HandleSnapshot
https://github.com/cockroachdb/cockroach/blob/0c6903954dc9cd6c38c78ce5192cfd9a8183c110/pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go#L378-L380 in pkg/kv/kvserver.(*RaftTransport).RaftSnapshot
https://github.com/cockroachdb/cockroach/blob/0c6903954dc9cd6c38c78ce5192cfd9a8183c110/pkg/kv/kvserver/storage_services.pb.go#L269-L271 in pkg/kv/kvserver._MultiRaft_RaftSnapshot_Handler
https://github.com/cockroachdb/cockroach/blob/0c6903954dc9cd6c38c78ce5192cfd9a8183c110/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go#L162-L164 in pkg/util/tracing/grpcinterceptor.StreamServerInterceptor.func1
google.golang.org/grpc/external/org_golang_google_grpc/server.go#L1407-L1409 in google.golang.org/grpc.chainStreamInterceptors.func1.1
https://github.com/cockroachdb/cockroach/blob/0c6903954dc9cd6c38c78ce5192cfd9a8183c110/pkg/rpc/pkg/rpc/context.go#L271-L273 in pkg/rpc.NewServerEx.func4
google.golang.org/grpc/external/org_golang_google_grpc/server.go#L1410-L1412 in google.golang.org/grpc.chainStreamInterceptors.func1.1
https://github.com/cockroachdb/cockroach/blob/0c6903954dc9cd6c38c78ce5192cfd9a8183c110/pkg/rpc/pkg/rpc/context.go#L240-L242 in pkg/rpc.NewServerEx.func2.1
https://github.com/cockroachdb/cockroach/blob/0c6903954dc9cd6c38c78ce5192cfd9a8183c110/pkg/util/stop/stopper.go#L340-L342 in pkg/util/stop.(*Stopper).RunTaskWithErr
https://github.com/cockroachdb/cockroach/blob/0c6903954dc9cd6c38c78ce5192cfd9a8183c110/pkg/rpc/pkg/rpc/context.go#L239-L241 in pkg/rpc.NewServerEx.func2
google.golang.org/grpc/external/org_golang_google_grpc/server.go#L1410-L1412 in google.golang.org/grpc.chainStreamInterceptors.func1.1
google.golang.org/grpc/external/org_golang_google_grpc/server.go#L1412-L1414 in google.golang.org/grpc.chainStreamInterceptors.func1
google.golang.org/grpc/external/org_golang_google_grpc/server.go#L1548-L1550 in google.golang.org/grpc.(*Server).processStreamingRPC
google.golang.org/grpc/external/org_golang_google_grpc/server.go#L1623-L1625 in google.golang.org/grpc.(*Server).handleStream
google.golang.org/grpc/external/org_golang_google_grpc/server.go#L921-L923 in google.golang.org/grpc.(*Server).serveStreams.func1.2
src/runtime/asm_amd64.s#L1593-L1595 in runtime.goexit

pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go in pkg/kv/kvserver.(*Store).processRaftSnapshotRequest.func1 at line 466
pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go in pkg/kv/kvserver.(*Store).withReplicaForRequest at line 344
pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go in pkg/kv/kvserver.(*Store).processRaftSnapshotRequest at line 403
pkg/kv/kvserver/pkg/kv/kvserver/store_snapshot.go in pkg/kv/kvserver.(*Store).receiveSnapshot at line 1081
pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go in pkg/kv/kvserver.(*Store).HandleSnapshot.func1 at line 213
pkg/util/stop/stopper.go in pkg/util/stop.(*Stopper).RunTaskWithErr at line 341
pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go in pkg/kv/kvserver.(*Store).HandleSnapshot at line 210
pkg/kv/kvserver/pkg/kv/kvserver/raft_transport.go in pkg/kv/kvserver.(*RaftTransport).RaftSnapshot at line 379
pkg/kv/kvserver/storage_services.pb.go in pkg/kv/kvserver._MultiRaft_RaftSnapshot_Handler at line 270
pkg/util/tracing/grpcinterceptor/grpc_interceptor.go in pkg/util/tracing/grpcinterceptor.StreamServerInterceptor.func1 at line 163
google.golang.org/grpc/external/org_golang_google_grpc/server.go in google.golang.org/grpc.chainStreamInterceptors.func1.1 at line 1408
pkg/rpc/pkg/rpc/context.go in pkg/rpc.NewServerEx.func4 at line 272
google.golang.org/grpc/external/org_golang_google_grpc/server.go in google.golang.org/grpc.chainStreamInterceptors.func1.1 at line 1411
pkg/rpc/pkg/rpc/context.go in pkg/rpc.NewServerEx.func2.1 at line 241
pkg/util/stop/stopper.go in pkg/util/stop.(*Stopper).RunTaskWithErr at line 341
pkg/rpc/pkg/rpc/context.go in pkg/rpc.NewServerEx.func2 at line 240
google.golang.org/grpc/external/org_golang_google_grpc/server.go in google.golang.org/grpc.chainStreamInterceptors.func1.1 at line 1411
google.golang.org/grpc/external/org_golang_google_grpc/server.go in google.golang.org/grpc.chainStreamInterceptors.func1 at line 1413
google.golang.org/grpc/external/org_golang_google_grpc/server.go in google.golang.org/grpc.(*Server).processStreamingRPC at line 1549
google.golang.org/grpc/external/org_golang_google_grpc/server.go in google.golang.org/grpc.(*Server).handleStream at line 1624
google.golang.org/grpc/external/org_golang_google_grpc/server.go in google.golang.org/grpc.(*Server).serveStreams.func1.2 at line 922
src/runtime/asm_amd64.s in runtime.goexit at line 1594

| Tag | Value |
| --- | --- |
| Cockroach Release | v22.2.5 |
| Cockroach SHA: | 0c6903954dc9cd6c38c78ce5192cfd9a8183c110 |
| Platform | darwin amd64 |
| Distribution | CCL |
| Environment | v22.2.5 |
| Command | server |
| Go Version | `` |
| # of CPUs | |
| # of Goroutines | |

Jira issue: CRDB-24646

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

tbg commented 1 year ago

This comes from IngestExternalFilesWithStats:

https://github.com/cockroachdb/cockroach/blob/3fe8ffc7f854b6ff9689735dbb1e6ba6e661e027/pkg/kv/kvserver/replica_raftstorage.go#L964-L967

I assume internally it's trying to hard-link the SST into the LSM, and is finding that an SST of that seqno already exists.
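For readers unfamiliar with where the "link × ×: file exists" text comes from, here is a minimal sketch in plain Go of a hard link failing because the destination name is already taken. It uses only the standard library; the file names are hypothetical, and it does not reproduce the storage engine's ingestion code.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// linkCollision hard-links a source file to the same destination name twice
// and reports whether the second attempt failed because the destination
// already existed.
func linkCollision() (bool, error) {
	dir, err := os.MkdirTemp("", "link-demo")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	// Hypothetical names: an incoming snapshot SST and its target in the store.
	src := filepath.Join(dir, "incoming.sst")
	dst := filepath.Join(dir, "000007.sst")
	if err := os.WriteFile(src, []byte("data"), 0o644); err != nil {
		return false, err
	}
	// The first link succeeds because dst does not exist yet.
	if err := os.Link(src, dst); err != nil {
		return false, err
	}
	// The second link targets a name that is now taken, so it fails.
	err = os.Link(src, dst)
	return errors.Is(err, fs.ErrExist), nil
}

func main() {
	collided, err := linkCollision()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("second link hit an existing file:", collided)
}
```

The error returned by the second `os.Link` call is an `*os.LinkError`, whose message has the form `link <old> <new>: file exists`, which matches the shape of the redacted message in the report.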

This shouldn't have anything to do with the SSTs we passed in, since ingested SSTs are assigned a new number from a counter, right?

Note that this is darwin, so it's not high on our priority list, though we'd want to make sure there isn't a general bug in IngestExternalFilesWithStats.

Handing this over to the storage team in case they want to look into it more before closing out.

jbowens commented 1 year ago

There were two events for different nodes (n2 and n3) of the same cluster within 1 second of one another.

I double-checked the code; the logic here is pretty simple. We increment a counter while holding a mutex to obtain unique file numbers. During Open, we ratchet the next file number up beyond the largest file number in the directory, so even a stray sstable that's not part of the LSM cannot lead to this error.
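The scheme described above can be sketched roughly as follows. This is an illustration only: `fileNumAllocator` and `newAllocator` are hypothetical names, not the engine's actual types, and the directory listing is passed in as a plain slice of numbers.

```go
package main

import (
	"fmt"
	"sync"
)

// fileNumAllocator hands out unique, increasing file numbers.
// Hypothetical type for illustration; not the storage engine's code.
type fileNumAllocator struct {
	mu   sync.Mutex
	next uint64
}

// newAllocator mimics the Open-time ratchet: the first number handed out is
// larger than every number already present in the directory, whether or not
// that file belongs to the LSM.
func newAllocator(existing []uint64) *fileNumAllocator {
	a := &fileNumAllocator{next: 1}
	for _, n := range existing {
		if n >= a.next {
			a.next = n + 1
		}
	}
	return a
}

// Next returns a fresh file number. The mutex makes concurrent callers
// observe distinct values.
func (a *fileNumAllocator) Next() uint64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	n := a.next
	a.next++
	return n
}

func main() {
	// Files 3, 5 and 7 are on disk at open time; 7 may be a stray sstable.
	a := newAllocator([]uint64{3, 7, 5})
	fmt.Println(a.Next(), a.Next()) // prints: 8 9
}
```

Under this sketch, a collision is only possible if a file with a number at or above `next` appears in the directory after the allocator was initialized.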

The only pathway to this error that I can see is adding an sstable to the directory while the engine is already open. The user could've manually copied a higher-numbered sstable into the directory, or two cockroach processes could be conflicting, sharing each other's tables. The latter seems unlikely because we use a file lock to prevent it, and the user would need to remove that lock, although it might explain the double failure across two nodes.
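As a rough sketch of how a lock file can keep two openers out of the same directory, the following uses an advisory `flock` on Unix-like systems. The `LOCK` file name and the `lockDir` helper are assumptions made for the example, and the engine's real locking code may differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// lockDir takes an exclusive, non-blocking advisory lock on a LOCK file
// inside dir. Hypothetical helper; Unix-only because it relies on flock.
func lockDir(dir string) (*os.File, error) {
	f, err := os.OpenFile(filepath.Join(dir, "LOCK"), os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

// secondOpenRejected reports whether a second attempt to lock the same
// directory fails while the first lock is still held.
func secondOpenRejected() (bool, error) {
	dir, err := os.MkdirTemp("", "lock-demo")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	first, err := lockDir(dir)
	if err != nil {
		return false, err
	}
	defer first.Close()

	// A separate open of the same file gets its own lock state, so this
	// behaves like a second process trying to open the store.
	second, err := lockDir(dir)
	if err == nil {
		second.Close()
		return false, nil
	}
	return true, nil
}

func main() {
	rejected, err := secondOpenRejected()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("second opener rejected:", rejected)
}
```

With a scheme like this, deleting the lock file while the first holder is running lets a second opener create and lock a fresh file, which is one way an operator could end up with two processes sharing a directory.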

jbowens commented 1 year ago

Going to close this out under the assumption that this was an operator doing something silly (like forcing two processes to share a store directory). We can re-examine if there's ever an additional report.