
stability: cluster denim restarted with fresh binaries loses nodes #13464

Closed · rjnn closed this issue 7 years ago

rjnn commented 7 years ago

The primary cause seems to be the following error, which has killed 4 of the 6 nodes:

F170207 18:29:07.322401 698433 storage/replica_command.go:1120  [n3,s3,r184/4:/Meta2/Table/64/1/3{202…-742…},@c420858600] range lookup of meta key /Meta2/Table/64/1/347276856 found only non-matching ranges: [{RangeID:442 StartKey:/Table/64/1/480623409 EndKey:/Table/64/1/506902786 Replicas:[{NodeID:4 StoreID:4 ReplicaID:1} {NodeID:5 StoreID:5 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}] NextReplicaID:4} {RangeID:837 StartKey:/Table/64/1/822402614 EndKey:/Table/64/1/848588265 Replicas:[{NodeID:2 StoreID:2 ReplicaID:5} {NodeID:1 StoreID:1 ReplicaID:4} {NodeID:3 StoreID:3 ReplicaID:3}] NextReplicaID:6}]

Shutting down all workers and restarting them loses nodes over time to the same error. Thanks to @a-robinson for helping me triage this. cc @petermattis, @tamird, and @andreimatei, if you're interested in looking at this. I have saved the logs, but would like to wipe the cluster soon (this is blocking me on actually running YCSB), so please speak up if you want the cluster preserved.
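For context, this fatal error fires in evalRangeLookup (storage/replica_command.go) when a scan of the meta2 index returns descriptors but none of them contains the key being looked up. A minimal sketch of that invariant, using simplified stand-in types rather than the real roachpb.RangeDescriptor:

    package main

    import "bytes"

    // RangeDescriptor is a pared-down stand-in for roachpb.RangeDescriptor.
    type RangeDescriptor struct {
        StartKey, EndKey []byte
    }

    // ContainsKey reports whether key falls in the half-open span
    // [StartKey, EndKey).
    func (d RangeDescriptor) ContainsKey(key []byte) bool {
        return bytes.Compare(d.StartKey, key) <= 0 && bytes.Compare(key, d.EndKey) < 0
    }

    // checkLookup mirrors the invariant behind the crash: a meta2 scan for a
    // key must return at least one descriptor whose span contains that key.
    // The real code calls log.Fatalf when it does not; here we just signal it.
    func checkLookup(key []byte, scanned []RangeDescriptor) (RangeDescriptor, bool) {
        for _, d := range scanned {
            if d.ContainsKey(key) {
                return d, true
            }
        }
        return RangeDescriptor{}, false // "found only non-matching ranges"
    }

In the log above, the lookup key /Table/64/1/347276856 sorts before the StartKey of both returned descriptors (/Table/64/1/480623409 and /Table/64/1/822402614), which is exactly the situation this check flags.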

andreimatei commented 7 years ago

I don't know if the various things in #10751 explain that failure, but I don't think so... It seems serious. Would you mind looking at what that descriptor scan returns? Or find another victim to do it...
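(For whoever picks this up: the descriptor scan in question is a raw scan of the meta2 index. Below is a rough sketch of such an inspection, assuming access to an internal *client.DB handle; the helper dumpMeta2 and the exact import paths are mine, so treat this as illustrative rather than a vetted excerpt from the codebase.)

    import (
        "fmt"

        "github.com/cockroachdb/cockroach/pkg/internal/client"
        "github.com/cockroachdb/cockroach/pkg/keys"
        "github.com/cockroachdb/cockroach/pkg/roachpb"
        "golang.org/x/net/context"
    )

    // dumpMeta2 scans the meta2 index and prints every range descriptor it
    // finds, making gaps or overlaps around the failing key easy to spot.
    func dumpMeta2(ctx context.Context, db *client.DB) error {
        // Scan all of meta2; the final 0 means no row limit.
        kvs, err := db.Scan(ctx, keys.Meta2Prefix, keys.MetaMax, 0)
        if err != nil {
            return err
        }
        for _, kv := range kvs {
            var desc roachpb.RangeDescriptor
            if err := kv.ValueProto(&desc); err != nil {
                return err
            }
            fmt.Printf("%s -> r%d [%s, %s)\n", kv.Key, desc.RangeID, desc.StartKey, desc.EndKey)
        }
        return nil
    }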

rjnn commented 7 years ago

I've saved the cluster log files and the cluster data directories locally, if someone wants to take a look at some point, but I'm wiping denim shortly to restart YCSB load generation testing.

andreimatei commented 7 years ago

@petermattis @bdarnell I think we should assign someone in the hope that it won't fall through the cracks.

tamird commented 7 years ago

Yes, we should. Note that inspecting the cluster state may require putting the data directories back on the original machines (or machines with the same IPs); otherwise the cluster won't start.


cuongdo commented 7 years ago

I repro'ed this while running a load test. Here are the steps:

  1. Started a 6-node cluster.
  2. Restored TPC-H test data (~1.7 GB).
  3. Ran tpch -queries=1 continuously against the cluster. This performs only read traffic.
  4. About 72 hours after the load test had been running stably, I shrank the range size (to force lease rebalancing; see the back-of-envelope sketch after this list):
    echo "range_max_bytes: $((2 * 1024 * 1024))" | ./cockroach zone set .default --file=-

Minutes later, I saw this fatal error on 2 of the nodes:

F170403 17:10:45.799021 2590 storage/replica_command.go:1242  [n1,s1,r2929/1:/Meta2/Table/131/9/9{67…-71…},@c42048c380] range lookup of meta key /Meta2/Table/131/9/9672/187153/221433941153021953 found only non-matching ranges: [{RangeID:3133 StartKey:/Table/131/9/9719/104698/221439738435862529 EndKey:/Table/131/9/9766/14763/221444563901022209 Replicas:[{NodeID:3 StoreID:3 ReplicaID:4} {NodeID:4 StoreID:4 ReplicaID:2} {NodeID:2 StoreID:2 ReplicaID:3}] NextReplicaID:5} {RangeID:3156 StartKey:/Table/131/9/9766/14763/221444563901022209 EndKey:/Table/131/9/9812/99811/221426605677543425 Replicas:[{NodeID:5 StoreID:5 ReplicaID:1} {NodeID:4 StoreID:4 ReplicaID:2} {NodeID:2 StoreID:2 ReplicaID:3}] NextReplicaID:4} {RangeID:3209 StartKey:/Table/131/9/9812/99811/221426605677543425 EndKey:/Table/131/9/9858/194819/221451503983099905 Replicas:[{NodeID:3 StoreID:3 ReplicaID:4} {NodeID:4 StoreID:4 ReplicaID:2} {NodeID:2 StoreID:2 ReplicaID:3}] NextReplicaID:5} {RangeID:3262 StartKey:/Table/131/9/9858/194819/221451503983099905 EndKey:/Table/131/9/9905/119904/221416298612654081 Replicas:[{NodeID:3 StoreID:3 ReplicaID:4} {NodeID:4 StoreID:4 ReplicaID:2} {NodeID:2 StoreID:2 ReplicaID:3}] NextReplicaID:5} {RangeID:3335 StartKey:/Table/131/9/9905/119904/221416298612654081 EndKey:/Table/131/9/9951/127438/221445233663541249 Replicas:[{NodeID:5 StoreID:5 ReplicaID:1} {NodeID:4 StoreID:4 ReplicaID:2} {NodeID:2 StoreID:2 ReplicaID:3}] NextReplicaID:4} {RangeID:3368 StartKey:/Table/131/9/9951/127438/221445233663541249 EndKey:/Max Replicas:[{NodeID:5 StoreID:5 ReplicaID:1} {NodeID:4 StoreID:4 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:4}] NextReplicaID:5}]
goroutine 2590 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xa00, 0xc42fa01a89, 0x2d75280, 0xc42276d040)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:837 +0xa7
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x2d768e0, 0xc400000004, 0x25bdca1, 0x1a, 0x4da, 0xc42373a700, 0x63d)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:718 +0x583
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x7fc639659928, 0xc428cc8ba0, 0x7fc600000004, 0x2, 0x1d0fa4a, 0x3f, 0xc423734c00, 0x2, 0x2)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:146 +0x27b
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x7fc639659928, 0xc428cc8ba0, 0x1, 0xc400000004, 0x1d0fa4a, 0x3f, 0xc423734c00, 0x2, 0x2)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:67 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x7fc639659928, 0xc428cc8ba0, 0x1d0fa4a, 0x3f, 0xc423734c00, 0x2, 0x2)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:151 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.evalRangeLookup(0x7fc639659928, 0xc428cc8ba0, 0x7fc63962b390, 0xc4200a8dc0, 0xc42048c380, 0xc421aec960, 0x14b1f26001dc089f, 0x0, 0x100000001, 0x1, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_command.go:1242 +0xdd3
github.com/cockroachdb/cockroach/pkg/storage.executeCmd(0x7fc639659928, 0xc428cc8ba0, 0x0, 0x0, 0x0, 0x7fc63962b390, 0xc4200a8dc0, 0xc42048c380, 0xc421aec960, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_command.go:187 +0x301
github.com/cockroachdb/cockroach/pkg/storage.executeBatch(0x7fc639659928, 0xc428cc8ba0, 0x0, 0x0, 0x7fc63962b390, 0xc4200a8dc0, 0xc42048c380, 0xc421aec960, 0x0, 0x14b1f26001dc089f, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:4372 +0x43c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).addReadOnlyCmd(0xc42048c380, 0x7fc639659928, 0xc428cc8ba0, 0x14b1f26001dc089f, 0x0, 0x100000001, 0x1, 0xb71, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:2059 +0x2c2
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).Send(0xc42048c380, 0x7fc639659928, 0xc428cc8ba0, 0x14b1f26001dc089f, 0x0, 0x100000001, 0x1, 0xb71, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:1467 +0x482
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Send(0xc420300a80, 0x7fc639659928, 0xc428cc8b10, 0x14b1f26001dc089f, 0x0, 0x100000001, 0x1, 0xb71, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2620 +0x645
github.com/cockroachdb/cockroach/pkg/storage.(*Stores).Send(0xc4204d0000, 0x7fc639659928, 0xc428cc8a20, 0x0, 0x0, 0x100000001, 0x1, 0xb71, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/stores.go:187 +0x1cf
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:832 +0x18c
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc4203e3900, 0xc423737830, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:263 +0x105
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc420084000, 0x7fc639659928, 0xc428cc89c0, 0xc4204b2460, 0xc428cc89c0, 0xc42496fad8, 0x69b413)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:843 +0x20a
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc420084000, 0x7fc639659928, 0xc428cc89c0, 0xc4204b2460, 0xc420084000, 0xc421b5ac08, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:860 +0x99
github.com/cockroachdb/cockroach/pkg/roachpb._Internal_Batch_Handler(0x1c85ee0, 0xc420084000, 0x7fc639659928, 0xc428cc88d0, 0xc4204b23f0, 0x0, 0x0, 0x0, 0xcbcf96, 0x7fc63971d768)
    /go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:1867 +0x28d
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc420454a80, 0x2947a00, 0xc42045ba40, 0xc427e61200, 0xc4204ca540, 0x2905290, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:738 +0xaa0
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).handleStream(0xc420454a80, 0x2947a00, 0xc42045ba40, 0xc427e61200, 0x0)
    /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:932 +0x1339
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc420eb0770, 0xc420454a80, 0x2947a00, 0xc42045ba40, 0xc427e61200)
    /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:497 +0xa9
created by github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
    /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:498 +0xa1
cuongdo commented 7 years ago

Logs from the crashing nodes:

cockroach.stderr.1.gz cockroach.stderr.5.gz

petermattis commented 7 years ago

@tamird Can you take a look given your familiarity with the RangeLookup code path?

tamird commented 7 years ago

@cuongdo has this recurred?

cuongdo commented 7 years ago

Not yet, but I haven't run as much load recently. Will report back later this week.

tamird commented 7 years ago

Any news on this? Closing for now since there's not much to go on, but please reopen if it recurs.