Closed rjnn closed 7 years ago
I don't know if the various things in #10751 explain that failure, but I don't think so... It seems serious. Would you mind looking into what that descriptor scan returns? Or finding another victim to do it...
I've saved the cluster log files and the cluster data directories locally, if someone wants to take a look at some point, but I'm wiping denim shortly to restart YCSB load generation testing.
@petermattis @bdarnell I think we should assign someone in the hope that it won't fall through the cracks.
Yes, we should. Note that inspecting the cluster state may require putting the data directories back on the original machines (or else machines that have the same IPs) or else the cluster won't start.
I repro'ed this while running a load test. Here are the steps:

1. Run `tpch -queries=1` continuously against the cluster. This performs only read traffic.
2. Run `echo "range_max_bytes: $[2 * 1024 * 1024]" | ./cockroach zone set .default --file=-`.

Minutes after, I see this fatal error on 2 of the nodes:
F170403 17:10:45.799021 2590 storage/replica_command.go:1242 [n1,s1,r2929/1:/Meta2/Table/131/9/9{67…-71…},@c42048c380] range lookup of meta key /Meta2/Table/131/9/9672/187153/221433941153021953 found only non-matching ranges: [{RangeID:3133 StartKey:/Table/131/9/9719/104698/221439738435862529 EndKey:/Table/131/9/9766/14763/221444563901022209 Replicas:[{NodeID:3 StoreID:3 ReplicaID:4} {NodeID:4 StoreID:4 ReplicaID:2} {NodeID:2 StoreID:2 ReplicaID:3}] NextReplicaID:5} {RangeID:3156 StartKey:/Table/131/9/9766/14763/221444563901022209 EndKey:/Table/131/9/9812/99811/221426605677543425 Replicas:[{NodeID:5 StoreID:5 ReplicaID:1} {NodeID:4 StoreID:4 ReplicaID:2} {NodeID:2 StoreID:2 ReplicaID:3}] NextReplicaID:4} {RangeID:3209 StartKey:/Table/131/9/9812/99811/221426605677543425 EndKey:/Table/131/9/9858/194819/221451503983099905 Replicas:[{NodeID:3 StoreID:3 ReplicaID:4} {NodeID:4 StoreID:4 ReplicaID:2} {NodeID:2 StoreID:2 ReplicaID:3}] NextReplicaID:5} {RangeID:3262 StartKey:/Table/131/9/9858/194819/221451503983099905 EndKey:/Table/131/9/9905/119904/221416298612654081 Replicas:[{NodeID:3 StoreID:3 ReplicaID:4} {NodeID:4 StoreID:4 ReplicaID:2} {NodeID:2 StoreID:2 ReplicaID:3}] NextReplicaID:5} {RangeID:3335 StartKey:/Table/131/9/9905/119904/221416298612654081 EndKey:/Table/131/9/9951/127438/221445233663541249 Replicas:[{NodeID:5 StoreID:5 ReplicaID:1} {NodeID:4 StoreID:4 ReplicaID:2} {NodeID:2 StoreID:2 ReplicaID:3}] NextReplicaID:4} {RangeID:3368 StartKey:/Table/131/9/9951/127438/221445233663541249 EndKey:/Max Replicas:[{NodeID:5 StoreID:5 ReplicaID:1} {NodeID:4 StoreID:4 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:4}] NextReplicaID:5}]
goroutine 2590 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xa00, 0xc42fa01a89, 0x2d75280, 0xc42276d040)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:837 +0xa7
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x2d768e0, 0xc400000004, 0x25bdca1, 0x1a, 0x4da, 0xc42373a700, 0x63d)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:718 +0x583
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x7fc639659928, 0xc428cc8ba0, 0x7fc600000004, 0x2, 0x1d0fa4a, 0x3f, 0xc423734c00, 0x2, 0x2)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:146 +0x27b
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x7fc639659928, 0xc428cc8ba0, 0x1, 0xc400000004, 0x1d0fa4a, 0x3f, 0xc423734c00, 0x2, 0x2)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:67 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x7fc639659928, 0xc428cc8ba0, 0x1d0fa4a, 0x3f, 0xc423734c00, 0x2, 0x2)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:151 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.evalRangeLookup(0x7fc639659928, 0xc428cc8ba0, 0x7fc63962b390, 0xc4200a8dc0, 0xc42048c380, 0xc421aec960, 0x14b1f26001dc089f, 0x0, 0x100000001, 0x1, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_command.go:1242 +0xdd3
github.com/cockroachdb/cockroach/pkg/storage.executeCmd(0x7fc639659928, 0xc428cc8ba0, 0x0, 0x0, 0x0, 0x7fc63962b390, 0xc4200a8dc0, 0xc42048c380, 0xc421aec960, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_command.go:187 +0x301
github.com/cockroachdb/cockroach/pkg/storage.executeBatch(0x7fc639659928, 0xc428cc8ba0, 0x0, 0x0, 0x7fc63962b390, 0xc4200a8dc0, 0xc42048c380, 0xc421aec960, 0x0, 0x14b1f26001dc089f, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:4372 +0x43c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).addReadOnlyCmd(0xc42048c380, 0x7fc639659928, 0xc428cc8ba0, 0x14b1f26001dc089f, 0x0, 0x100000001, 0x1, 0xb71, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:2059 +0x2c2
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).Send(0xc42048c380, 0x7fc639659928, 0xc428cc8ba0, 0x14b1f26001dc089f, 0x0, 0x100000001, 0x1, 0xb71, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:1467 +0x482
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Send(0xc420300a80, 0x7fc639659928, 0xc428cc8b10, 0x14b1f26001dc089f, 0x0, 0x100000001, 0x1, 0xb71, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2620 +0x645
github.com/cockroachdb/cockroach/pkg/storage.(*Stores).Send(0xc4204d0000, 0x7fc639659928, 0xc428cc8a20, 0x0, 0x0, 0x100000001, 0x1, 0xb71, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/stores.go:187 +0x1cf
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:832 +0x18c
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc4203e3900, 0xc423737830, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:263 +0x105
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc420084000, 0x7fc639659928, 0xc428cc89c0, 0xc4204b2460, 0xc428cc89c0, 0xc42496fad8, 0x69b413)
/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:843 +0x20a
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc420084000, 0x7fc639659928, 0xc428cc89c0, 0xc4204b2460, 0xc420084000, 0xc421b5ac08, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:860 +0x99
github.com/cockroachdb/cockroach/pkg/roachpb._Internal_Batch_Handler(0x1c85ee0, 0xc420084000, 0x7fc639659928, 0xc428cc88d0, 0xc4204b23f0, 0x0, 0x0, 0x0, 0xcbcf96, 0x7fc63971d768)
/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:1867 +0x28d
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc420454a80, 0x2947a00, 0xc42045ba40, 0xc427e61200, 0xc4204ca540, 0x2905290, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:738 +0xaa0
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).handleStream(0xc420454a80, 0x2947a00, 0xc42045ba40, 0xc427e61200, 0x0)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:932 +0x1339
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc420eb0770, 0xc420454a80, 0x2947a00, 0xc42045ba40, 0xc427e61200)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:497 +0xa9
created by github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:498 +0xa1
Logs from the crashing nodes:
@tamird Can you take a look given your familiarity with the RangeLookup code path?
@cuongdo has this recurred?
Not yet, but I haven't run as much load recently. Will report back later this week.
Any news on this? Closing for now since there's not much to go on, but please reopen.
The primary cause seems to be the following error, which has killed 4 of the 6 nodes:
Shutting down all workers and restarting them loses nodes over time to the same error. Thanks to @a-robinson for helping me triage this. cc @petermattis, @tamird, and @andreimatei, if you're interested in looking at this. I have saved the logs, but would like to wipe the cluster soon (this is blocking me on actually running YCSB), so please speak up if you want the cluster preserved.