data backed up on each node to: /mnt/data/backup.6991
the output of cockroach debug raft-log /mnt/data 1 is available at:
ubuntu@ec2-54-84-64-199.compute-1.amazonaws.com:raftlog.1
blast. we don't seem to be getting into a stable enough state to actually apply the zone config change.
ok, I added swap on each machine and the snapshot for range 1 went through. sql is usable again (including zone commands)
Looks like you picked the wrong node to run debug raft-log on. It's tiny on that node, but huge on two of the others. The node with the tiny log runs from position 5706034 to 5706784; the others have logs starting at 6433973. So this is a case of a range being removed from a node, not being GC'd, then being re-added to that node later.
Did that node have an extended period of downtime prior to this? I think this is just a case of the raft logs growing without bound while a node is down and there is no healthy node to repair onto.
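To spell out that failure mode, here's a minimal sketch (made-up names, not the actual raft_log_queue code): truncation can't move past what a lagging replica still needs, so a down replica with nowhere to rebalance onto pins the log, and the log on the healthy nodes grows without bound.

```go
package main

import "fmt"

// Sketch only: illustrates why a down replica pins raft log truncation.
// Names here are hypothetical, not CockroachDB's raft_log_queue.
type replicaProgress struct {
	replicaID  int
	matchIndex uint64 // highest log index known to be on this replica
}

// computeTruncateIndex: we may only truncate entries that every existing
// replica already has; otherwise catching up a lagging replica would need a
// snapshot. A dead replica's matchIndex never advances, so until it is
// removed (and GC'd) it holds truncation in place.
func computeTruncateIndex(committed uint64, progress []replicaProgress) uint64 {
	truncate := committed
	for _, p := range progress {
		if p.matchIndex < truncate {
			truncate = p.matchIndex
		}
	}
	return truncate
}

func main() {
	progress := []replicaProgress{
		{replicaID: 1, matchIndex: 5000},
		{replicaID: 2, matchIndex: 4990},
		{replicaID: 3, matchIndex: 120}, // down since index 120
	}
	// Truncation stays pinned at 120 while the healthy replicas keep
	// appending, so their logs grow without bound.
	fmt.Println(computeTruncateIndex(4990, progress))
}
```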
One node in the registration cluster died:
W160625 05:51:47.105941 storage/raft_log_queue.go:116 storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (25269) for range 803
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
SIGABRT: abort
PC=0x7f0df6810cc9 m=7
signal arrived during cgo execution
goroutine 57 [syscall, locked to thread]:
runtime.cgocall(0x11b2d80, 0xc84c00c308, 0x7f0d00000000)
/usr/local/go/src/runtime/cgocall.go:123 +0x11b fp=0xc84c00c2b0 sp=0xc84c00c280
github.com/cockroachdb/cockroach/storage/engine._Cfunc_DBApplyBatchRepr(0x7f0de30dc570, 0xc8884f8000, 0x43f, 0x0, 0x0)
??:0 +0x53 fp=0xc84c00c308 sp=0xc84c00c2b0
Last runtime stats were:
I160625 05:51:40.005810 server/status/runtime.go:160 runtime stats: 6.8 GiB RSS, 344 goroutines, 3.3 GiB active, 66315.43cgo/sec, 0.75/0.24 %(u/s)time, 0.00 %gc (0x)
I think the machines have ~7gig of ram, so that points to an issue here. Some of the other nodes are similarly high:
ubuntu@ip-172-31-8-73:~$ free -m -h
total used free shared buffers cached
Mem: 7.3G 6.9G 450M 32K 57M 535M
with cockroach reporting just shy of 7gb RSS.
The cluster should still be working with a node down, but it clearly isn't. The UI isn't accessible from the outside, so poking around that way is a bit awkward. In any case, some raft logs are pretty long (I tried range 1, which didn't exist, and then range 2 gave the following):
ubuntu@ip-172-31-3-145:~$ sudo ./cockroach debug raft-log /mnt/data/ 2 | grep Index: | wc -l
11024
The version running is Date: Mon May 30 14:59:10 2016 -0400. I think it's missed out on a lot of recent goodness.
Some more random tidbits from one of the nodes:
ubuntu@ip-172-31-8-73:~$ curl -k https://localhost:8080/_status/ranges/local | grep raft_state | sort | uniq -c
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3844k 0 3844k 0 0 6541k 0 --:--:-- --:--:-- --:--:-- 6538k
756 "raft_state": "StateCandidate"
520 "raft_state": "StateFollower"
1443 "raft_state": "StateLeader"
ubuntu@ip-172-31-8-73:~$ curl -k https://localhost:8080/debug/stopper
0xc8203472d0: 380 tasks
6 storage/replica.go:602
367 server/node.go:803 // <- oscillates +-200
3 storage/queue.go:383
2 storage/intent_resolver.go:302
1 ts/db.go:100
Rudimentary elinks-based poking at the debug/requests endpoint shows... well, just a lot of NotLeaderErrors (without a new lease holder hint). We need a way to access the admin port to make this debugging less painful.
In light of all the bugs we've fixed since May 30, I also think we should update that cluster to a newer version ASAP. I'm not sure what our protocol is wrt this cluster - can I simply do that?
I don't know what our protocol is either, but I would lean toward yes.
Ok. I'll pull a backup off the dataset and run last night's beta.
One node died a few minutes in with OOM, presumably due to snapshotting.
I160701 14:45:05.143248 storage/replica_raftstorage.go:524 generated snapshot for range 403 at index
3533112 in 31.747833551s. encoded size=1072891308, 6966 KV pairs, 1677671 log entries
fatal error: runtime: out of memory
That raft log is ginormous. Why do we send the full raft log on snapshots?
I think this is probably a huge Raft log that was created prior to our truncation improvements, but which was picked up by the replication queue before the truncation queue. Maybe we should put a failsafe into snapshot creation (so that any snapshot which exceeds a certain size isn't even fully created)?
Seems easier to only send the necessary tail of the raft log in a snapshot. For a snapshot, I think we only need to send the entries past the applied index, which should be very few. Ah, strike that - now I recall that raft log truncation is itself a raft operation.
Ok, I think adding a failsafe to avoid creating excessively large snapshots is reasonable. I'll file an issue.
It died again at the same range. I think that failsafe is worthwhile - it would give the truncation queue a chance to pick the range up first. The failsafe could even aggressively queue the truncation.
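For concreteness, the failsafe could look roughly like this - a sketch with made-up names and an arbitrary threshold, not the actual snapshot code:

```go
package storage // sketch only; not the real cockroach storage package

import "errors"

// Arbitrary cap on how large a generated snapshot may get before we give up
// and let the raft log truncation queue shrink the log first.
const maxSnapshotBytes = 128 << 20 // 128 MiB, illustrative

var errSnapshotTooLarge = errors.New("snapshot too large; waiting for raft log truncation")

// maybeGenerateSnapshot refuses to build an excessively large snapshot and
// instead (aggressively) queues a truncation, so the next attempt is cheap.
// queueTruncation is a stand-in for handing the range to the truncation queue.
func maybeGenerateSnapshot(rangeID int64, estimatedBytes int64, queueTruncation func(rangeID int64)) error {
	if estimatedBytes > maxSnapshotBytes {
		queueTruncation(rangeID)
		return errSnapshotTooLarge
	}
	// ... otherwise generate and send the snapshot as before ...
	return nil
}
```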
I'm a bit out of ideas as to how to proceed right now. In an ideal world, I could restart the cluster with upreplication turned off, and wait for the truncation queue to do its job.
Instead, I'm periodically running for host in $(cat ~/production/hosts/register); do ssh $host supervisorctl -c supervisord.conf start cockroach; done in the hope that the truncation queue will at some point manage to get there first.
Is this happening on just one node? Are all ranges fully replicated to other nodes? Can you simply nuke this one node?
There's very little visibility since I can't access the admin ui from outside. Anyone have experience setting up an ssh-tunnel-proxy?
If one node tries to send that snapshot, chances are it's the same on the other nodes or underreplicated. In both cases, nuking the first node won't help. I also think I saw two nodes die already.
If you're running insecure you can do: ssh -L 8080:localhost:8080 <some-machine>. I did this earlier today without difficulty.
It's a secure cluster. I'll give it a try though.
Should still work with a secure cluster.
It simply works, great. Thanks @petermattis. Would you mind running for host in $(cat ~/production/hosts/register); do ssh $host supervisorctl -c supervisord.conf start cockroach; done in a busy loop? Can't hurt and I'm about to go on the train.
Where is this ~/production/hosts/register file?
It's my local clone of our non-public production repo.
Got it.
I realized that I hadn't actually managed to run the updated version because supervisord needed to reload the config. I did that now, but the cluster is even more unhappy than before - the first range isn't being gossiped.
Restarted one of the nodes. Magically that seems to have brought the first range back in the game. Snapshot sending time.
Still in critical state, though. Had to restart one of the nodes again (to resuscitate first range gossip).
Sometimes things are relatively quiet, then there are large swaths of:
W160701 17:58:28.362804 raft/raft.go:593 [group 1967] 4 stepped down to follower since quorum is not active
W160701 17:58:29.131940 raft/raft.go:593 [group 1178] 4 stepped down to follower since quorum is not active
W160701 17:58:29.134338 raft/raft.go:593 [group 533] 4 stepped down to follower since quorum is not active
W160701 17:58:29.272449 raft/raft.go:593 [group 3162] 4 stepped down to follower since quorum is not active
W160701 17:58:29.274708 raft/raft.go:593 [group 711] 4 stepped down to follower since quorum is not active
W160701 17:58:29.276705 raft/raft.go:593 [group 1510] 7 stepped down to follower since quorum is not active
W160701 17:58:29.281201 raft/raft.go:593 [group 4096] 4 stepped down to follower since quorum is not active
W160701 17:58:29.281295 raft/raft.go:593 [group 1169] 4 stepped down to follower since quorum is not active
W160701 17:58:29.286367 raft/raft.go:593 [group 593] 4 stepped down to follower since quorum is not active
W160701 17:58:29.377702 raft/raft.go:593 [group 773] 4 stepped down to follower since quorum is not active
W160701 17:58:29.380577 raft/raft.go:593 [group 4637] 4 stepped down to follower since quorum is not active
W160701 17:58:29.471898 raft/raft.go:593 [group 634] 4 stepped down to follower since quorum is not active
W160701 17:58:29.567202 raft/raft.go:593 [group 1375] 7 stepped down to follower since quorum is not active
W160701 17:58:30.137139 raft/raft.go:593 [group 3516] 4 stepped down to follower since quorum is not active
I think those might have to do with us reporting "unreachable" to Raft every time the outgoing message queue is full (cc @tamird). Too bad I'm not running with the per-replica outboxes yet.
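The pattern I mean is roughly this (a sketch; ReportUnreachable is real etcd/raft API, but the outbox plumbing here is simplified and hypothetical):

```go
package storage // sketch only

import (
	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// trySend sketches a bounded, non-blocking outgoing message queue. When the
// buffer is full we drop the message and tell raft the peer is unreachable;
// done for many groups at once, the leaders back off and it can look like
// "quorum is not active" even though the peers are fine.
func trySend(outbox chan raftpb.Message, msg raftpb.Message, rn *raft.RawNode) {
	select {
	case outbox <- msg:
		// picked up later by the transport goroutine
	default:
		rn.ReportUnreachable(msg.To) // etcd/raft reacts as if the peer were down
	}
}
```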
If you're still running that loop, now's the time to stop it.
Pretty busted still. One node appears to be deadlocked. See my post in #3299 about the lack of visibility into the command queue. It's not currently possible to tell what's blocking this command.
goroutine 293158 [semacquire, 16 minutes]:
sync.runtime_Semacquire(0xc84af77f0c)
/usr/local/go/src/runtime/sema.go:47 +0x26
sync.(*WaitGroup).Wait(0xc84af77f00)
/usr/local/go/src/sync/waitgroup.go:127 +0xb4
github.com/cockroachdb/cockroach/storage.(*Replica).beginCmds(0xc820371c20, 0xc820c1d490, 0x1a87e90)
/go/src/github.com/cockroachdb/cockroach/storage/replica.go:894 +0x3f6
github.com/cockroachdb/cockroach/storage.(*Replica).addWriteCmd(0xc820371c20, 0x7fef3dbcb6b8, 0xc8237702a0, 0x145d3f3921736fe0, 0x0, 0x0, 0x0, 0x407, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/storage/replica.go:1172 +0x13f
github.com/cockroachdb/cockroach/storage.(*Replica).Send(0xc820371c20, 0x7fef3dbcb6b8, 0xc8237702a0, 0x145d3f3921736fe0, 0x0, 0x0, 0x0, 0x407, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/storage/replica.go:801 +0x215
github.com/cockroachdb/cockroach/storage.(*Replica).CheckConsistency(0xc820371c20, 0xc82378df80, 0x25, 0x30, 0xc82378dfb0, 0x25, 0x30, 0x1, 0xc8203885a0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/storage/replica_command.go:1617 +0x3d0
github.com/cockroachdb/cockroach/storage.(*Replica).addAdminCmd(0xc820371c20, 0x7fef3dbcb6b8, 0xc823770120, 0x145d3f3921722af1, 0x0, 0x300000003, 0x6, 0x407, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/storage/replica.go:1076 +0x706
github.com/cockroachdb/cockroach/storage.(*Replica).Send(0xc820371c20, 0x7fef3dbcb6b8, 0xc823770120, 0x145d3f3921722af1, 0x0, 0x300000003, 0x6, 0x407, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/storage/replica.go:807 +0x5b3
github.com/cockroachdb/cockroach/storage.(*Store).Send(0xc8204a66c0, 0x7fef3dbcb6b8, 0xc823770120, 0x145d3f3921722af1, 0x0, 0x300000003, 0x6, 0x407, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/storage/store.go:1915 +0xd4c
github.com/cockroachdb/cockroach/storage.(*Stores).Send(0xc820379bc0, 0x7fef3dbcb6b8, 0xc8237700c0, 0x0, 0x0, 0x300000003, 0x6, 0x407, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/storage/stores.go:178 +0x4ff
github.com/cockroachdb/cockroach/server.(*Node).Batch.func3()
/go/src/github.com/cockroachdb/cockroach/server/node.go:795 +0x559
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunTask(0xc8202f3a40, 0xc821375858, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:169 +0x129
github.com/cockroachdb/cockroach/server.(*Node).Batch(0xc820193dc0, 0x7fef3dbcb6b8, 0xc82378d860, 0xc820c1c4d0, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/server/node.go:807 +0x330
github.com/cockroachdb/cockroach/roachpb._Internal_Batch_Handler(0x1a25f40, 0xc820193dc0, 0x7fef3dbcb6b8, 0xc82378d860, 0xc83bcac880, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/roachpb/api.pb.go:1525 +0x168
google.golang.org/grpc.(*Server).processUnaryRPC(0xc8203617a0, 0x7fef3db8cda0, 0xc8214b3e60, 0xc821525880, 0xc82037d280, 0x2654c20, 0x0, 0x0, 0x0)
/go/src/google.golang.org/grpc/server.go:530 +0xeb5
google.golang.org/grpc.(*Server).handleStream(0xc8203617a0, 0x7fef3db8cda0, 0xc8214b3e60, 0xc821525880, 0x0)
/go/src/google.golang.org/grpc/server.go:687 +0x109d
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc821000550, 0xc8203617a0, 0x7fef3db8cda0, 0xc8214b3e60, 0xc821525880)
/go/src/google.golang.org/grpc/server.go:352 +0xa0
created by google.golang.org/grpc.(*Server).serveStreams.func1
/go/src/google.golang.org/grpc/server.go:353 +0x9a
The Store is also deadlocked (not unlikely once a replica is).
goroutine 697996 [semacquire, 2 minutes]:
sync.runtime_Semacquire(0xc8204a68cc)
/usr/local/go/src/runtime/sema.go:47 +0x26
sync.(*Mutex).Lock(0xc8204a68c8)
/usr/local/go/src/sync/mutex.go:83 +0x1c4
github.com/cockroachdb/cockroach/storage.(*Store).Send(0xc8204a66c0, 0x7fef3dbcb6b8, 0xc84bd22120, 0x145d408cbe1853a0, 0x0, 0x300000003, 0x11, 0x1, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/storage/store.go:1890 +0x988
github.com/cockroachdb/cockroach/storage.(*Stores).Send(0xc820379bc0, 0x7fef3dbcb6b8, 0xc84bd220c0, 0x0, 0x0, 0x300000003, 0x11, 0x1, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/storage/stores.go:178 +0x4ff
github.com/cockroachdb/cockroach/server.(*Node).Batch.func3()
/go/src/github.com/cockroachdb/cockroach/server/node.go:795 +0x559
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunTask(0xc8202f3a40, 0xc820045868, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:169 +0x129
github.com/cockroachdb/cockroach/server.(*Node).Batch(0xc820193dc0, 0x7fef3dbcb6b8, 0xc83cdabe90, 0xc820cf76c0, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/server/node.go:807 +0x330
github.com/cockroachdb/cockroach/roachpb._Internal_Batch_Handler(0x1a25f40, 0xc820193dc0, 0x7fef3dbcb6b8, 0xc83cdabe90, 0xc830b707c0, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/roachpb/api.pb.go:1525 +0x168
google.golang.org/grpc.(*Server).processUnaryRPC(0xc8203617a0, 0x7fef3db8cda0, 0xc82b620870, 0xc8201270a0, 0xc82037d280, 0x2654c20, 0x0, 0x0, 0x0)
/go/src/google.golang.org/grpc/server.go:530 +0xeb5
google.golang.org/grpc.(*Server).handleStream(0xc8203617a0, 0x7fef3db8cda0, 0xc82b620870, 0xc8201270a0, 0x0)
/go/src/google.golang.org/grpc/server.go:687 +0x109d
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc84be5f9f0, 0xc8203617a0, 0x7fef3db8cda0, 0xc82b620870, 0xc8201270a0)
/go/src/google.golang.org/grpc/server.go:352 +0xa0
created by google.golang.org/grpc.(*Server).serveStreams.func1
/go/src/google.golang.org/grpc/server.go:353 +0x9a
Here's a gist of the goroutines of that node. https://gist.github.com/tschottdorf/e731c9a88bb90ba312dd7690d54da1c0
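For context on those two stacks: beginCmds makes a new command wait, via sync.WaitGroup, on every overlapping command already in flight, so a single wedged command (here the CheckConsistency-triggered write) wedges everything queued behind it, and the Store mutex backs up behind that. A rough sketch of that wait pattern (hypothetical names, not the real command queue):

```go
package storage // sketch only

import "sync"

// cmd represents an in-flight command; wg is signalled when it completes.
type cmd struct {
	span string // simplified stand-in for the command's key span
	wg   sync.WaitGroup
}

type commandQueue struct {
	mu      sync.Mutex
	pending []*cmd
}

// beginCmd blocks until all overlapping in-flight commands have finished.
// This is the sync.WaitGroup.Wait visible in the beginCmds stack frame: if
// any earlier overlapping command never completes, everything behind it
// waits forever.
func (q *commandQueue) beginCmd(span string) *cmd {
	q.mu.Lock()
	var waitFor []*cmd
	for _, prev := range q.pending {
		if prev.span == span { // the real code checks key-range overlap
			waitFor = append(waitFor, prev)
		}
	}
	c := &cmd{span: span}
	c.wg.Add(1)
	q.pending = append(q.pending, c)
	q.mu.Unlock()

	for _, prev := range waitFor {
		prev.wg.Wait()
	}
	return c
}

// endCmd must be called when the command finishes; a command that never
// returns deadlocks all overlapping successors.
func (q *commandQueue) endCmd(c *cmd) {
	c.wg.Done()
}
```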
Seeing this one again too (cc @tamird):
goroutine 117324 [select, 6 minutes]:
github.com/cockroachdb/cockroach/storage.(*RaftTransport).RaftMessage(0xc8203e3710, 0x7f646de300c8, 0xc83ad2fa70, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/storage/raft_transport.go:149 +0x216
github.com/cockroachdb/cockroach/storage._MultiRaft_RaftMessage_Handler(0x194cbc0, 0xc8203e3710, 0x7f646dfc20f0, 0xc8653de800, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/storage/raft.pb.go:159 +0xd8
google.golang.org/grpc.(*Server).processStreamingRPC(0xc8203ce2d0, 0x7f646dfc1fa8, 0xc8217f2000, 0xc8653e61c0, 0xc8203ea900, 0x2650740, 0x0, 0x0, 0x0)
/go/src/google.golang.org/grpc/server.go:607 +0x47a
google.golang.org/grpc.(*Server).handleStream(0xc8203ce2d0, 0x7f646dfc1fa8, 0xc8217f2000, 0xc8653e61c0, 0x0)
/go/src/google.golang.org/grpc/server.go:691 +0x114e
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc821709ed0, 0xc8203ce2d0, 0x7f646dfc1fa8, 0xc8217f2000, 0xc8653e61c0)
/go/src/google.golang.org/grpc/server.go:352 +0xa0
created by google.golang.org/grpc.(*Server).serveStreams.func1
/go/src/google.golang.org/grpc/server.go:353 +0x9a
I restarted the cluster with a 10s raft tick interval (since we saw mutex contention on ticking the replicas when the raft log was very large). Deadlocking was still rampant, but now that I've significantly slowed down the consistency checker, I was able to get a sign of life back from the cluster (this is also after ~50 restarts and random investigations):
ubuntu@ip-172-31-8-73:~$ ./cockroach zone --ca-cert=certs/ca.crt --cert=certs/root.client.crt --key=certs/root.client.key ls
.default
The logs seem relatively quiet. Let's hope for the best.
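For the record, the contention motivating the 10s tick is roughly this shape (a sketch with made-up names): ticking every replica takes the same per-replica mutex that long-running raft log and snapshot work holds, so with huge logs each tick pass stalls.

```go
package storage // sketch only

import (
	"sync"
	"time"
)

type replica struct {
	mu sync.Mutex // also held by long-running log/snapshot operations
}

func (r *replica) tick() {
	r.mu.Lock()
	defer r.mu.Unlock()
	// advance raft election/heartbeat timers for this replica
}

// tickLoop ticks every replica once per interval. With thousands of replicas
// and a mutex held for seconds by raft-log-heavy work, a short interval means
// the loop spends all its time blocked; stretching it to 10s relieved that.
func tickLoop(replicas []*replica, interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			for _, r := range replicas {
				r.tick()
			}
		case <-stop:
			return
		}
	}
}
```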
Ok, we seem to be working (though slowly).
root@:26257> select * from registration.clusters limit 1;
+--------------------------------------------------------------------+----------------------------------------+-----------+----------+---------+-------+
| uuid | timestamp | firstName | lastName | company | email |
+--------------------------------------------------------------------+----------------------------------------+-----------+----------+---------+-------+
| "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" | 2016-03-29 20:04:52.283398 +0000 +0000 | | | | |
+--------------------------------------------------------------------+----------------------------------------+-----------+----------+---------+-------+
(1 row)
Ok, we're back to the living. A lot of material to investigate above.
Should mention that we're now running latest master (I wanted to pick up @tamird's outgoing-raft-queue changes).
Woot!
I'm pulling the backup I made earlier to my local machine so that we can spin up clusters with that mess in them (the machines themselves have very small hdds and no tools installed to upload directly to the cloud, and someone's got to make full use of that nice fast cable we have in the office now).
One node died with:
E160701 21:07:33.563958 internal/client/txn.go:364 failure aborting transaction: context deadline exceeded; abort caused by: context deadline exceeded
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x140 pc=0x9ee630]
goroutine 766754 [running]:
panic(0x18a5e40, 0xc82000e100)
/usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/cockroachdb/cockroach/internal/client.(*Txn).sendEndTxnReq(0x0, 0xc8cfa52f00, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:440 +0x50
github.com/cockroachdb/cockroach/internal/client.(*Txn).Rollback(0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:433 +0x3b
github.com/cockroachdb/cockroach/internal/client.(*Txn).CleanupOnError(0x0, 0x7f7e91681d98, 0xc859972210)
/go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:363 +0x92
github.com/cockroachdb/cockroach/sql.(*Executor).execRequest(0xc8201b4000, 0x7f7f680e4c58, 0xc84eed71c0, 0xc8239ee800, 0xc873bbfb89, 0x27, 0x0, 0x0, 0x0, 0xc82013a800)
/go/src/github.com/cockroachdb/cockroach/sql/executor.go:501 +0xd07
github.com/cockroachdb/cockroach/sql.(*Executor).ExecuteStatements(0xc8201b4000, 0x7f7f680e4c58, 0xc84eed71c0, 0xc8239ee800, 0xc873bbfb89, 0x27, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/sql/executor.go:360 +0xf6
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).executeStatements(0xc840f33800, 0x7f7f680e4c58, 0xc84eed71c0, 0xc873bbfb89, 0x27, 0x0, 0x0, 0x0, 0x0, 0x7f7f6811fd01, ...)
/go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:640 +0x98
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).handleSimpleQuery(0xc840f33800, 0x7f7f680e4c58, 0xc84eed71c0, 0xc840f33828, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:320 +0xe8
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).serve(0xc840f33800, 0xc874c1bb20, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:275 +0x100c
github.com/cockroachdb/cockroach/sql/pgwire.(*Server).ServeConn(0xc8201ee2d0, 0x7f7f680e4d90, 0xc8666e8300, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/sql/pgwire/server.go:229 +0x98f
github.com/cockroachdb/cockroach/server.(*Server).Start.func8.1(0x7f7f680e4d30, 0xc853304420)
/go/src/github.com/cockroachdb/cockroach/server/server.go:369 +0x42
github.com/cockroachdb/cockroach/util/netutil.(*Server).ServeWith.func1(0xc8200d0218, 0x7f7f680e4d30, 0xc853304420, 0xc82000e420)
/go/src/github.com/cockroachdb/cockroach/util/netutil/net.go:131 +0x62
created by github.com/cockroachdb/cockroach/util/netutil.(*Server).ServeWith
/go/src/github.com/cockroachdb/cockroach/util/netutil/net.go:133 +0x333
The above NPE actually happened on two of the nodes.
On restart, unfortunately, here's more serious stuff. Looks like an incoming Raft message triggered lazy creation of a raft group which promptly panicked. The range [10,10) suggests that we did have a truncated state in place, but commit 0 suggests that maybe there wasn't a HardState (?) cc @bdarnell
E160701 21:21:45.036857 raft/raft.go:925 [group 5161] 3 state.commit 0 is out of range [10, 10]
panic: [group 5161] 3 state.commit 0 is out of range [10, 10]
goroutine 174 [running]:
panic(0x161d900, 0xc820a2c780)
/usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/cockroachdb/cockroach/storage.(*raftLogger).Panicf(0xc820a2c5f0, 0x1be60a0, 0x2b, 0xc8222f3c80, 0x4, 0x4)
/go/src/github.com/cockroachdb/cockroach/storage/raft.go:117 +0x1ba
github.com/coreos/etcd/raft.(*raft).loadState(0xc8249ed520, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/coreos/etcd/raft/raft.go:925 +0x2a2
github.com/coreos/etcd/raft.newRaft(0xc824583c10, 0x10)
/go/src/github.com/coreos/etcd/raft/raft.go:225 +0x8ff
github.com/coreos/etcd/raft.NewRawNode(0xc824583c10, 0x0, 0x0, 0x0, 0xc820f073d0, 0x0, 0x0)
/go/src/github.com/coreos/etcd/raft/rawnode.go:76 +0xbf
github.com/cockroachdb/cockroach/storage.(*Replica).withRaftGroupLocked(0xc820f1fe00, 0xc824583db0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/storage/replica.go:264 +0x1ca
github.com/cockroachdb/cockroach/storage.(*Replica).withRaftGroup(0xc820f1fe00, 0xc824583db0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/storage/replica.go:302 +0x8e
github.com/cockroachdb/cockroach/storage.(*Store).handleRaftMessage(0xc82007eb40, 0xc824be4ea0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/storage/store.go:2098 +0x678
github.com/cockroachdb/cockroach/storage.(*Store).(github.com/cockroachdb/cockroach/storage.handleRaftMessage)-fm(0xc824be4ea0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/storage/store.go:966 +0x38
github.com/cockroachdb/cockroach/storage.(*RaftTransport).RaftMessage.func1.1.1(0x7fefe4e5f6b8, 0xc820f07fb0, 0xc8203b34d0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/storage/raft_transport.go:139 +0x259
github.com/cockroachdb/cockroach/storage.(*RaftTransport).RaftMessage.func1.1()
/go/src/github.com/cockroachdb/cockroach/storage/raft_transport.go:143 +0x48
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker.func1(0xc8203379d0, 0xc8206e8cf0)
/go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:142 +0x52
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker
/go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:143 +0x62
Yep,
ubuntu@ip-172-31-3-145:~$ sudo ./cockroach debug check-store /mnt/data
range 5161: truncated index 10 should equal first index 0 - 1
range 5161: applied index 10 should be between first index 0 and last index 0
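For reference, the panic comes from etcd/raft's sanity check when loading the HardState: the stored commit index has to fall inside what the locally reconstructed log can account for. Roughly (a paraphrase of the invariant, not the actual etcd code):

```go
package main

import "fmt"

// checkLoadState paraphrases the etcd/raft loadState invariant: the restored
// HardState.Commit must lie in [committed, lastIndex] of the reconstructed
// log. With a truncated state at index 10 and an empty log, that window is
// [10, 10]; a HardState claiming commit 0 is out of range, hence the panic.
func checkLoadState(hardStateCommit, logCommitted, logLastIndex uint64) error {
	if hardStateCommit < logCommitted || hardStateCommit > logLastIndex {
		return fmt.Errorf("state.commit %d is out of range [%d, %d]",
			hardStateCommit, logCommitted, logLastIndex)
	}
	return nil
}

func main() {
	fmt.Println(checkLoadState(0, 10, 10)) // reproduces the error seen for range 5161
}
```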
Maybe I messed something up in https://github.com/cockroachdb/cockroach/pull/7429?
The actual Raft log for this range is empty:
sudo ./cockroach debug raft-log /mnt/data 5161
but there is a range descriptor:
1467406975.668651933,0 /Local/Range/"\x04tsd\x12cr.node.exec.latency-1h-p99.99\x00\x01\x89\xf8\x068=2"/RangeDescriptor:
[/System/"tsd\x12cr.node.exec.latency-1h-p99.99\x00\x01\x89\xf8\x068=2", /System/"tsd\x12cr.node.exec.latency-1h-p99.999\x00\x01\x89\xf8\x06/l1")
Raw:range_id:5161 start_key:"\004tsd\022cr.node.exec.latency-1h-p99.99\000\001\211\370\0068=2" end_key:"\004tsd\022cr.node.exec.latency-1h-p99.999\000\001\211\370\006/l1" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:4 store_id:4 replica_id:3 > next_replica_id:4
So the migration code in #7429, on startup, would load the replica, see that it is "initialized", and create a truncated state (but not the remaining initial raft state).
We have the following keys:
/Local/RangeID/5161/r/AbortCache/"19a092cf-6bb6-499a-9d7a-57a20e525db5": key:"\001k\022\004tsd\022cr.node.exec.latency-1h-p99.99\000\377\001\211\370\0067\2114\000\001rdsc" timestamp:<wall_time:1466758804316100802 logical:0 > priority:923908
/Local/RangeID/5161/r/AbortCache/"3f9809c6-b7bc-423a-a0ff-fa3343f0a952": key:"\001k\022\004tsd\022cr.node.exec.latency-1h-p99.99\000\377\001\211\370\006762\000\001rdsc" timestamp:<wall_time:1466460002553358297 logical:0 > priority:999376
/Local/RangeID/5161/r/AbortCache/"52a29fd4-a8c8-45d6-980f-d49d0fa84db5": key:"\001k\022\004tsd\022cr.node.exec.latency-1h-p99.99\000\377\001\211\370\006874\000\001rdsc" timestamp:<wall_time:1467385706031819394 logical:0 > priority:369863
/Local/RangeID/5161/r/AbortCache/"750d237f-3734-42c9-b573-aa8bb5e6a2be": key:"\001k\022\004tsd\022cr.node.exec.latency-1h-p99.99\000\377\001\211\370\006864\000\001rdsc" timestamp:<wall_time:1467384091556108317 logical:0 > priority:1150559
/Local/RangeID/5161/r/AbortCache/"84b3f36b-f94c-46e5-8cc4-63658804db77": key:"\001k\022\004tsd\022cr.node.exec.latency-1h-p99.99\000\377\001\211\370\006864\000\001rdsc" timestamp:<wall_time:1467384081582788062 logical:0 > priority:1150559
/Local/RangeID/5161/r/AbortCache/"e475ef01-c141-4474-bbca-e2895d4ba4b6": key:"\001k\022\004tsd\022cr.node.exec.latency-1h-p99.99\000\377\001\211\370\006864\000\001rdsc" timestamp:<wall_time:1467384085166927868 logical:0 > priority:1150559
/Local/RangeID/5161/r/AbortCache/"f04fdc11-3a22-4c0e-88f9-0e665a90077d": key:"\001k\022\004tsd\022cr.node.exec.latency-1h-p99.99\000\377\001\211\370\006864\000\001rdsc" timestamp:<wall_time:1467384176485150309 logical:1 > priority:1150560
/Local/RangeID/5161/r"fzn-": false
/Local/RangeID/5161/r"lgc-": Type:EntryNormal Term:0 Index:0 : EMPTY
/Local/RangeID/5161/r/RaftAppliedIndex: 10
/Local/RangeID/5161/r/RaftTruncatedState: index:10 term:5
/Local/RangeID/5161/r/LeaseAppliedIndex: 0
/Local/RangeID/5161/r/RangeStats: last_update_nanos:1467406983294497855 intent_age:0 gc_bytes_age:0 live_bytes:68436218 live_count:-6165 key_bytes:-269756 key_count:-6165 val_bytes:68705974 val_count:-6165 intent_bytes:0 intent_count:0 sys_bytes:3818 sys_count:16
/Local/RangeID/5161/u/RaftHardState: term:6 vote:0 commit:0
/Local/RangeID/5161/u/RaftLastIndex: 10
/Local/RangeID/5161/u/RangeLastReplicaGCTimestamp: 1467400220.102689870,0
/Local/RangeID/5161/u/RangeLastVerificationTimestamp: 1464276110.450528716,0
and in fact everything is there except for a HardState. We definitely write a HardState in writeInitialState (i.e. when we create the new right-hand side of a split). When we apply a snapshot, though, we don't write a HardState - the HardState is usually written in the same invocation of handleRaftReady, but not in the same RocksDB batch - maybe the process crashed between application of the snapshot and the writing of the HardState, and we're vulnerable here?
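If that's what happened, the fix direction would be to persist a HardState that covers the snapshot in the same RocksDB batch that applies it, instead of relying on a separate write later in handleRaftReady. A rough sketch of the idea (the batch type and key handling here are stand-ins, not the real engine API):

```go
package storage // sketch only

import "github.com/coreos/etcd/raft/raftpb"

// batch is a stand-in for a write batch that commits atomically.
type batch interface {
	Put(key, value []byte) error
	Commit() error
}

// applySnapshotWithHardState writes the snapshot data and a HardState that
// accounts for it in one atomic batch, so a crash cannot leave an
// "initialized" replica (truncated state, applied index) without a matching
// HardState on disk.
func applySnapshotWithHardState(b batch, snap raftpb.Snapshot, prev raftpb.HardState, hardStateKey []byte) error {
	// ... write the snapshot's KV data and range-local state into b ...

	hs := prev
	if hs.Commit < snap.Metadata.Index {
		hs.Commit = snap.Metadata.Index // must cover everything the snapshot applies
	}
	if hs.Term < snap.Metadata.Term {
		hs.Term = snap.Metadata.Term
	}
	val, err := hs.Marshal()
	if err != nil {
		return err
	}
	if err := b.Put(hardStateKey, val); err != nil {
		return err
	}
	// Snapshot contents and HardState become durable together or not at all.
	return b.Commit()
}
```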
Not sure the above all checks out, but maybe it does - the range again looks like one of those odd ranges which potentially only contains one key and thus never saw a new write after it initially split off.
cc @bdarnell
Not much to add here, but the RaftTransport-related stack trace you posted (goroutine 117324 [select, 6 minutes]:) is expected; that's one of the two goroutines that live on the remote side of a streaming RPC.
Ah, good point. I think I fell for that before.
I'm preparing a change to deal with the HardState appropriately. Appreciate feedback on the above.
I brought back the registration cluster with a newly initialized cluster so that new data can be recorded.
The prior data has been moved to /mnt/data/backup-20160705 on each machine. I wasn't able to reliably bring up any subset of the cluster to perform a SQL dump, so recovery of the existing data will need to wait until #7598 goes in.
Re-bootstrapping the cluster took longer than expected. The supervisor.conf file doesn't work for the first run. Every node kept trying to create a gossip connection to every other node while not making any progress on bootstrapping. The solution was to omit the --join parameter from the cockroach command line for the first node. After that, every other node successfully joined the cluster.
Also, I needed to remove the 10s raft tick interval @tschottdorf put in to try to revive the "old" cluster.
via @tschottdorf: raw data is preserved in s3, but has been dumped and imported into the new cluster.
@bdarnell: I'll be keeping track of actions and results here.
Quick summary: the registration cluster is falling over repeatedly due to large snapshot sizes. Specifically, recipients of range 1 snapshots OOM during applySnapshot. E.g., on node 2 (ec2-52-91-3-164.compute-1.amazonaws.com): there is no corresponding "applied snapshot for range 1" message, and the stack trace does list an applySnapshot entry. Can't confirm from the trace that it is for that range (the range ID is not one of the simple arguments), but it most likely is. A similar pattern appeared multiple times. I will perform the following to try to resurrect the cluster: