cockroachdb / cockroach

stability: resurrecting registration cluster #6991

Closed mberhault closed 8 years ago

mberhault commented 8 years ago

@bdarnell: I'll be keeping track of actions and results here.

Quick summary: the registration cluster is falling over repeatedly due to large snapshot sizes. Specifically, recipients of range 1 snapshots OOM during applySnapshot.

For example, on node 2 (ec2-52-91-3-164.compute-1.amazonaws.com):

I160601 16:24:14.957105 storage/replica_raftstorage.go:610  received snapshot for range 1 at index 6818966. encoded size=1204695315, 14475 KV pairs, 384994 log entries
...
I160601 16:24:25.479455 /go/src/google.golang.org/grpc/clientconn.go:499  grpc: Conn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 172.31.14.204:26257: getsockopt: connection refused"; Reconnecting to "ip-172-31-14-204:26257"
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
SIGABRT: abort
PC=0x7f78fae4acc9 m=7
signal arrived during cgo execution

goroutine 86 [syscall, locked to thread]:
runtime.cgocall(0x11b2d80, 0xc822780308, 0x7f7800000000)
        /usr/local/go/src/runtime/cgocall.go:123 +0x11b fp=0xc8227802b0 sp=0xc822780280
github.com/cockroachdb/cockroach/storage/engine._Cfunc_DBApplyBatchRepr(0x7f78ee825b90, 0xc917dce000, 0x208f, 0x0, 0x0)
        ??:0 +0x53 fp=0xc822780308 sp=0xc8227802b0
github.com/cockroachdb/cockroach/storage/engine.dbApplyBatchRepr(0x7f78ee825b90, 0xc917dce000, 0x208f, 0x4000000, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:990 +0x138 fp=0xc8227803a0 sp=0xc822780308
github.com/cockroachdb/cockroach/storage/engine.(*rocksDBBatch).ApplyBatchRepr(0xc822348000, 0xc917dce000, 0x208f, 0x4000000, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:578 +0x4f fp=0xc8227803d8 sp=0xc8227803a0
github.com/cockroachdb/cockroach/storage/engine.(*rocksDBBatch).flushMutations(0xc822348000)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:672 +0x146 fp=0xc822780468 sp=0xc8227803d8
github.com/cockroachdb/cockroach/storage/engine.(*rocksDBBatchIterator).Seek(0xc822348030, 0xc934324b80, 0x10, 0x20, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:498 +0x25 fp=0xc8227804a0 sp=0xc822780468
github.com/cockroachdb/cockroach/storage/engine.mvccGetMetadata(0x7f78fbaca080, 0xc822348030, 0xc934324b80, 0x10, 0x20, 0x0, 0x0, 0xc938fc6000, 0xc938fc6000, 0x11, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:684 +0xcf fp=0xc8227805d8 sp=0xc8227804a0
github.com/cockroachdb/cockroach/storage/engine.mvccPutInternal(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbaca120, 0xc822348000, 0x7f78fbaca080, 0xc822348030, 0x0, 0xc934324b80, 0x10, 0x20, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:1020 +0x1f9 fp=0xc822780b20 sp=0xc8227805d8
github.com/cockroachdb/cockroach/storage/engine.mvccPutUsingIter(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbaca120, 0xc822348000, 0x7f78fbaca080, 0xc822348030, 0x0, 0xc934324b80, 0x10, 0x20, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:988 +0x1bb fp=0xc822780bf8 sp=0xc822780b20
github.com/cockroachdb/cockroach/storage/engine.MVCCPut(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbac9f98, 0xc822348000, 0x0, 0xc934324b80, 0x10, 0x20, 0x0, 0xc800000000, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:926 +0x1bb fp=0xc822780cc8 sp=0xc822780bf8
github.com/cockroachdb/cockroach/storage/engine.MVCCPutProto(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbac9f98, 0xc822348000, 0x0, 0xc934324b80, 0x10, 0x20, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:549 +0x1c5 fp=0xc822780d98 sp=0xc822780cc8
github.com/cockroachdb/cockroach/storage.(*Replica).append(0xc8237fd9e0, 0x7f78fbac9f98, 0xc822348000, 0x0, 0xc8f810e000, 0x5dfe2, 0x5dfe2, 0x1626, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica_raftstorage.go:547 +0x20f fp=0xc822780ed8 sp=0xc822780d98
github.com/cockroachdb/cockroach/storage.(*Replica).applySnapshot(0xc8237fd9e0, 0x7f78fbac9f08, 0xc822348000, 0xc88f6b6000, 0x47ce3113, 0x47ce4000, 0xc821b06cc0, 0x4, 0x4, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/replica_raftstorage.go:670 +0xc06 fp=0xc8227812f8 sp=0xc822780ed8
github.com/cockroachdb/cockroach/storage.(*Replica).handleRaftReady(0xc8237fd9e0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica.go:1425 +0x29e fp=0xc822781bd0 sp=0xc8227812f8
github.com/cockroachdb/cockroach/storage.(*Store).processRaft.func1()
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:2055 +0x35d fp=0xc822781f60 sp=0xc822781bd0
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker.func1(0xc8202efce0, 0xc82206b240)
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:139 +0x52 fp=0xc822781f80 sp=0xc822781f60
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc822781f88 sp=0xc822781f80
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:140 +0x62

There is no corresponding "applied snapshot for range 1" message, and the stack trace does list an applySnapshot frame. I can't confirm from the trace that it is for that range (the range ID is not one of the simple arguments), but it most likely is. A similar pattern appeared multiple times.

I will perform the following to try to resurrect the cluster:

mberhault commented 8 years ago

data backed up on each node to: /mnt/data/backup.6991

mberhault commented 8 years ago

the output of cockroach debug raft-log /mnt/data 1 is available at: ubuntu@ec2-54-84-64-199.compute-1.amazonaws.com:raftlog.1

mberhault commented 8 years ago

Blast. We don't seem to be getting into a stable enough state to actually apply the zone config change.

mberhault commented 8 years ago

OK, I added swap on each machine and the snapshot for range 1 went through. SQL is usable again (including zone commands).

bdarnell commented 8 years ago

Looks like you picked the wrong node to run debug raft-log on. It's tiny on that node, but huge on two of the others. The node with the tiny log runs from position 5706034 to 5706784; the others have logs starting at 6433973. So this is a case of a range being removed from a node, not being GC'd, then being re-added to that node later.

Did that node have an extended period of downtime prior to this? I think this is just a case of the raft logs growing without bound while a node is down and there is no healthy node to repair onto.
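
To make the failure mode concrete, here is a toy sketch (not the actual raft_log_queue logic; all names are made up) of a truncation heuristic that can only advance as far as the slowest replica it still considers part of the range, which is why a down node that is never repaired away lets the log grow without bound:

package storage

// truncatableIndex illustrates why the log can grow without bound: truncation
// is held back to the minimum acknowledged progress of every replica still in
// the range, so a down replica that is never repaired away pins the log at its
// last acknowledged index.
func truncatableIndex(appliedIndex uint64, replicaMatchIndex map[uint64]uint64) uint64 {
	idx := appliedIndex
	for _, match := range replicaMatchIndex {
		if match < idx {
			idx = match
		}
	}
	return idx
}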

tbg commented 8 years ago

One node in the registration cluster died:

W160625 05:51:47.105941 storage/raft_log_queue.go:116  storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (25269) for range 803
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
SIGABRT: abort
PC=0x7f0df6810cc9 m=7
signal arrived during cgo execution

goroutine 57 [syscall, locked to thread]:
runtime.cgocall(0x11b2d80, 0xc84c00c308, 0x7f0d00000000)
        /usr/local/go/src/runtime/cgocall.go:123 +0x11b fp=0xc84c00c2b0 sp=0xc84c00c280
github.com/cockroachdb/cockroach/storage/engine._Cfunc_DBApplyBatchRepr(0x7f0de30dc570, 0xc8884f8000, 0x43f, 0x0, 0x0)
        ??:0 +0x53 fp=0xc84c00c308 sp=0xc84c00c2b0
tbg commented 8 years ago

Last runtime stats were

I160625 05:51:40.005810 server/status/runtime.go:160  runtime stats: 6.8 GiB RSS, 344 goroutines, 3.3 GiB active, 66315.43cgo/sec, 0.75/0.24 %(u/s)time, 0.00 %gc (0x)

I think the machines have ~7 GiB of RAM, so that points to the problem here. Some of the other nodes are similarly high:

ubuntu@ip-172-31-8-73:~$ free -m -h
             total       used       free     shared    buffers     cached
Mem:          7.3G       6.9G       450M        32K        57M       535M

with cockroach reporting just shy of 7 GiB RSS.

The cluster should still be working with a node down; it clearly isn't. The UI isn't accessible from the outside, so poking around that way is a bit awkward. In any case, some Raft logs are pretty long (I tried range 1, which didn't exist; range 2 gave the following):

ubuntu@ip-172-31-3-145:~$ sudo ./cockroach debug raft-log /mnt/data/ 2 | grep Index: | wc -l
11024
tbg commented 8 years ago

The running version is from Mon May 30 14:59:10 2016 -0400. I think it's missed out on a lot of recent goodness.

tbg commented 8 years ago

Some more random tidbits from one of the nodes:

ubuntu@ip-172-31-8-73:~$ curl -k https://localhost:8080/_status/ranges/local | grep raft_state | sort | uniq -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3844k    0 3844k    0     0  6541k      0 --:--:-- --:--:-- --:--:-- 6538k
    756       "raft_state": "StateCandidate"
    520       "raft_state": "StateFollower"
   1443       "raft_state": "StateLeader"
ubuntu@ip-172-31-8-73:~$ curl -k https://localhost:8080/debug/stopper
0xc8203472d0: 380 tasks
6      storage/replica.go:602
367    server/node.go:803 // <- oscillates +-200
3      storage/queue.go:383
2      storage/intent_resolver.go:302
1      ts/db.go:100

Rudimentary elinks-based poking at the debug/requests endpoint shows... well, just a lot of NotLeaderErrors (without a new lease holder hint). We need a way to access the admin port to make this debugging less painful.

In light of all the bugs we've fixed since May 30, I also think we should update that version ASAP. I'm not sure what our protocol is for this cluster - can I simply do that?

petermattis commented 8 years ago

I don't know what our protocol is either, but I would lean toward yes.

tbg commented 8 years ago

Ok. I'll pull a backup off the dataset and run last night's beta.

tbg commented 8 years ago

One node died a few minutes in with OOM, presumably due to snapshotting.

I160701 14:45:05.143248 storage/replica_raftstorage.go:524  generated snapshot for range 403 at index 3533112 in 31.747833551s. encoded size=1072891308, 6966 KV pairs, 1677671 log entries
fatal error: runtime: out of memory
petermattis commented 8 years ago

That raft log is ginormous. Why do we send the full raft log on snapshots?

tbg commented 8 years ago

I think this is probably a huge Raft log that was created prior to our truncation improvements, but which was picked up by the replication queue before the truncation queue. Maybe we should put a failsafe into snapshot creation (so that any snapshot which exceeds a certain size isn't even fully created)?

petermattis commented 8 years ago

Seems easier to only snapshot the necessary tail of the raft log. For a snapshot, I think we only have to send the portion past the applied index, which should be very small. Ah, strike that. Now I recall that raft log truncation is itself a raft operation.

Ok, I think putting a failsafe to avoid creating excessively large snapshots is reasonable. I'll file an issue.
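
For illustration, a failsafe of the sort discussed could look roughly like the sketch below. This is a hypothetical stand-in (the constant, names, and structure are invented, not the actual replica_raftstorage.go code), but it shows the idea of aborting snapshot generation once the encoded size crosses a limit instead of materializing a multi-GiB payload in memory:

package storage

import "errors"

// errSnapshotTooLarge is a hypothetical sentinel for snapshots that exceed the cap.
var errSnapshotTooLarge = errors.New("snapshot exceeds size limit; deferring to raft log truncation")

// maxSnapshotSize is an assumed cap; the real value would need tuning.
const maxSnapshotSize = 128 << 20 // 128 MiB

// buildSnapshotData sketches snapshot generation with an early bail-out: the
// encoded KV pairs and raft log entries are accumulated until the running size
// would cross maxSnapshotSize, at which point the caller could queue the range
// for aggressive log truncation instead of sending the snapshot.
func buildSnapshotData(encodedKVs, encodedLogEntries [][]byte) ([]byte, error) {
	var out []byte
	for _, group := range [][][]byte{encodedKVs, encodedLogEntries} {
		for _, chunk := range group {
			if len(out)+len(chunk) > maxSnapshotSize {
				return nil, errSnapshotTooLarge
			}
			out = append(out, chunk...)
		}
	}
	return out, nil
}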

tbg commented 8 years ago

It died again at the same range. I think that failsafe is worthwhile - it would give the truncation queue a chance to pick the range up first. The failsafe could even aggressively queue the truncation.

tbg commented 8 years ago

I'm a bit out of ideas as to how to proceed right now. In an ideal world, I could restart the cluster with upreplication turned off, and wait for the truncation queue to do its job. Instead, I'm periodically running for host in $(cat ~/production/hosts/register); do ssh $host supervisorctl -c supervisord.conf start cockroach; done in the hope that the truncation queue will at some point manage to get there first.

tamird commented 8 years ago

Is this happening on just one node? Are all ranges fully replicated to other nodes? Can you simply nuke this one node?

tbg commented 8 years ago

There's very little visibility since I can't access the admin UI from outside. Does anyone have experience setting up an SSH tunnel proxy?

If one node tries to send that snapshot, chances are the range is in the same state on the other nodes, or it's underreplicated. In both cases, nuking the first node won't help. I also think I saw two nodes die already.

petermattis commented 8 years ago

If you're running insecure you can do: ssh -L 8080:localhost:8080 <some-machine>. I did this earlier today without difficulty.

tbg commented 8 years ago

It's a secure cluster. I'll give it a try though.

petermattis commented 8 years ago

Should still work with a secure cluster.

tbg commented 8 years ago

It simply works, great. Thanks @petermattis. Would you mind running for host in $(cat ~/production/hosts/register); do ssh $host supervisorctl -c supervisord.conf start cockroach; done in a busy loop? Can't hurt and I'm about to go on the train.

petermattis commented 8 years ago

Where is this ~/production/hosts/register file?

tbg commented 8 years ago

It's my local clone of our non-public production repo.

petermattis commented 8 years ago

Got it.

tbg commented 8 years ago

I realized that I hadn't actually managed to run the updated version because supervisord needed to reload the config. I did that now, but the cluster is even more unhappy than before - the first range isn't being gossiped.

tbg commented 8 years ago

Restarted one of the nodes. Magically that seems to have brought the first range back in the game. Snapshot sending time.

tbg commented 8 years ago

Still in critical state, though. Had to restart one of the nodes again (to resuscitate first range gossip).

Sometimes things are relatively quiet, then large swaths of messages like these appear:

W160701 17:58:28.362804 raft/raft.go:593  [group 1967] 4 stepped down to follower since quorum is not active
W160701 17:58:29.131940 raft/raft.go:593  [group 1178] 4 stepped down to follower since quorum is not active
W160701 17:58:29.134338 raft/raft.go:593  [group 533] 4 stepped down to follower since quorum is not active
W160701 17:58:29.272449 raft/raft.go:593  [group 3162] 4 stepped down to follower since quorum is not active
W160701 17:58:29.274708 raft/raft.go:593  [group 711] 4 stepped down to follower since quorum is not active
W160701 17:58:29.276705 raft/raft.go:593  [group 1510] 7 stepped down to follower since quorum is not active
W160701 17:58:29.281201 raft/raft.go:593  [group 4096] 4 stepped down to follower since quorum is not active
W160701 17:58:29.281295 raft/raft.go:593  [group 1169] 4 stepped down to follower since quorum is not active
W160701 17:58:29.286367 raft/raft.go:593  [group 593] 4 stepped down to follower since quorum is not active
W160701 17:58:29.377702 raft/raft.go:593  [group 773] 4 stepped down to follower since quorum is not active
W160701 17:58:29.380577 raft/raft.go:593  [group 4637] 4 stepped down to follower since quorum is not active
W160701 17:58:29.471898 raft/raft.go:593  [group 634] 4 stepped down to follower since quorum is not active
W160701 17:58:29.567202 raft/raft.go:593  [group 1375] 7 stepped down to follower since quorum is not active
W160701 17:58:30.137139 raft/raft.go:593  [group 3516] 4 stepped down to follower since quorum is not active

I think those might have to do with us reporting "unreachable" to Raft every time the outgoing message queue is full (cc @tamird). Too bad I'm not running with the per-replica outboxes yet.
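
For context, the suspected pattern looks roughly like this sketch (hypothetical types and names, not the actual RaftTransport code): a non-blocking send on a bounded outgoing queue that, on overflow, drops the message and reports the peer unreachable to Raft, which is enough to make leaders step down when the queue backs up under load:

package storage

// raftGroup captures the single method this sketch needs; etcd/raft's RawNode
// exposes ReportUnreachable with this signature.
type raftGroup interface {
	ReportUnreachable(id uint64)
}

// outboundMsg stands in for a raft message headed to a remote replica.
type outboundMsg struct {
	to      uint64
	payload []byte
}

// sendOrReportUnreachable attempts a non-blocking send on the shared outgoing
// queue. If the queue is full, the message is dropped and the recipient is
// reported unreachable, which is the behavior suspected of causing the
// "quorum is not active" churn above.
func sendOrReportUnreachable(queue chan<- outboundMsg, rg raftGroup, msg outboundMsg) {
	select {
	case queue <- msg:
	default:
		rg.ReportUnreachable(msg.to)
	}
}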

tbg commented 8 years ago

If you're still running that loop, now's the time to stop it.

tbg commented 8 years ago

Pretty busted still. One node appears to be deadlocked. See my post in #3299 about the lack of visibility into the command queue. It's not currently possible to tell what's blocking this command.

goroutine 293158 [semacquire, 16 minutes]:
sync.runtime_Semacquire(0xc84af77f0c)
    /usr/local/go/src/runtime/sema.go:47 +0x26
sync.(*WaitGroup).Wait(0xc84af77f00)
    /usr/local/go/src/sync/waitgroup.go:127 +0xb4
github.com/cockroachdb/cockroach/storage.(*Replica).beginCmds(0xc820371c20, 0xc820c1d490, 0x1a87e90)
    /go/src/github.com/cockroachdb/cockroach/storage/replica.go:894 +0x3f6
github.com/cockroachdb/cockroach/storage.(*Replica).addWriteCmd(0xc820371c20, 0x7fef3dbcb6b8, 0xc8237702a0, 0x145d3f3921736fe0, 0x0, 0x0, 0x0, 0x407, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/storage/replica.go:1172 +0x13f
github.com/cockroachdb/cockroach/storage.(*Replica).Send(0xc820371c20, 0x7fef3dbcb6b8, 0xc8237702a0, 0x145d3f3921736fe0, 0x0, 0x0, 0x0, 0x407, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/storage/replica.go:801 +0x215
github.com/cockroachdb/cockroach/storage.(*Replica).CheckConsistency(0xc820371c20, 0xc82378df80, 0x25, 0x30, 0xc82378dfb0, 0x25, 0x30, 0x1, 0xc8203885a0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/storage/replica_command.go:1617 +0x3d0
github.com/cockroachdb/cockroach/storage.(*Replica).addAdminCmd(0xc820371c20, 0x7fef3dbcb6b8, 0xc823770120, 0x145d3f3921722af1, 0x0, 0x300000003, 0x6, 0x407, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/storage/replica.go:1076 +0x706
github.com/cockroachdb/cockroach/storage.(*Replica).Send(0xc820371c20, 0x7fef3dbcb6b8, 0xc823770120, 0x145d3f3921722af1, 0x0, 0x300000003, 0x6, 0x407, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/storage/replica.go:807 +0x5b3
github.com/cockroachdb/cockroach/storage.(*Store).Send(0xc8204a66c0, 0x7fef3dbcb6b8, 0xc823770120, 0x145d3f3921722af1, 0x0, 0x300000003, 0x6, 0x407, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/storage/store.go:1915 +0xd4c
github.com/cockroachdb/cockroach/storage.(*Stores).Send(0xc820379bc0, 0x7fef3dbcb6b8, 0xc8237700c0, 0x0, 0x0, 0x300000003, 0x6, 0x407, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/storage/stores.go:178 +0x4ff
github.com/cockroachdb/cockroach/server.(*Node).Batch.func3()
    /go/src/github.com/cockroachdb/cockroach/server/node.go:795 +0x559
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunTask(0xc8202f3a40, 0xc821375858, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:169 +0x129
github.com/cockroachdb/cockroach/server.(*Node).Batch(0xc820193dc0, 0x7fef3dbcb6b8, 0xc82378d860, 0xc820c1c4d0, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/server/node.go:807 +0x330
github.com/cockroachdb/cockroach/roachpb._Internal_Batch_Handler(0x1a25f40, 0xc820193dc0, 0x7fef3dbcb6b8, 0xc82378d860, 0xc83bcac880, 0x0, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/roachpb/api.pb.go:1525 +0x168
google.golang.org/grpc.(*Server).processUnaryRPC(0xc8203617a0, 0x7fef3db8cda0, 0xc8214b3e60, 0xc821525880, 0xc82037d280, 0x2654c20, 0x0, 0x0, 0x0)
    /go/src/google.golang.org/grpc/server.go:530 +0xeb5
google.golang.org/grpc.(*Server).handleStream(0xc8203617a0, 0x7fef3db8cda0, 0xc8214b3e60, 0xc821525880, 0x0)
    /go/src/google.golang.org/grpc/server.go:687 +0x109d
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc821000550, 0xc8203617a0, 0x7fef3db8cda0, 0xc8214b3e60, 0xc821525880)
    /go/src/google.golang.org/grpc/server.go:352 +0xa0
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/src/google.golang.org/grpc/server.go:353 +0x9a
tbg commented 8 years ago

The Store is also deadlocked (not unlikely once a replica is).

goroutine 697996 [semacquire, 2 minutes]:
sync.runtime_Semacquire(0xc8204a68cc)
    /usr/local/go/src/runtime/sema.go:47 +0x26
sync.(*Mutex).Lock(0xc8204a68c8)
    /usr/local/go/src/sync/mutex.go:83 +0x1c4
github.com/cockroachdb/cockroach/storage.(*Store).Send(0xc8204a66c0, 0x7fef3dbcb6b8, 0xc84bd22120, 0x145d408cbe1853a0, 0x0, 0x300000003, 0x11, 0x1, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/storage/store.go:1890 +0x988
github.com/cockroachdb/cockroach/storage.(*Stores).Send(0xc820379bc0, 0x7fef3dbcb6b8, 0xc84bd220c0, 0x0, 0x0, 0x300000003, 0x11, 0x1, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/storage/stores.go:178 +0x4ff
github.com/cockroachdb/cockroach/server.(*Node).Batch.func3()
    /go/src/github.com/cockroachdb/cockroach/server/node.go:795 +0x559
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunTask(0xc8202f3a40, 0xc820045868, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:169 +0x129
github.com/cockroachdb/cockroach/server.(*Node).Batch(0xc820193dc0, 0x7fef3dbcb6b8, 0xc83cdabe90, 0xc820cf76c0, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/server/node.go:807 +0x330
github.com/cockroachdb/cockroach/roachpb._Internal_Batch_Handler(0x1a25f40, 0xc820193dc0, 0x7fef3dbcb6b8, 0xc83cdabe90, 0xc830b707c0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/roachpb/api.pb.go:1525 +0x168
google.golang.org/grpc.(*Server).processUnaryRPC(0xc8203617a0, 0x7fef3db8cda0, 0xc82b620870, 0xc8201270a0, 0xc82037d280, 0x2654c20, 0x0, 0x0, 0x0)
    /go/src/google.golang.org/grpc/server.go:530 +0xeb5
google.golang.org/grpc.(*Server).handleStream(0xc8203617a0, 0x7fef3db8cda0, 0xc82b620870, 0xc8201270a0, 0x0)
    /go/src/google.golang.org/grpc/server.go:687 +0x109d
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc84be5f9f0, 0xc8203617a0, 0x7fef3db8cda0, 0xc82b620870, 0xc8201270a0)
    /go/src/google.golang.org/grpc/server.go:352 +0xa0
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/src/google.golang.org/grpc/server.go:353 +0x9a
tbg commented 8 years ago

Here's a gist of the goroutines of that node. https://gist.github.com/tschottdorf/e731c9a88bb90ba312dd7690d54da1c0

tbg commented 8 years ago

Seeing this one again too (cc @tamird):

goroutine 117324 [select, 6 minutes]:
github.com/cockroachdb/cockroach/storage.(*RaftTransport).RaftMessage(0xc8203e3710, 0x7f646de300c8, 0xc83ad2fa70, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/storage/raft_transport.go:149 +0x216
github.com/cockroachdb/cockroach/storage._MultiRaft_RaftMessage_Handler(0x194cbc0, 0xc8203e3710, 0x7f646dfc20f0, 0xc8653de800, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/storage/raft.pb.go:159 +0xd8
google.golang.org/grpc.(*Server).processStreamingRPC(0xc8203ce2d0, 0x7f646dfc1fa8, 0xc8217f2000, 0xc8653e61c0, 0xc8203ea900, 0x2650740, 0x0, 0x0, 0x0)
    /go/src/google.golang.org/grpc/server.go:607 +0x47a
google.golang.org/grpc.(*Server).handleStream(0xc8203ce2d0, 0x7f646dfc1fa8, 0xc8217f2000, 0xc8653e61c0, 0x0)
    /go/src/google.golang.org/grpc/server.go:691 +0x114e
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc821709ed0, 0xc8203ce2d0, 0x7f646dfc1fa8, 0xc8217f2000, 0xc8653e61c0)
    /go/src/google.golang.org/grpc/server.go:352 +0xa0
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/src/google.golang.org/grpc/server.go:353 +0x9a
tbg commented 8 years ago

I restarted the cluster with a 10s raft tick interval (since we saw mutex contention on ticking the replicas when the raft log was very large). Deadlocking was still rampant, but now that I've significantly slowed down the consistency checker, I was able to get a sign of life back from the cluster (this is also after ~50 restarts and random investigations):

ubuntu@ip-172-31-8-73:~$ ./cockroach zone --ca-cert=certs/ca.crt --cert=certs/root.client.crt --key=certs/root.client.key ls
.default

The logs seem relatively quiet. Let's hope for the best.

tbg commented 8 years ago

Ok, we seem to be working (though slow).

root@:26257> select * from registration.clusters limit 1;
+--------------------------------------------------------------------+----------------------------------------+-----------+----------+---------+-------+
|                                uuid                                |               timestamp                | firstName | lastName | company | email |
+--------------------------------------------------------------------+----------------------------------------+-----------+----------+---------+-------+
| "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" | 2016-03-29 20:04:52.283398 +0000 +0000 |           |          |         |       |
+--------------------------------------------------------------------+----------------------------------------+-----------+----------+---------+-------+
(1 row)
tbg commented 8 years ago

Ok, we're back to the living. A lot of material to investigate above.

I should mention that we're now running the latest master (I wanted to pick up @tamird's outgoing-raft-queue changes).

petermattis commented 8 years ago

Woot!

tbg commented 8 years ago

I'm pulling the backup I made earlier to my local machine so that we can spin up clusters with that mess in them (the machines themselves have very small HDDs and no tools installed to upload directly to the cloud, and someone's got to make full use of that nice fast cable we have in the office now).

tbg commented 8 years ago

One node died with:

E160701 21:07:33.563958 internal/client/txn.go:364  failure aborting transaction: context deadline exceeded; abort caused by: context deadline exceeded
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x140 pc=0x9ee630]

goroutine 766754 [running]:
panic(0x18a5e40, 0xc82000e100)
        /usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/cockroachdb/cockroach/internal/client.(*Txn).sendEndTxnReq(0x0, 0xc8cfa52f00, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:440 +0x50
github.com/cockroachdb/cockroach/internal/client.(*Txn).Rollback(0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:433 +0x3b
github.com/cockroachdb/cockroach/internal/client.(*Txn).CleanupOnError(0x0, 0x7f7e91681d98, 0xc859972210)
        /go/src/github.com/cockroachdb/cockroach/internal/client/txn.go:363 +0x92
github.com/cockroachdb/cockroach/sql.(*Executor).execRequest(0xc8201b4000, 0x7f7f680e4c58, 0xc84eed71c0, 0xc8239ee800, 0xc873bbfb89, 0x27, 0x0, 0x0, 0x0, 0xc82013a800)
        /go/src/github.com/cockroachdb/cockroach/sql/executor.go:501 +0xd07
github.com/cockroachdb/cockroach/sql.(*Executor).ExecuteStatements(0xc8201b4000, 0x7f7f680e4c58, 0xc84eed71c0, 0xc8239ee800, 0xc873bbfb89, 0x27, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/sql/executor.go:360 +0xf6
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).executeStatements(0xc840f33800, 0x7f7f680e4c58, 0xc84eed71c0, 0xc873bbfb89, 0x27, 0x0, 0x0, 0x0, 0x0, 0x7f7f6811fd01, ...)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:640 +0x98
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).handleSimpleQuery(0xc840f33800, 0x7f7f680e4c58, 0xc84eed71c0, 0xc840f33828, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:320 +0xe8
github.com/cockroachdb/cockroach/sql/pgwire.(*v3Conn).serve(0xc840f33800, 0xc874c1bb20, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/v3.go:275 +0x100c
github.com/cockroachdb/cockroach/sql/pgwire.(*Server).ServeConn(0xc8201ee2d0, 0x7f7f680e4d90, 0xc8666e8300, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/sql/pgwire/server.go:229 +0x98f
github.com/cockroachdb/cockroach/server.(*Server).Start.func8.1(0x7f7f680e4d30, 0xc853304420)
        /go/src/github.com/cockroachdb/cockroach/server/server.go:369 +0x42
github.com/cockroachdb/cockroach/util/netutil.(*Server).ServeWith.func1(0xc8200d0218, 0x7f7f680e4d30, 0xc853304420, 0xc82000e420)
        /go/src/github.com/cockroachdb/cockroach/util/netutil/net.go:131 +0x62
created by github.com/cockroachdb/cockroach/util/netutil.(*Server).ServeWith
        /go/src/github.com/cockroachdb/cockroach/util/netutil/net.go:133 +0x333

Filed https://github.com/cockroachdb/cockroach/issues/7595

tbg commented 8 years ago

The above NPE actually happened on two of the nodes.

On restart, unfortunately, there's more serious stuff. Looks like an incoming Raft message triggered lazy creation of a Raft group, which promptly panicked. The range [10, 10] suggests that we did have a truncated state in place, but commit 0 suggests that maybe there wasn't a HardState (?). cc @bdarnell

E160701 21:21:45.036857 raft/raft.go:925  [group 5161] 3 state.commit 0 is out of range [10, 10]
panic: [group 5161] 3 state.commit 0 is out of range [10, 10]

goroutine 174 [running]:
panic(0x161d900, 0xc820a2c780)
        /usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/cockroachdb/cockroach/storage.(*raftLogger).Panicf(0xc820a2c5f0, 0x1be60a0, 0x2b, 0xc8222f3c80, 0x4, 0x4)
        /go/src/github.com/cockroachdb/cockroach/storage/raft.go:117 +0x1ba
github.com/coreos/etcd/raft.(*raft).loadState(0xc8249ed520, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/coreos/etcd/raft/raft.go:925 +0x2a2
github.com/coreos/etcd/raft.newRaft(0xc824583c10, 0x10)
        /go/src/github.com/coreos/etcd/raft/raft.go:225 +0x8ff
github.com/coreos/etcd/raft.NewRawNode(0xc824583c10, 0x0, 0x0, 0x0, 0xc820f073d0, 0x0, 0x0)
        /go/src/github.com/coreos/etcd/raft/rawnode.go:76 +0xbf
github.com/cockroachdb/cockroach/storage.(*Replica).withRaftGroupLocked(0xc820f1fe00, 0xc824583db0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica.go:264 +0x1ca
github.com/cockroachdb/cockroach/storage.(*Replica).withRaftGroup(0xc820f1fe00, 0xc824583db0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica.go:302 +0x8e
github.com/cockroachdb/cockroach/storage.(*Store).handleRaftMessage(0xc82007eb40, 0xc824be4ea0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:2098 +0x678
github.com/cockroachdb/cockroach/storage.(*Store).(github.com/cockroachdb/cockroach/storage.handleRaftMessage)-fm(0xc824be4ea0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:966 +0x38
github.com/cockroachdb/cockroach/storage.(*RaftTransport).RaftMessage.func1.1.1(0x7fefe4e5f6b8, 0xc820f07fb0, 0xc8203b34d0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/raft_transport.go:139 +0x259
github.com/cockroachdb/cockroach/storage.(*RaftTransport).RaftMessage.func1.1()
        /go/src/github.com/cockroachdb/cockroach/storage/raft_transport.go:143 +0x48
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker.func1(0xc8203379d0, 0xc8206e8cf0)
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:142 +0x52
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:143 +0x62
tbg commented 8 years ago

Yep,

ubuntu@ip-172-31-3-145:~$ sudo ./cockroach debug check-store /mnt/data
range 5161: truncated index 10 should equal first index 0 - 1
range 5161: applied index 10 should be between first index 0 and last index 0

Maybe I messed something up in https://github.com/cockroachdb/cockroach/pull/7429?

tbg commented 8 years ago

The actual Raft log for this range is empty:

sudo ./cockroach debug raft-log /mnt/data 5161

but there is a range descriptor:

1467406975.668651933,0 /Local/Range/"\x04tsd\x12cr.node.exec.latency-1h-p99.99\x00\x01\x89\xf8\x068=2"/RangeDescriptor:
[/System/"tsd\x12cr.node.exec.latency-1h-p99.99\x00\x01\x89\xf8\x068=2", /System/"tsd\x12cr.node.exec.latency-1h-p99.999\x00\x01\x89\xf8\x06/l1")
        Raw:range_id:5161 start_key:"\004tsd\022cr.node.exec.latency-1h-p99.99\000\001\211\370\0068=2" end_key:"\004tsd\022cr.node.exec.latency-1h-p99.999\000\001\211\370\006/l1" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:4 store_id:4 replica_id:3 > next_replica_id:4

so the migration code in #7429 would, on startup, load the replica, see that it is "initialized", and create a truncated state (but not the rest of the initial Raft state).

We have the following keys:

/Local/RangeID/5161/r/AbortCache/"19a092cf-6bb6-499a-9d7a-57a20e525db5": key:"\001k\022\004tsd\022cr.node.exec.latency-1h-p99.99\000\377\001\211\370\0067\2114\000\001rdsc" timestamp:<wall_time:1466758804316100802 logical:0 > priority:923908
/Local/RangeID/5161/r/AbortCache/"3f9809c6-b7bc-423a-a0ff-fa3343f0a952": key:"\001k\022\004tsd\022cr.node.exec.latency-1h-p99.99\000\377\001\211\370\006762\000\001rdsc" timestamp:<wall_time:1466460002553358297 logical:0 > priority:999376
/Local/RangeID/5161/r/AbortCache/"52a29fd4-a8c8-45d6-980f-d49d0fa84db5": key:"\001k\022\004tsd\022cr.node.exec.latency-1h-p99.99\000\377\001\211\370\006874\000\001rdsc" timestamp:<wall_time:1467385706031819394 logical:0 > priority:369863
/Local/RangeID/5161/r/AbortCache/"750d237f-3734-42c9-b573-aa8bb5e6a2be": key:"\001k\022\004tsd\022cr.node.exec.latency-1h-p99.99\000\377\001\211\370\006864\000\001rdsc" timestamp:<wall_time:1467384091556108317 logical:0 > priority:1150559
/Local/RangeID/5161/r/AbortCache/"84b3f36b-f94c-46e5-8cc4-63658804db77": key:"\001k\022\004tsd\022cr.node.exec.latency-1h-p99.99\000\377\001\211\370\006864\000\001rdsc" timestamp:<wall_time:1467384081582788062 logical:0 > priority:1150559
/Local/RangeID/5161/r/AbortCache/"e475ef01-c141-4474-bbca-e2895d4ba4b6": key:"\001k\022\004tsd\022cr.node.exec.latency-1h-p99.99\000\377\001\211\370\006864\000\001rdsc" timestamp:<wall_time:1467384085166927868 logical:0 > priority:1150559
/Local/RangeID/5161/r/AbortCache/"f04fdc11-3a22-4c0e-88f9-0e665a90077d": key:"\001k\022\004tsd\022cr.node.exec.latency-1h-p99.99\000\377\001\211\370\006864\000\001rdsc" timestamp:<wall_time:1467384176485150309 logical:1 > priority:1150560
/Local/RangeID/5161/r"fzn-": false
/Local/RangeID/5161/r"lgc-": Type:EntryNormal Term:0 Index:0 : EMPTY

/Local/RangeID/5161/r/RaftAppliedIndex: 10
/Local/RangeID/5161/r/RaftTruncatedState: index:10 term:5
/Local/RangeID/5161/r/LeaseAppliedIndex: 0
/Local/RangeID/5161/r/RangeStats: last_update_nanos:1467406983294497855 intent_age:0 gc_bytes_age:0 live_bytes:68436218 live_count:-6165 key_bytes:-269756 key_count:-6165 val_bytes:68705974 val_count:-6165 intent_bytes:0 intent_count:0 sys_bytes:3818 sys_count:16
/Local/RangeID/5161/u/RaftHardState: term:6 vote:0 commit:0
/Local/RangeID/5161/u/RaftLastIndex: 10
/Local/RangeID/5161/u/RangeLastReplicaGCTimestamp: 1467400220.102689870,0
/Local/RangeID/5161/u/RangeLastVerificationTimestamp: 1464276110.450528716,0

and in fact everything is there except for a HardState. We definitely write a HardState in writeInitialState (i.e. when we create a new right-hand side of a split).

When we apply a snapshot, though, we don't write a HardState - the HardState is usually written in the same invocation of handleRaftReady, but not in the same RocksDB batch. Maybe the process crashed between applying the snapshot and writing the HardState, and we're vulnerable there?

I'm not sure the above all checks out, but maybe it does - the range again looks like one of those odd ranges that potentially contains only one key and thus never saw a new write after it initially split off.

cc @bdarnell

tamird commented 8 years ago

Not much to add here, but the RaftTransport-related stack trace you posted (goroutine 117324 [select, 6 minutes]:) is expected; that's one of the two goroutines that live on the remote side of a streaming RPC.

tbg commented 8 years ago

Ah, good point. I think I fell for that before.

tbg commented 8 years ago

I'm preparing a change to deal with the HardState appropriately. Appreciate feedback on the above.
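
A minimal sketch of the shape such a change could take (hypothetical helper and key names, not the actual patch): write the HardState through the same engine batch that applies the snapshot and truncated state, so a crash can never leave the truncated state behind without a matching HardState:

package storage

import "encoding/json"

// batchWriter abstracts the one operation this sketch needs from an engine batch.
type batchWriter interface {
	Put(key, value []byte) error
}

// hardState mirrors the fields of raftpb.HardState for the purposes of the sketch.
type hardState struct {
	Term, Vote, Commit uint64
}

// applySnapshotWithHardState writes the snapshot payload, the truncated state,
// and a HardState whose commit is at least the snapshot index through a single
// batch, so all three hit RocksDB atomically and the crash window described
// above is closed.
func applySnapshotWithHardState(b batchWriter, snapKey, truncKey, hardKey, snapData []byte,
	snapIndex, snapTerm uint64) error {
	if err := b.Put(snapKey, snapData); err != nil {
		return err
	}
	trunc, err := json.Marshal(struct{ Index, Term uint64 }{snapIndex, snapTerm})
	if err != nil {
		return err
	}
	if err := b.Put(truncKey, trunc); err != nil {
		return err
	}
	hs, err := json.Marshal(hardState{Term: snapTerm, Commit: snapIndex})
	if err != nil {
		return err
	}
	// Crucially, the HardState goes into the same batch rather than a later
	// handleRaftReady write.
	return b.Put(hardKey, hs)
}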

cuongdo commented 8 years ago

I brought the registration cluster back up as a newly initialized cluster so that new data can be recorded.

The prior data has been moved to /mnt/data/backup-20160705 on each machine. I wasn't able to reliably bring up any subset of the cluster to perform a SQL dump, so recovery of the existing data will need to wait until #7598 goes in.

Re-bootstrapping the cluster took longer than expected: the supervisord.conf file doesn't work for the first run. Every node kept trying to create a gossip connection to every other node without making any progress on bootstrapping. The solution was to omit the --join parameter from the cockroach command line for the first node; after that, every other node successfully joined the cluster.

Also, I needed to remove the 10s raft tick interval @tschottdorf put in to try to revive the "old" cluster.

tamird commented 8 years ago

Via @tschottdorf: the raw data is preserved in S3, but it has been dumped and imported into the new cluster.