coreos / torus

Torus Distributed Storage
https://coreos.com/blog/torus-distributed-storage-by-coreos.html
Apache License 2.0

Formatting nbd device fails and causes etcd to become stale. #294

Open discordianfish opened 8 years ago

discordianfish commented 8 years ago

Hi,

I'm running etcd 3.0.3 with TLS and torus with my TLS patches from #292. I can create the cluster, init it, add storage nodes, get the ring, and create volumes - all fine. I can also attach to a volume, but when I try to format it, torusblk throws errors:

2016-07-29 07:00:17.434147 W | torus: failed to create or renew lease: rpc error: code = 4 desc = context deadline exceeded
2016-07-29 07:00:17.434320 W | torus: couldn't register heartbeat: no lease
2016-07-29 07:00:19.105473 W | torus: couldn't update peerlist: rpc error: code = 4 desc = context deadline exceeded
2016-07-29 07:00:30.492683 W | torus: failed to create or renew lease: rpc error: code = 4 desc = context deadline exceeded
2016-07-29 07:00:30.492809 W | torus: couldn't register heartbeat: no lease

etcd log shows during that time:

Jul 29 07:00:36 master0 etcd[2940]: 83a72509fc4c19fc [term: 23] received a MsgVote message with higher term from 42c9c3f43d2962e6 [term: 24]
Jul 29 07:00:36 master0 etcd[2940]: 83a72509fc4c19fc became follower at term 24
Jul 29 07:00:36 master0 etcd[2940]: 83a72509fc4c19fc [logterm: 20, index: 18081, vote: 0] voted for 42c9c3f43d2962e6 [logterm: 20, index: 18094] at term 24
Jul 29 07:00:36 master0 etcd[2940]: raft.node: 83a72509fc4c19fc elected leader 42c9c3f43d2962e6 at term 24
Jul 29 07:00:45 master0 etcd[2940]: closed an existing TCP streaming connection with peer 42c9c3f43d2962e6 (stream MsgApp v2 writer)
Jul 29 07:00:45 master0 etcd[2940]: established a TCP streaming connection with peer 42c9c3f43d2962e6 (stream MsgApp v2 writer)
Jul 29 07:00:45 master0 etcd[2940]: closed an existing TCP streaming connection with peer 8c62c59fae85f284 (stream MsgApp v2 writer)
Jul 29 07:00:45 master0 etcd[2940]: established a TCP streaming connection with peer 8c62c59fae85f284 (stream MsgApp v2 writer)
Jul 29 07:00:45 master0 etcd[2940]: closed an existing TCP streaming connection with peer 8c62c59fae85f284 (stream Message writer)
Jul 29 07:00:45 master0 etcd[2940]: established a TCP streaming connection with peer 8c62c59fae85f284 (stream Message writer)
Jul 29 07:01:14 master0 etcd[2940]: failed to write 42c9c3f43d2962e6 on stream Message (write tcp 10.130.0.1:2380->10.130.2.1:59042: write: broken pipe)
Jul 29 07:01:14 master0 etcd[2940]: peer 42c9c3f43d2962e6 became inactive
Jul 29 07:01:14 master0 etcd[2940]: lost the TCP streaming connection with peer 42c9c3f43d2962e6 (stream Message writer)
Jul 29 07:01:17 master0 etcd[2940]: peer 42c9c3f43d2962e6 became active
Jul 29 07:01:17 master0 etcd[2940]: established a TCP streaming connection with peer 42c9c3f43d2962e6 (stream Message writer)

On a peer, the log looks like this:

Jul 29 07:00:36 master1 etcd[3073]: 8c62c59fae85f284 became follower at term 24
Jul 29 07:00:36 master1 etcd[3073]: 8c62c59fae85f284 [logterm: 20, index: 18094, vote: 0] voted for 42c9c3f43d2962e6 [logterm: 20, index: 18094] at term 24
Jul 29 07:00:36 master1 etcd[3073]: raft.node: 8c62c59fae85f284 elected leader 42c9c3f43d2962e6 at term 24
Jul 29 07:00:45 master1 etcd[3073]: lost the TCP streaming connection with peer 83a72509fc4c19fc (stream Message reader)
Jul 29 07:00:45 master1 etcd[3073]: failed to read 83a72509fc4c19fc on stream Message (read tcp 10.130.1.1:56732->10.130.0.1:2380: i/o timeout)
Jul 29 07:00:45 master1 etcd[3073]: peer 83a72509fc4c19fc became inactive
Jul 29 07:00:45 master1 etcd[3073]: lost the TCP streaming connection with peer 83a72509fc4c19fc (stream MsgApp v2 reader)
Jul 29 07:00:45 master1 etcd[3073]: peer 83a72509fc4c19fc became active
Jul 29 07:00:45 master1 etcd[3073]: established a TCP streaming connection with peer 83a72509fc4c19fc (stream MsgApp v2 reader)
Jul 29 07:00:45 master1 etcd[3073]: established a TCP streaming connection with peer 83a72509fc4c19fc (stream Message reader)
Jul 29 07:01:13 master1 etcd[3073]: lost the TCP streaming connection with peer 83a72509fc4c19fc (stream MsgApp v2 reader)

etcdctl cluster-health shows:

member 42c9c3f43d2962e6 is healthy: got healthy result from https://10.130.2.1:2379
member 83a72509fc4c19fc is unhealthy: got unhealthy result from https://10.130.0.1:2379
member 8c62c59fae85f284 is healthy: got healthy result from https://10.130.1.1:2379
cluster is healthy

After stopping torusblk, the cluster recovers and is healthy again. But if I then try to start torusblk again, I get these errors:

torusblk nbd pg1 /dev/nbd0
2016-07-29 07:38:34.438029 W | distributor: block br 10000000003 : 5 : 1 from 95a33249-5577-11e6-9387-04013eff5a01 failed, trying next peer
2016-07-29 07:38:34.439979 W | distributor: block br 10000000003 : 5 : 1 from 9c0ca7bf-5577-11e6-8dc9-04013eff5901 failed, trying next peer
2016-07-29 07:38:34.450848 W | distributor: block br 10000000003 : 5 : 1 from 984db060-5577-11e6-9928-04013eff5b01 failed, trying next peer

Even though the ring is there:

root@master0:~# torusctl list-peers
ADDRESS                  UUID                                  SIZE    USED     MEMBER  UPDATED        REB/REP DATA
http://10.130.1.1:40000  95a33249-5577-11e6-9387-04013eff5a01  20 GiB  7.5 MiB  OK      5 seconds ago  0 B/sec
http://10.130.2.1:40000  984db060-5577-11e6-9928-04013eff5b01  20 GiB  8.0 MiB  OK      3 seconds ago  0 B/sec
http://10.130.0.1:40000  9c0ca7bf-5577-11e6-8dc9-04013eff5901  20 GiB  12 MiB   OK      2 seconds ago  0 B/sec
Balanced: true Usage:  0.05%
root@master0:~# torusctl ring get
Ring: Ketama
Replication:2
Peers:
    uuid:"95a33249-5577-11e6-9387-04013eff5a01" address:"http://10.130.1.1:40000" last_seen:1469789633831746707 total_blocks:40960 rebalance_info:<> protocol_version:1 
    uuid:"984db060-5577-11e6-9928-04013eff5b01" address:"http://10.130.2.1:40000" last_seen:1469789633675466023 total_blocks:40960 rebalance_info:<> protocol_version:1 
    uuid:"9c0ca7bf-5577-11e6-8dc9-04013eff5901" address:"http://10.130.0.1:40000" last_seen:1469789633558752482 total_blocks:40960 rebalance_info:<> protocol_version:1 
discordianfish commented 8 years ago

This can be reliably reproduced. My cluster consists of small DO instances, and it looks like formatting a 5GB partition overloads the cluster. On the peer that has the partition attached, the load jumps to ~14 due to IO wait. I guess that's expected, with nbd stalling while etcd is re-electing. I've raised the election timeout to 5s, but that didn't really help. I'll give it some more time to see if it recovers, but after that I'm pretty much out of options. It seems like something is causing the whole VM to lock up for seconds, which I assume is what causes the problems, and there seems to be an issue somewhere that prevents it from recovering.
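
For reference, the election timeout change was roughly this (standard etcd flag, value in milliseconds, all other flags omitted):

# election timeout raised from the 1000ms default to 5s; the etcd tuning docs want
# --election-timeout to be at least ~10x --heartbeat-interval (default 100ms),
# so 5000 is comfortably above that
etcd --election-timeout=5000 ...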

barakmich commented 8 years ago

Interesting. This appears to be an interaction between etcd and Torus -- mostly on the etcd side. I'm adding @xiang90, who might have some insight into the etcd logs.

xiang90 commented 8 years ago

It seems like something is causing the whole VM to lock up for seconds, which I assume is what causes the problems, and there seems to be an issue somewhere that prevents it from recovering.

If the VM is down, then etcd is down. We probably need to figure out why the VM goes down: does etcd cause the VM to go down, or does the VM cause etcd to go down?

barakmich commented 8 years ago

Incidentally, Torus is saying that it was operating normally, but then some requests to etcd timed out -- so much so that 30s or more passed and the lease was lost. 30s is an eon; something is going on.

On the Torus side we can fail faster -- there's arguably a bug in that a request like this should have a shorter client-side timeout -- but that's a symptom, not the root cause.
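
To separate the two, it would help to time the lease API directly while the format is running -- something along these lines, assuming the etcdctl 3.x binary with the v3 API enabled and placeholder cert paths:

# time a v3 lease grant against the same endpoints while mkfs runs
time ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.130.0.1:2379,https://10.130.1.1:2379,https://10.130.2.1:2379 \
  --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/client.pem --key=/etc/etcd/client-key.pem \
  lease grant 30
# then keep it alive, using the lease ID printed by the grant; gaps or errors in the
# renewals printed here would line up with torus' "failed to create or renew lease" messages
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.130.0.1:2379,https://10.130.1.1:2379,https://10.130.2.1:2379 \
  --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/client.pem --key=/etc/etcd/client-key.pem \
  lease keep-alive <lease-id-from-the-grant>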

discordianfish commented 8 years ago

@xiang90 Let me be more specific: while watching the logs and running mkfs.ext4 via ssh, I noticed that my ssh session had become stale. That could just mean it's a networking issue, but combined with the high load it felt like the machine completely locked up. According to the DigitalOcean graphs, though, general utilization was pretty low. Let me know if there is anything I can do to help debug this. If you can't reproduce it locally, you can use my packer+terraform templates: https://github.com/5pi/infra
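
If it helps, these are the kinds of numbers that could be captured on the node with the nbd device attached while the mkfs runs (standard tools, nothing torus-specific):

# IO wait and run queue over time
vmstat 1
# per-device utilization and latency (iostat is in the sysstat package)
iostat -x 1
# kernel hung-task warnings, which would match a VM that "locks up" for seconds
dmesg | grep -i "blocked for more than"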

mischief commented 7 years ago

I can reproduce what appears to be this issue.

Issuing sudo sgdisk --zap-all -n 1 -t 1:8300 /dev/nbd0 && sudo mkfs.ext4 /dev/nbd0p1 on a 10GiB volume results in the lease timing out, and it also appears to cause blocks to go missing, leaving a completely unusable volume that has to be recreated.
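
As an untested idea for softening the write burst, spreading the ext4 initialization out and running the format at idle IO priority might make this less violent:

# idle IO scheduling class plus lazy ext4 init, so the bulk of the inode table and
# journal initialization is deferred instead of hammering the nbd device up front
sudo ionice -c 3 mkfs.ext4 -E lazy_itable_init=1,lazy_journal_init=1 /dev/nbd0p1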