discordianfish opened this issue 8 years ago
This can be reliably reproduced. My cluster consists of small DO instances. It seems like formatting a 5 GB partition overloads the cluster: on the peer that has the partition attached, load jumps to ~14 due to IO wait. I guess this is expected, with nbd stalling and etcd re-electing. I've raised the election timeout to 5s, but that didn't really help. I'll give it some more time to see if it recovers, but then I'm pretty much out of options. Something is causing the whole VM to lock up for seconds, which I assume causes the problems, and there seems to be an issue somewhere preventing it from recovering.
Interesting. This appears to be an interaction between etcd and Torus, mostly on the etcd side. I'm adding @xiang90, who might have some insight into the etcd logs.
> It seems like something is causing the whole VM to lock up for seconds, which I assume causes the problems. And there seems to be an issue somewhere preventing it from recovering.
If the VM is down, then etcd is down. We probably need to figure out why the VM goes down: does etcd take the VM down, or does the VM take etcd down?
Incidentally, Torus reports that it was operating normally, but then requests to etcd timed out, so many that 30s or more passed and the lease was lost. 30s is an eon; something is going on.
On the Torus side we can fail faster. There's probably a bug in that a request like this should have a shorter client-side timeout, but that's a symptom, not a root cause.
@xiang90 Let me be more specific: while watching the logs and running the mkfs.ext4 via ssh, I noticed that my ssh session became unresponsive. That could also just mean a networking issue, but combined with the high load it felt like the VM completely locked up. According to the DigitalOcean graphs, though, overall utilization was pretty low. Let me know if there is anything I can do to help debug this. If you can't reproduce it locally, you can use my packer+terraform templates: https://github.com/5pi/infra
I can reproduce what appears to be this issue. Issuing
`sudo sgdisk --zap-all -n 1 -t 1:8300 /dev/nbd0 && sudo mkfs.ext4 /dev/nbd0p1`
on a 10 GiB volume results in the lease timing out, and also appears to cause blocks to go missing, leaving a completely unusable volume that has to be recreated.
Hi,
I'm running etcd 3.0.3 with TLS and Torus with my TLS patches from #292. I can create the cluster, init it, add storage nodes, get the ring, and create volumes, all fine. I can also attach the volume, but when I try to format it, torusblk throws errors:
The etcd log shows, during that time:
On a peer, the log looks like this:
`etcdctl cluster-health` shows:
After stopping torusblk, the cluster comes back and is healthy again. But if I then try to start torusblk again, I get these errors:
Even though the ring is there: