featheryus closed this issue 6 years ago
sync duration of ...
This message means fsyncs are taking longer than they should on your machine. That can lead to leader elections, during which the cluster is unavailable.
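To get a rough feel for fsync latency on the disk that holds the etcd data directory, a small probe like the one below can help. This is only a sketch: the function name `measure_fsync_latency`, the 2 KiB payload, and the sample count are illustrative assumptions, and it approximates etcd's WAL write pattern rather than reproducing it.

```python
import os
import tempfile
import time


def measure_fsync_latency(directory, samples=20, payload=b"x" * 2048):
    """Write small chunks to a temp file in `directory` and time each fsync.

    Returns a list of fsync durations in seconds. The payload size and
    sample count are arbitrary choices for illustration.
    """
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the written data to stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies
```

Calling it with the directory where etcd stores its data (on the same disk) and looking at the largest values gives a first indication of whether the disk itself is slow; consistently high numbers on an otherwise idle machine point at storage rather than at etcd.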
Then do you have any suggestions for avoiding this problem? Apart from a slow disk, is there any other possible cause? BTW, we use SSDs in our system. Thanks.
The etcd cluster could be overloaded. Do you have the output of /metrics?
Please inspect the /metrics output using https://github.com/coreos/etcd/blob/master/Documentation/op-guide/monitoring.md.
And reopen if you still think it's an etcd issue.
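For anyone reading along, one way to inspect the /metrics output for slow fsyncs is to estimate the 99th percentile of the `etcd_disk_wal_fsync_duration_seconds` histogram. The sketch below assumes the endpoint returns the usual Prometheus text format; the helper names, the example URL, and the sample numbers are made up for illustration, and the interpolation mirrors the general histogram-quantile approach rather than any etcd tooling.

```python
import math
import re
import urllib.request

# Histogram exposed by etcd on /metrics for WAL fsync latency.
WAL_FSYNC = "etcd_disk_wal_fsync_duration_seconds"

_LE_RE = re.compile(r'le="([^"]+)"')


def fetch_metrics(url):
    """Fetch the raw /metrics text. The URL is supplied by the caller,
    for example http://127.0.0.1:2379/metrics (an assumed address)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_buckets(metrics_text, name):
    """Return sorted (upper_bound, cumulative_count) pairs for one histogram."""
    prefix = name + "_bucket{"
    buckets = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        labels, _, value = line[len(prefix):].partition("}")
        match = _LE_RE.search(labels)
        if match is None:
            continue
        bound = math.inf if match.group(1) == "+Inf" else float(match.group(1))
        buckets.append((bound, float(value.split()[0])))
    return sorted(buckets)


def estimate_quantile(buckets, q):
    """Estimate quantile q by linear interpolation inside the matching bucket.

    Returns None when the histogram is empty.
    """
    if not buckets or buckets[-1][1] == 0:
        return None
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper bound (or empty bucket) to interpolate in.
                return prev_bound if math.isinf(bound) else bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up sample in Prometheus text format, for illustration only.
SAMPLE = """
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 50
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 90
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 98
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 100
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 100
"""

p99 = estimate_quantile(parse_buckets(SAMPLE, WAL_FSYNC), 0.99)
```

With the made-up sample, `p99` comes out at about 0.006 seconds. The etcd documentation suggests the p99 of this metric should stay under roughly 10 ms; values well above that are consistent with the "sync duration" warning. Against a live member, replace `SAMPLE` with `fetch_metrics(...)` pointed at that member's client URL.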
Here is another case of etcd failing. From the log, I'm not sure whether it is the same as #8707, so I am reporting a new issue and uploading all the related logs to make things clear. It may be a similar issue, but the scenario is different, and the new logs may give you a new angle for tracing it. Thanks.
Also, there are three instance nodes: node-0, node-1, and snode-2. process_etcd_user runs on all three. It is a client of etcd that supervises the etcd service: if it cannot access etcd for a while, it triggers a node reboot to recover.
The problem is that when the whole cluster (node-0, node-1, snode-2) is restarted, the etcd service is unstable.
All the logs are attached: etcd_node_0.log etcd_node_1.log etcd_snode_2.log