etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

etcd service failed due to sync duration? #8722

Closed featheryus closed 6 years ago

featheryus commented 6 years ago

Here is another case of etcd failing. From the log I'm not sure whether it is the same as #8707, so I am reporting a new issue and uploading all the related logs to make things clear. It may be a similar issue, but the different scenario and the new logs may give you a new angle for tracing it. Thanks.

There are three nodes: node-0, node-1, and snode-2. process_etcd_user runs on all three nodes. It is a user of etcd and supervises the etcd service: if it can't access etcd for a while, it triggers a node reboot to recover.
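The supervision rule described above can be sketched roughly as follows. This is only an illustration, not the actual process_etcd_user code: the class name `EtcdWatchdog`, the `grace_seconds` parameter, and its 60-second default are assumptions made for the example.

```python
import time


class EtcdWatchdog:
    """Tracks how long etcd has been continuously unreachable.

    Hypothetical helper: names and the default grace period are
    illustrative, not taken from process_etcd_user.
    """

    def __init__(self, grace_seconds=60.0, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self.clock = clock
        self.first_failure = None  # time of the first failure in the current streak

    def record(self, ok):
        """Record one probe result; return True when recovery should be triggered."""
        if ok:
            # Any successful request ends the failure streak.
            self.first_failure = None
            return False
        now = self.clock()
        if self.first_failure is None:
            self.first_failure = now
        return now - self.first_failure >= self.grace_seconds
```

With a rule like this, a single slow request (such as one curl timeout) does not reboot the node; only an uninterrupted run of failures longer than the grace period does.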

The problem is that when the whole cluster (node-0, node-1, snode-2) is restarted, the etcd service is unstable.

Oct 20 10:58:14 node-0 tipc_node_get_node[1162]: set value into etcd failed: 58632 (etcd can't work)
Oct 20 10:58:15 node-0 tipc_node_get_node[1162]: Node id 1 successfully assigned to node-0 (etcd works)
Oct 20 10:58:29 node-0 etcd[911]: sync duration of 1.034111027s, expected less than 1s
Oct 20 10:58:29 node-0 process_etcd_user[1976]: 52489700:_send_request__r:346: curl_easy_perform send request to etcd by url=http://127.0.0.1:2379/v2/keys/dha/supervision/node-0/etcd?quorum=true failed, curl ret=28 (etcd not works)
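To see how often and by how much the sync warning fires across the attached logs, the lines can be scanned with a small script. This is a minimal sketch that assumes the message format shown in the excerpt above; the function names are made up for the example.

```python
import re

# Matches the warning as it appears in the log excerpt, e.g.
# "sync duration of 1.034111027s, expected less than 1s"
SYNC_RE = re.compile(
    r"sync duration of ([0-9a-zµ.]+), expected less than ([0-9a-zµ.]+)"
)

# Go-style duration units, converted to seconds.
UNITS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
         "s": 1.0, "m": 60.0, "h": 3600.0}
PART_RE = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def parse_go_duration(text):
    """Convert a duration such as '1.034111027s' or '1m2.5s' to seconds."""
    parts = PART_RE.findall(text)
    # Reject input that is not made up entirely of number+unit pairs.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError("unrecognized duration: %r" % text)
    return sum(float(num) * UNITS[unit] for num, unit in parts)


def slow_syncs(lines):
    """Return (observed_seconds, limit_seconds) for each sync warning found."""
    found = []
    for line in lines:
        match = SYNC_RE.search(line)
        if match:
            found.append((parse_go_duration(match.group(1)),
                          parse_go_duration(match.group(2))))
    return found
```

Passing the lines of a log file to `slow_syncs` gives a list of observed durations, which makes it easier to tell an isolated spike from a sustained slowdown.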

All the logs are attached: etcd_node_0.log, etcd_node_1.log, etcd_snode_2.log

gyuho commented 6 years ago
sync duration of ...

This means fsyncs are taking longer than they should on your machine. This can lead to leader elections, which make the cluster unavailable.
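One rough way to check whether the disk itself is slow is to time fsync calls directly on the volume that holds the etcd data directory. The sketch below is not how etcd measures it; the function name, write count, and write size are assumptions chosen for illustration.

```python
import os
import tempfile
import time


def measure_fsync(directory, writes=50, size=2048):
    """Write small chunks to a temp file in `directory`, timing each fsync.

    Returns a list of fsync latencies in seconds.
    """
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        payload = b"\0" * size
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the data to stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


if __name__ == "__main__":
    # Assumption: replace the temp directory with a path on the etcd data volume.
    results = measure_fsync(tempfile.gettempdir(), writes=20)
    print("max fsync latency: %.6fs" % max(results))
```

Running it while the cluster restarts would show whether fsync latency on that volume approaches the 1s limit from the warning even without etcd's own writes.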

featheryus commented 6 years ago

Then, do you have any suggestions for avoiding this problem? Apart from a slow disk, is there any other possible cause? BTW, we use SSDs in our system. Thanks.

gyuho commented 6 years ago

The etcd cluster could be overloaded. Do you have the output of /metrics?

gyuho commented 6 years ago

Please inspect /metrics output with https://github.com/coreos/etcd/blob/master/Documentation/op-guide/monitoring.md.
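As a starting point for reading that output, the sketch below pulls the histogram buckets for `etcd_disk_wal_fsync_duration_seconds` out of the Prometheus text format and reports what fraction of WAL fsyncs finished within a given time. The URL assumes the local client endpoint seen in the log above, and the helper names and any threshold passed in are illustrative only.

```python
import re
import urllib.request

# One histogram bucket line in Prometheus text format, e.g.
# etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 90
BUCKET_RE = re.compile(r'^(\w+)_bucket\{.*?le="([^"]+)".*?\}\s+(\S+)$')


def fetch_metrics(url="http://127.0.0.1:2379/metrics", timeout=5):
    """Download the raw metrics text (URL is an assumption for a local member)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def histogram_buckets(text, name):
    """Return sorted (upper_bound_seconds, cumulative_count) pairs for `name`."""
    buckets = []
    for line in text.splitlines():
        match = BUCKET_RE.match(line.strip())
        if match and match.group(1) == name:
            # float() also accepts the "+Inf" bound used by the last bucket.
            buckets.append((float(match.group(2)), float(match.group(3))))
    return sorted(buckets)


def fraction_at_or_below(buckets, threshold):
    """Fraction of observations in buckets whose bound is <= threshold."""
    total = buckets[-1][1] if buckets else 0.0
    if total == 0:
        return None
    within = max((count for le, count in buckets if le <= threshold), default=0.0)
    return within / total
```

For example, `fraction_at_or_below(histogram_buckets(fetch_metrics(), "etcd_disk_wal_fsync_duration_seconds"), 0.008)` gives the share of WAL fsyncs that completed within the 0.008s bucket, assuming that bucket bound is present in the output.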

And reopen if you still think it's an etcd issue.