Hi @fritzb, thank you for reporting the issue.
What is the etcd version you are using? (Swarm does not support etcd < 2.0.0.)
Also, can you still reproduce with swarm:1.2.0? (I see that you are still using swarm 1.1.3.)
From the logs I would say that there are either connection drops between the Swarm Manager and the etcd server or between the etcd servers themselves (the getsFail stat seems rather high to me), or that something external is trying to modify and delete the key (which seems unlikely). I also see that some delete operations failed, so this might be related: the leader key is deleted on expiration, but if that operation fails, other nodes will not be able to take the lock, which can leave the cluster stuck.
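For context, the election boils down to a leader key with a TTL in the store. Below is a rough sketch of that pattern against the etcd v2 API (not Swarm's actual code, which goes through the libkv/leadership packages); the key path, value, TTL and endpoint are placeholders. The key is created only if absent and must be refreshed before the TTL runs out, so failed refreshes or deletes translate directly into the lost/acquired churn in the logs.

    // Rough sketch of the TTL-based leader key pattern (not Swarm's actual code,
    // which goes through the libkv/leadership packages). Key path, value, TTL and
    // endpoint below are placeholders.
    package main

    import (
        "context"
        "log"
        "time"

        "github.com/coreos/etcd/client"
    )

    func main() {
        c, err := client.New(client.Config{Endpoints: []string{"http://127.0.0.1:2379"}})
        if err != nil {
            log.Fatal(err)
        }
        kapi := client.NewKeysAPI(c)

        const key = "/docker/swarm/leader" // placeholder path
        const me = "172.31.197.249:2370"   // this manager's advertised address
        ttl := 20 * time.Second

        // Try to become leader: create the key only if it does not already exist.
        if _, err := kapi.Set(context.Background(), key, me, &client.SetOptions{
            TTL:       ttl,
            PrevExist: client.PrevNoExist,
        }); err != nil {
            log.Printf("another manager holds the lock: %v", err)
            return
        }

        // While leader, keep refreshing the TTL. If these refreshes fail
        // (connection drops, fd exhaustion, a proxy cutting the connection),
        // the key expires and the other managers race for it again, which is
        // exactly the lost/acquired churn seen in the logs.
        for range time.Tick(ttl / 2) {
            if _, err := kapi.Set(context.Background(), key, me, &client.SetOptions{
                TTL:       ttl,
                PrevExist: client.PrevExist,
            }); err != nil {
                log.Printf("lost leadership: %v", err)
                return
            }
        }
    }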
Can't say for sure as each situation is a bit specific, especially regarding network connectivity between the Manager and the store.
I would say:
Thanks.
Hi @abronan
I'm using etcd 2.3.1 and will upgrade to swarm 1.2. The swarm manager is configured to talk to the local etcd on the same machine, so a network connection problem between the manager and etcd is unlikely. There may be network problems between the etcd servers, but I didn't see any failures in the etcd logs.
I tried your first suggestion (I'm still waiting for a proper window to do the second):
I deleted the leader key manually on the leader node, and I do see several watch errors and leadership losses, but it eventually stabilizes:
time="2016-04-18T20:35:24Z" level=info msg="New leader elected: 172.31.197.249:2370" time="2016-04-18T20:35:28Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-18T20:36:18Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-18T20:37:10Z" level=info msg="New leader elected: 172.31.197.249:2370" time="2016-04-18T20:37:12Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-18T20:37:22Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-18T20:37:29Z" level=info msg="New leader elected: 172.31.197.248:2370" time="2016-04-18T20:37:34Z" level=info msg="New leader elected: 172.31.197.249:2370" time="2016-04-18T20:37:35Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-18T20:37:39Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-18T20:41:58Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-18T20:42:02Z" level=error msg="Discovery error: Unexpected watch error" time="2016-04-18T20:42:13Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-18T20:42:45Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-18T20:43:01Z" level=info msg="New leader elected: 172.31.197.248:2370" time="2016-04-18T20:43:16Z" level=error msg="Discovery error: Unexpected watch error" time="2016-04-18T20:43:29Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-18T20:46:07Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-18T20:46:26Z" level=info msg="New leader elected: 172.31.197.248:2370"
I noticed a few failures again today (see below) and am planning to upgrade to swarm 1.2.
time="2016-04-18T21:32:57Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-18T21:33:13Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-18T21:33:29Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-18T21:33:47Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-18T21:35:21Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-18T21:35:34Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-18T23:22:40Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-18T23:22:55Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-18T23:23:22Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-18T23:23:35Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-18T23:27:11Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-18T23:27:29Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-19T00:23:01Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-19T00:23:16Z" level=error msg="Discovery error: Unexpected watch error" time="2016-04-19T00:23:22Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-19T00:24:59Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-19T00:25:09Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-19T21:53:00Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-19T21:53:16Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-19T21:54:14Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-19T21:54:29Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-19T21:56:14Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-19T21:56:30Z" level=info msg="New leader elected: 172.31.197.249:2370" time="2016-04-19T21:57:10Z" level=info msg="New leader elected: 172.31.197.249:2370" time="2016-04-19T21:57:57Z" level=info msg="New leader elected: 172.31.197.248:2370" time="2016-04-19T21:58:32Z" level=error msg="Discovery error: Unexpected watch error" time="2016-04-19T21:59:28Z" level=info msg="New leader elected: 172.31.197.249:2370" time="2016-04-19T22:01:55Z" level=info msg="New leader elected: 172.31.197.248:2370" time="2016-04-19T22:02:35Z" level=info msg="New leader elected: 172.31.197.248:2370" time="2016-04-19T22:03:10Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-19T22:04:41Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-19T22:05:02Z" level=info msg="New leader elected: 172.31.197.248:2370" time="2016-04-19T22:05:23Z" level=info msg="New leader elected: 172.31.197.249:2370" time="2016-04-19T22:07:03Z" level=info msg="New leader elected: 172.31.197.248:2370" time="2016-04-19T22:07:24Z" level=info msg="New leader elected: 172.31.197.248:2370"
Thanks
The problem is still seen after upgrading to swarm 1.2.
Any further request to the engine fails until the cluster stabilizes: "No elected primary cluster manager". I wonder if we can do better here. Are there timeout values that we can tune?
time="2016-04-19T22:35:47Z" level=info msg="Registered Engine swsdk27 at swsdk27.insieme.local:2375" time="2016-04-19T22:35:51Z" level=info msg="Registered Engine swsdk26 at swsdk26.insieme.local:2375" time="2016-04-19T22:43:33Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-19T22:43:44Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-19T22:43:49Z" level=error msg="Discovery error: Unexpected watch error" time="2016-04-19T22:44:03Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-19T22:44:16Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-19T22:45:18Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-19T22:45:28Z" level=error msg="Discovery error: Unexpected watch error" time="2016-04-19T22:46:18Z" level=info msg="Leader Election: Cluster leadership acquired" time="2016-04-19T22:46:51Z" level=info msg="Leader Election: Cluster leadership lost" time="2016-04-19T22:47:05Z" level=info msg="New leader elected: 172.31.197.249:2370" time="2016-04-19T22:47:47Z" level=info msg="New leader elected: 172.31.197.248:2370"
The issue is resolved. The workload running on the primary cluster manager was consuming too many resources (file descriptors), which caused the problem. Moving the workload to another node solved it.
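For anyone hitting the same symptom, a quick way to check for that kind of file-descriptor pressure on the manager host is something like the following Linux-only sketch; the pid argument is whatever process you suspect (the manager or the noisy workload).

    // Linux-only diagnostic sketch: count a process's open file descriptors and
    // dump its limits to spot fd exhaustion like the one described above.
    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        pid := os.Args[1]

        fds, err := os.ReadDir("/proc/" + pid + "/fd")
        if err != nil {
            fmt.Println("cannot read fd table:", err)
            return
        }
        fmt.Printf("pid %s has %d open file descriptors\n", pid, len(fds))

        // The "Max open files" row in this output is the ceiling to compare against.
        if limits, err := os.ReadFile("/proc/" + pid + "/limits"); err == nil {
            fmt.Print(string(limits))
        }
    }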
I am facing the same issue (Docker Swarm 1.12.3) when running a three-node Swarm cluster that connects to a Consul cluster through HAProxy (HAProxy load balancing: active, backup, backup). Could you tell me how to solve the problem? The cluster changes its leader manager every minute. :(
Dec 23 18:29:20 10.10.11.228 swarm_stag_2[830]: time="2016-12-23T18:29:20Z" level=info msg="Leader Election: Cluster leadership lost"
Dec 23 18:29:20 10.10.11.228 swarm_stag_2[830]: time="2016-12-23T18:29:20Z" level=info msg="New leader elected: 10.10.11.229:4000"
Dec 23 18:29:20 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:29:20Z" level=info msg="Leader Election: Cluster leadership acquired"
Dec 23 18:29:50 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:29:50Z" level=info msg="Leader Election: Cluster leadership lost"
Dec 23 18:29:50 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:29:50Z" level=info msg="New leader elected: 10.10.11.228:4000"
Dec 23 18:29:50 10.10.11.228 swarm_stag_2[830]: time="2016-12-23T18:29:50Z" level=info msg="Leader Election: Cluster leadership acquired"
Dec 23 18:30:20 10.10.11.228 swarm_stag_2[830]: time="2016-12-23T18:30:20Z" level=info msg="Leader Election: Cluster leadership lost"
Dec 23 18:30:20 10.10.11.228 swarm_stag_2[830]: time="2016-12-23T18:30:20Z" level=info msg="New leader elected: 10.10.11.229:4000"
Dec 23 18:30:20 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:30:20Z" level=info msg="Leader Election: Cluster leadership acquired"
Dec 23 18:30:50 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:30:50Z" level=info msg="Leader Election: Cluster leadership lost"
Dec 23 18:30:51 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:30:51Z" level=info msg="Leader Election: Cluster leadership acquired"
Dec 23 18:31:21 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:31:21Z" level=info msg="Leader Election: Cluster leadership lost"
Dec 23 18:31:21 10.10.11.228 swarm_stag_2[830]: time="2016-12-23T18:31:21Z" level=info msg="Leader Election: Cluster leadership acquired"
Dec 23 18:31:21 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:31:21Z" level=info msg="New leader elected: 10.10.11.228:4000"
Dec 23 18:31:51 10.10.11.228 swarm_stag_2[830]: time="2016-12-23T18:31:51Z" level=info msg="Leader Election: Cluster leadership lost"
Dec 23 18:31:51 10.10.11.228 swarm_stag_2[830]: time="2016-12-23T18:31:51Z" level=info msg="New leader elected: 10.10.11.229:4000"
Dec 23 18:31:51 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:31:51Z" level=info msg="Leader Election: Cluster leadership acquired"
Dec 23 18:32:21 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:32:21Z" level=info msg="Leader Election: Cluster leadership lost"
Dec 23 18:32:21 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:32:21Z" level=info msg="Leader Election: Cluster leadership acquired"
Dec 23 18:32:51 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:32:51Z" level=info msg="Leader Election: Cluster leadership lost"
Dec 23 18:32:51 10.10.11.228 swarm_stag_2[830]: time="2016-12-23T18:32:51Z" level=info msg="Leader Election: Cluster leadership acquired"
Dec 23 18:32:51 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:32:51Z" level=info msg="New leader elected: 10.10.11.228:4000"
Dec 23 18:33:21 10.10.11.228 swarm_stag_2[830]: time="2016-12-23T18:33:21Z" level=info msg="Leader Election: Cluster leadership lost"
Dec 23 18:33:21 10.10.11.228 swarm_stag_2[830]: time="2016-12-23T18:33:21Z" level=info msg="New leader elected: 10.10.11.229:4000"
Dec 23 18:33:21 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:33:21Z" level=info msg="Leader Election: Cluster leadership acquired"
Dec 23 18:33:51 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:33:51Z" level=info msg="Leader Election: Cluster leadership lost"
Dec 23 18:33:51 10.10.11.229 swarm_stag_3[830]: time="2016-12-23T18:33:51Z" level=info msg="New leader elected: 10.10.11.228:4000"
Dec 23 18:33:51 10.10.11.228 swarm_stag_2[830]: time="2016-12-23T18:33:51Z" level=info msg="Leader Election: Cluster leadership acquired"
Hi,
This can be related to the timeout values set on HAProxy.
Yes. If Swarm connects to Consul directly, the Swarm cluster is normal, but when Swarm connects to Consul through HAProxy it has the issue above. I have changed the keep-alive and timeout options in the HAProxy configuration, but it is not any better. Swarm connects to Consul over HTTP. I don't think this problem is related to a timeout. Does anybody run highly available Consul behind HAProxy? Please give me some suggestions.
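One thing worth noting: the leadership flips in your logs land almost exactly every 30 seconds, which looks like a long-lived connection being cut rather than plain load. The Consul lock pattern relies on a session with a TTL that is renewed over long-lived HTTP connections (plus blocking watch queries), and a proxy with a short timeout server / timeout client can silently kill those, letting the session expire. A minimal sketch of that pattern with the Consul API client; address, key, value and TTL are placeholders, not Swarm's actual values:

    // Minimal sketch of the Consul lock pattern (session with a TTL, periodic
    // renewal, lock acquired with that session) using the official API client.
    // Address, key, value and TTL are placeholders, not Swarm's actual values.
    package main

    import (
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        cfg := api.DefaultConfig()
        cfg.Address = "10.10.11.230:8500" // placeholder: the HAProxy front end

        client, err := api.NewClient(cfg)
        if err != nil {
            log.Fatal(err)
        }

        // A session with a short TTL; it must be renewed faster than it expires.
        sessionID, _, err := client.Session().Create(&api.SessionEntry{
            Name: "swarm-leader-example",
            TTL:  "15s",
        }, nil)
        if err != nil {
            log.Fatal(err)
        }

        // Acquire the leader key with that session.
        acquired, _, err := client.KV().Acquire(&api.KVPair{
            Key:     "docker/swarm/leader", // placeholder path
            Value:   []byte("10.10.11.228:4000"),
            Session: sessionID,
        }, nil)
        if err != nil || !acquired {
            log.Fatalf("could not take the lock: %v", err)
        }

        // Renewal runs over long-lived HTTP connections to Consul; if a proxy in
        // the middle cuts them, the session expires and leadership bounces.
        done := make(chan struct{})
        if err := client.Session().RenewPeriodic("15s", sessionID, nil, done); err != nil {
            log.Printf("session renewal stopped: %v", err)
        }
    }

If HAProxy has to stay in the path, raising timeout client / timeout server well above the session TTL may help; pointing each Swarm manager at a local Consul agent instead of the proxy is another common way to sidestep this.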
You might need to adjust the timeout values in etcd. See here https://coreos.com/etcd/docs/latest/tuning.html
Hi All,
I need help to debug a cluster leadership lost problem that we are facing. I have 3 machines; each machine runs a swarm manager and etcd. The 3 etcds form a cluster, and the docker swarm managers are configured as primary/replicas.
I don't see any errors in the etcd logs; however, the swarm manager logs do show etcd watch errors and loss of leadership. Could someone let me know the appropriate debug commands to troubleshoot this?
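As a starting point, the counters mentioned earlier (getsFail, watchers, and so on) come from the etcd v2 stats endpoints, which every member serves over HTTP; /v2/stats/leader is only answered by the current leader. A rough way to pull them from each node, with placeholder endpoints:

    // Rough sketch: pull the etcd v2 stats (getsFail, watchers and friends)
    // from every member. Endpoints are placeholders.
    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        endpoints := []string{"http://127.0.0.1:2379"} // placeholder: one per etcd node
        paths := []string{"/v2/stats/self", "/v2/stats/store", "/v2/stats/leader"}

        for _, ep := range endpoints {
            for _, p := range paths {
                resp, err := http.Get(ep + p)
                if err != nil {
                    fmt.Println(ep+p, "error:", err)
                    continue
                }
                body, _ := io.ReadAll(resp.Body)
                resp.Body.Close()
                fmt.Println(ep+p, "->", string(body))
            }
        }
    }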
Thanks, Fritz
ETCD stats on the leader: