docker-archive / classicswarm

Swarm Classic: a container clustering system. Not to be confused with Docker Swarm which is at https://github.com/docker/swarmkit
Apache License 2.0

Swarm Master and Replica No elected primary cluster manager error. #1491

Closed ankitsnlq closed 8 years ago

ankitsnlq commented 8 years ago

I have two Swarm managers: one is the primary and the other is its replica. But sometimes the master Swarm manager gets the role Replica, while the Primary row shows its own IP address as the primary. I'm using Consul for master election.

Here is what I get from docker info:

Containers: 26
Images: 25
Role: replica
Primary: 172.16.0.197:4000
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 2
 arun-Latitude-3550: 172.16.0.247:2375
  └ Containers: 17
  └ Reserved CPUs: 0 / 4
  └ Reserved Memory: 0 B / 8.099 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=3.13.0-68-generic, operatingsystem=Ubuntu 14.04.3 LTS, storagedriver=aufs
 raw1-VirtualBox: 172.16.0.236:2375
  └ Containers: 9
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 3.086 GiB
  └ Labels: executiondriver=native-0.2, aContainers: 26

In the output the Role is shown as replica, but the Primary line shows the manager's own IP address. So it got the role Replica even though it is the primary.

Output of the ip r l command:
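The condition described above (Role reported as replica while the Primary field holds the manager's own address) can be checked mechanically. The following is only a minimal sketch: the function names are hypothetical, and it assumes the `docker info` text has already been captured as a string in the `Key: value` layout shown in this report.

```python
def parse_swarm_info(text):
    """Return the Role and Primary fields found in captured `docker info` text."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so "Primary: 172.16.0.197:4000" keeps its port.
        key, sep, value = line.strip().partition(":")
        if sep and key in ("Role", "Primary"):
            fields[key] = value.strip()
    return fields


def is_stale_replica(text, own_address):
    """True when the manager reports Role: replica while Primary is its own address."""
    fields = parse_swarm_info(text)
    return fields.get("Role") == "replica" and fields.get("Primary") == own_address


# Excerpt of the output quoted in this report.
sample = """Containers: 26
Images: 25
Role: replica
Primary: 172.16.0.197:4000
Strategy: spread
"""
print(is_stale_replica(sample, "172.16.0.197:4000"))
```

The manager's own address (here 172.16.0.197:4000) has to be supplied by the caller; the sketch does not try to discover it.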

default via 172.16.0.1 dev eth0  proto static 
172.16.0.0/23 dev eth0  proto kernel  scope link  src 172.16.0.197  metric 1 
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1

In that situation, if I run docker ps against the Swarm manager, it shows:

Error response from daemon: No elected primary cluster manager

After some time the issue resolves itself: I get Role as Primary and the docker ps command starts working.

But after running docker ps -a I get:

An error occurred trying to connect: Get http://0.0.0.0:4000/v1.21/containers/json?all=1: EOF

dongluochen commented 8 years ago

@ankitsnlq I think you cover two issues here. The first one looks like a delay from Consul leader election: https://www.consul.io/docs/guides/leader-election.html. The second issue, with ps -a, should be fixed by #1465.

abronan commented 8 years ago

@ankitsnlq Unfortunately the delay for Leader Election (and in this case, I think, re-election) is specific to Consul. You can mitigate the issue by setting the --replication-ttl flag to a low value in case a re-election occurs, which seems to have been the case here (the Leader in Consul is not as stable over time as with etcd or zookeeper).
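While an election is still in progress, a client can also simply retry. The sketch below is a client-side workaround under stated assumptions, not part of Swarm and not a substitute for tuning --replication-ttl: the function name and the `run_command` callable are hypothetical, and the only thing taken from this thread is the error string.

```python
import time

NO_PRIMARY = "No elected primary cluster manager"


def run_when_primary_elected(run_command, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call run_command() until its output no longer reports a missing primary.

    run_command is a caller-supplied function returning the command output as text.
    Returns the first output without the error, or None if every attempt failed.
    """
    for attempt in range(attempts):
        output = run_command()
        if NO_PRIMARY not in output:
            return output
        # Wait before the next attempt, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None


# Stand-in for a real call, such as running `docker -H :4000 ps` through subprocess.
responses = iter([
    "Error response from daemon: No elected primary cluster manager",
    "CONTAINER ID   IMAGE   COMMAND",
])
print(run_when_primary_elected(lambda: next(responses), delay_seconds=0))
```

The number of attempts and the delay are arbitrary; they would need to be chosen to cover the election delay actually observed in a given setup.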

The issue with docker ps -a should be fixed with #1465, can you try with the latest swarm:master? If using the Swarm Image you can pull dockerswarm:swarm:latest.

ankitsnlq commented 8 years ago

@abronan Yes, I'm using dockerswarm:swarm:latest and still have the same issue. It begins after the No Leader election issue starts. I will try the --replication-ttl flag and post an update if the issue occurs again.

abronan commented 8 years ago

@ankitsnlq Oh wait, I failed to catch the docker ps -a log at the end. It seems like this is a legitimate connection issue.

Can you give us more information about your setup? Are you using docker-machine to create the cluster? If not, are you setting up your Managers to use TLS (I see the Primary Manager using port 4000 but the agents exposed on port 2375)?

I can't be sure without more information.

ankitsnlq commented 8 years ago

I'm not using Docker Machine. I did all the setup in VirtualBox with the latest Docker, Swarm and Compose. Yes, my Swarm manager container is running on port 4000, and the agent containers (that is, the join command) use 2375. Things were working fine with this setup. TLS is not used because I was testing in a local environment, so no security was configured.

aluzzardi commented 8 years ago

Oh - it's dockerswarm/swarm:master actually

zengnjin commented 8 years ago

I have the same problem now! I'm not sure when it happened.

abronan commented 8 years ago

Hi @zengnjin, which one?

The Leader Election issue or the error trying to connect?

Leader election issue on store failure with Consul was fixed by #1552. Please make sure you update to swarm:master or pull dockerswarm/swarm:master.

abronan commented 8 years ago

Hi @ankitsnlq. Any update on this one? Did you manage to make it work with master?

abronan commented 8 years ago

Closing, both issues should be fixed by now in master. Feel free to open a new issue if you still encounter any problem. Thanks for reporting!

EamonZhang commented 8 years ago

Same problem.

$ docker -H :4000
Containers: 0
Images: 0
Server Version: swarm/1.1.3
Role: replica
Primary:
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 0
Kernel Version: 3.10.0-327.el7.x86_64
Operating System: linux
CPUs: 0
Total Memory: 0 B
Name: e141a4722f25

$docker images
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
docker.io/swarm latest 81127fe5e9b4 6 weeks ago 18.11 MB

$ docker -H :4000 ps
Error response from daemon: No elected primary cluster manager

$docker logs
level=error msg="client: etcd cluster is unavailable or misconfigured

abronan commented 8 years ago

Hi @EamonZhang, are you using Consul? In this case, can you try with swarm:1.2.0? Thanks.

EamonZhang commented 8 years ago

Hi @abronan, I am using etcd instead of Consul, and I have the same problem. This time my Swarm version is 1.2.0.

abronan commented 8 years ago

@EamonZhang Oh, I actually missed that piece of the logs. level=error msg="client: etcd cluster is unavailable or misconfigured suggests that the manager somehow can't connect to the etcd server through the client. Are you using any kind of special setup for your etcd cluster, like Proxy mode or something else?

EamonZhang commented 8 years ago

Hi @abronan
The etcd problem was solved, but the original problem still exists.

[root@node1 opt]# docker -H :4000 ps
Error response from daemon: No elected primary cluster manager

[root@node1 opt]# docker -H :4000 info
Containers: 0
Images: 0
Server Version: swarm/1.2.0
Role: replica
Primary:
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 0
Kernel Version: 3.10.0-327.el7.x86_64
Operating System: linux
CPUs: 0
Total Memory: 0 B
Name: 98519415381a

PS: it tests OK with Consul.

gregkeys commented 8 years ago

I'm getting this error when I connect to a manager replica. I do not get the error if I connect to the primary manager.

My setup consists of 9 servers: 3 for Consul, 3 for Swarm managers, and 3 for miscellaneous front-end services. I am using AWS with private IPs and 3 subnets, one in each zone (a, b, c).

Assume the managers are set up as follows:
manager.zone-a.01 - replica
manager.zone-b.02 - replica
manager.zone-c.03 - primary

When I connect to the swarm through a replica, using something like this:

eval "$(docker-machine env --swarm manager.zone-a.01)"
or
eval "$(docker-machine env --swarm manager.zone-b.02)"

I get the error: Error response from daemon: No elected primary cluster manager

When I connect to the primary:

eval "$(docker-machine env --swarm manager.zone-c.03)"

I don't get the error and everything works.