docker-archive / classicswarm

Swarm Classic: a container clustering system. Not to be confused with Docker Swarm which is at https://github.com/docker/swarmkit
Apache License 2.0

Deaf nodes would like to join the party #1622

Closed fmarmori closed 8 years ago

fmarmori commented 8 years ago

Swarm (v1.0.1) nodes are not listening on port 2375 as expected.

vagrant@node02:~$ docker info
Containers: 3
Images: 35
Server Version: 1.9.1
Storage Driver: vfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.2.0-29-virtual
Operating System: <unknown>
CPUs: 1
Total Memory: 988.7 MiB
Name: node02
ID: 3D7F:BN2N:UVNB:NXR5:O7LJ:GETA:2DJL:PI7D:BKM4:FING:I5XV:PWGE
vagrant@node02:~$ uname -a
Linux node02 3.2.0-29-virtual #46-Ubuntu SMP Fri Jul 27 17:23:50 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
vagrant@node02:~$ docker ps
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS                          PORTS                                                                                                                                                          NAMES
132ea657481e        swarm                    "/swarm join --advert"   26 minutes ago      Up 26 minutes                   0.0.0.0:2375->2375/tcp                                                                                                                                         swarm
3a4229aea10f        gliderlabs/registrator   "/bin/registrator con"   About an hour ago   Restarting (1) 25 minutes ago                                                                                                                                                                  registrator
35d7309b4da6        progrium/consul          "/bin/start -server -"   About an hour ago   Up About an hour                0.0.0.0:53->53/tcp, 0.0.0.0:8300-8302->8300-8302/tcp, 0.0.0.0:8400->8400/tcp, 0.0.0.0:8500->8500/tcp, 0.0.0.0:8301-8302->8301-8302/udp, 0.0.0.0:8600->53/udp   consul
vagrant@node02:~$ sudo netstat -lptn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:52205           0.0.0.0:*               LISTEN      671/rpc.statd
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      652/rpcbind
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1672/sshd
tcp6       0      0 :::2375                 :::*                    LISTEN      5888/docker-proxy
tcp6       0      0 :::8300                 :::*                    LISTEN      870/docker-proxy
tcp6       0      0 :::8301                 :::*                    LISTEN      854/docker-proxy
tcp6       0      0 :::48333                :::*                    LISTEN      671/rpc.statd
tcp6       0      0 :::8302                 :::*                    LISTEN      836/docker-proxy
tcp6       0      0 :::111                  :::*                    LISTEN      652/rpcbind
tcp6       0      0 :::8400                 :::*                    LISTEN      823/docker-proxy
tcp6       0      0 :::8500                 :::*                    LISTEN      813/docker-proxy
tcp6       0      0 :::53                   :::*                    LISTEN      879/docker-proxy
tcp6       0      0 :::22                   :::*                    LISTEN      1672/sshd
vagrant@node02:~$ docker exec -it swarm bash
root@132ea657481e:/var/lib/docker/vfs/dir/132ea657481eb6602a9bd31d171cb111ebeeb335ecbd1765aca73bba2d371112# ps aux | grep swarm
root      5894  0.0  0.5  19684  6016 ?        Ssl  11:08   0:00 /swarm join --advertise=192.168.100.12:2375 consul://node02.local:8500/swarm
vagrant   6171  0.1  1.3 138704 13496 pts/0    Sl+  11:44   0:00 docker exec -it swarm bash
root      6193  0.0  0.0   6504   628 pts/2    S+   11:44   0:00 grep --color=auto swarm
root@132ea657481e:/var/lib/docker/vfs/dir/132ea657481eb6602a9bd31d171cb111ebeeb335ecbd1765aca73bba2d371112# netstat -lptn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
<empty>

In consul everything is fine. The join process boots correctly (no errors), but the manager just fails to connect, which is no surprise given that nothing is listening inside the container. Here is the output of the manager:

level=error msg="Get http://192.168.100.12:2375/v1.15/info: dial tcp 192.168.100.12:2375: getsockopt: connection refused"
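
One way to narrow down a "connection refused" like this is to check which program actually owns the port on the host. The sketch below is illustrative only: `port_owner` is a hypothetical helper name, and it assumes the `netstat -lptn` column layout shown above (local address in column 4, state in column 6, PID/program in column 7).

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -lptn` output on stdin and print the
# program name of every listener bound to the given TCP port.
port_owner() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            # The port is the last colon-separated piece of the local
            # address, which works for both 0.0.0.0:22 and :::2375.
            n = split($4, parts, ":")
            if (parts[n] == port) {
                sub(/^[0-9]+\//, "", $7)   # drop the "PID/" prefix
                print $7
            }
        }'
}

# Example (run on the node):
#   sudo netstat -lptn | port_owner 2375
```

If this prints docker-proxy, the port belongs to a container's published port mapping rather than to the docker daemon itself.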

abronan commented 8 years ago

@fmarmori We included better node management in swarm:master, so nodes are now marked as unhealthy rather than not being registered at all. You can therefore see those nodes in docker info in the pending state, waiting for validation.

So if the manager can't connect to the Swarm node, it will still register the node and output the error message, which helps when debugging connectivity issues like the one you described.

I'm guessing that this is what you're looking for but let me know if I missed an important detail :smile:

fmarmori commented 8 years ago

@abronan Thank you for the quick feedback. I fully understand the purpose. However, in my installation I can't get any state information about the nodes from the master. This is the output of docker info on the master:

Containers: 0
Images: 0
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 0
CPUs: 0
Total Memory: 0 B
Name: 80f9c8ac62e

I can also add that going back to version 1.0.0 does not solve the issue.

abronan commented 8 years ago

Did you try with swarm 1.0.1, or with swarm:master by building the latest binary? The change was only just included, so maybe you had an outdated build?

/cc @dongluochen Any idea?

Make sure you pull the latest swarm image (dockerswarm:master) or build the latest binary. If the problem persists with master, could you please provide the full output of swarm manage with the --debug flag? That would be helpful. Thanks!

dongluochen commented 8 years ago

@fmarmori I want to clarify how you start swarm join on node02. Can you show your command? I see the port mapping 0.0.0.0:2375->2375/tcp, which doesn't look right here. Swarm join doesn't need to open a host TCP port. It just registers the docker daemon's service endpoint in your consul. What is the docker daemon's TCP port? Make sure --advertise equals node02_ip:daemon_port.

CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS                          PORTS                                                                                                                                                          NAMES
132ea657481e        swarm                    "/swarm join --advert"   26 minutes ago      Up 26 minutes                   0.0.0.0:2375->2375/tcp                                                                                                                                         swarm

fmarmori commented 8 years ago

@dongluochen I think the mapping is required since I'm testing a multi-host deployment. Here is the full docker run command of one of my nodes:

docker run -d -p 2375:2375 swarm join --advertise=192.168.100.11:2375 consul://192.168.100.10:8500/swarm

dongluochen commented 8 years ago

@fmarmori This doesn't look right. Swarm join only registers your docker daemon with consul. On node02 there should be two processes. The first is the docker daemon, which you should be able to reach externally with docker -H node02:docker_port info. The second is swarm join, which registers node02:docker_port in consul. I don't think port 2375 is your docker daemon port; otherwise your command docker run -d -p 2375:2375 swarm join --advertise=192.168.100.11:2375 consul://192.168.100.10:8500/swarm would fail, because port 2375 would already be in use by the docker daemon. Please run docker -H node02:2375 info to validate.
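
The two-process layout described above could be sketched roughly as follows. This is a sketch under assumptions, not an official recipe: the function names are made up for illustration, and the IP, port, and consul URL are simply the values quoted in this thread. Note that the generated join command carries no -p mapping, since swarm join does not listen on any port.

```shell
#!/bin/sh
# Hypothetical helpers; addresses used in the examples come from this thread.

# Print the address swarm join should advertise: <node_ip>:<daemon_port>.
advertise_addr() {
    printf '%s:%s\n' "$1" "$2"
}

# Print (do not run) a swarm join command without any -p port mapping.
join_cmd() {
    node_ip=$1
    daemon_port=$2
    discovery=$3
    printf 'docker run -d swarm join --advertise=%s %s\n' \
        "$(advertise_addr "$node_ip" "$daemon_port")" "$discovery"
}

# Return 0 only if a docker daemon answers on the given TCP address.
daemon_reachable() {
    docker -H "tcp://$1" info >/dev/null 2>&1
}

# Example:
#   daemon_reachable 192.168.100.11:2375 || echo "daemon is not on TCP"
#   join_cmd 192.168.100.11 2375 consul://192.168.100.10:8500/swarm
```

Checking `daemon_reachable` first makes the failure visible on the node itself, before the manager ever tries to connect.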

fmarmori commented 8 years ago

@dongluochen Thanks for the hint. I found the issue: my docker vhosts were listening only on the unix socket. I fixed it in the /etc/default/docker file by adding an additional -H tcp://<node-ip>:2375. Thanks for your support, guys.
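
For readers hitting the same problem, the fix might look something like the sketch below. It assumes an Ubuntu-style package where /etc/default/docker is read through a DOCKER_OPTS variable and the default socket lives at /var/run/docker.sock; the thread itself only confirms the file name and the extra -H flag. `docker_opts_line` is a hypothetical helper that just prints the line to put in that file.

```shell
#!/bin/sh
# Hypothetical helper: print a DOCKER_OPTS line that keeps the local unix
# socket and adds a TCP listener on <node_ip>:<port>.
# Assumption: the init script sources /etc/default/docker and uses DOCKER_OPTS.
docker_opts_line() {
    printf 'DOCKER_OPTS="-H unix:///var/run/docker.sock -H tcp://%s:%s"\n' "$1" "$2"
}

# Example (on the node, then restart the daemon and verify):
#   docker_opts_line 192.168.100.12 2375   # paste the result into /etc/default/docker
#   sudo service docker restart
#   docker -H tcp://192.168.100.12:2375 info
```

Keeping the unix socket entry means the plain local docker client continues to work alongside the new TCP endpoint.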