docker-archive / for-aws


Containers communication is failing between nodes #166

Open mayacher opened 6 years ago

mayacher commented 6 years ago

Expected behavior

Containers on a node newly added to the cluster should be able to communicate with containers on the existing cluster nodes.

Actual behavior

Although the node is part of the swarm and the security groups allow ports 2377, 4789 and 7946, containers on different nodes are unable to telnet to each other. Telnet between hosts works. For example, if port 8545 is published, it is reachable on the host IP from outside and from other hosts, but containers on other hosts are unable to connect to it. Name resolution works. Containers on the new node can reach containers on the existing swarm nodes, but not vice versa.
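Worth noting: telnet only exercises TCP, while Swarm also depends on UDP. Per the Docker documentation, Swarm uses TCP 2377 for cluster management, TCP and UDP 7946 for communication among nodes, and UDP 4789 for overlay network (VXLAN) traffic. A minimal sketch that separates the ports a TCP probe can confirm from the ones it cannot might look like this. The `SWARM_PORTS` table and both helper names are illustrative assumptions, not something from the original report.

```python
import socket

# Ports Docker Swarm needs open between nodes (per the Docker docs).
# TCP 2377 is only served by manager nodes.
SWARM_PORTS = [
    ("tcp", 2377, "cluster management"),
    ("tcp", 7946, "communication among nodes"),
    ("udp", 7946, "communication among nodes"),
    ("udp", 4789, "overlay network (VXLAN) traffic"),
]


def ports_for(protocol):
    """Return the Swarm port numbers that use the given protocol."""
    return [port for proto, port, _ in SWARM_PORTS if proto == protocol]


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

Calling `tcp_reachable(peer_ip, port)` for each port in `ports_for("tcp")` reproduces the telnet check. The ports in `ports_for("udp")` need a different test (for example a UDP listener on the far side), since a successful TCP connection says nothing about them.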

Information

I keep seeing these log messages on all nodes (managers and workers):
Aug 19 16:10:06 docker-swarm-worker2 dockerd[25146]: time="2018-08-19T16:10:06.500461160Z" level=warning msg="memberlist: Was able to connect to 06ad93a6dfb1 but other probes failed, network may be misconfigured"
Aug 19 16:10:09 docker-swarm-worker2 dockerd[25146]: time="2018-08-19T16:10:09.499979293Z" level=warning msg="memberlist: Was able to connect to 06ad93a6dfb1 but other probes failed, network may be misconfigured"
Aug 19 16:10:11 docker-swarm-worker2 dockerd[25146]: time="2018-08-19T16:10:11.499986339Z" level=warning msg="memberlist: Was able to connect to dc8b9e79cbdf but other probes failed, network may be misconfigured"
Aug 19 16:10:13 docker-swarm-worker2 dockerd[25146]: time="2018-08-19T16:10:13.500044397Z" level=warning msg="memberlist: Was able to connect to dc8b9e79cbdf but other probes failed, network may be misconfigured"
Aug 19 16:10:14 docker-swarm-worker2 dockerd[25146]: time="2018-08-19T16:10:14.500187598Z" level=warning msg="memberlist: Was able to connect to 06ad93a6dfb1 but other probes failed, network may be misconfigured
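To see which peers these warnings point at, the log lines can be tallied by peer ID. The sketch below assumes the dockerd log format quoted above; the regular expression and the function name are illustrative, not part of Docker or memberlist.

```python
import re
from collections import Counter

# Matches the memberlist warning quoted in this report and captures the peer ID.
PROBE_WARNING = re.compile(
    r"memberlist: Was able to connect to (\S+) but other probes failed"
)


def count_failed_probes(log_text):
    """Count memberlist probe-failure warnings per peer ID."""
    return Counter(PROBE_WARNING.findall(log_text))
```

For the five lines quoted above this gives 3 warnings for `06ad93a6dfb1` and 2 for `dc8b9e79cbdf`. As far as I can tell from the memberlist source, this warning is logged when a fallback TCP ping succeeds although the regular probes did not, which would fit TCP being allowed between the nodes while UDP is dropped somewhere on the path.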

There is no documentation about this error.

Steps to reproduce the behavior

  1. Create a swarm with 2 managers on a private subnet (NAT), spread over 2 availability zones.
  2. Add 2 workers on a private subnet, spread over 2 availability zones.
  3. Add another node on a public subnet (joined to the cluster using its private IP).

docker network ls
f2f4fb3b7faf   bridge                     bridge    local
4dxl2odb9a40   coinsmarketplace_default   overlay   swarm
62f44bb057c2   docker_gwbridge            bridge    local
bd41f1a32a17   host                       host      local
g4gf7vvcfdat   ingress                    overlay   swarm
bc13d12817b6   none                       null      local

Thanks,

mayacher commented 6 years ago

Yeah, it turned out that the network ACL, not the security groups, caused the issue.
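For anyone hitting the same thing: AWS network ACLs are stateless and are evaluated in ascending rule-number order, with the first matching rule deciding and an implicit deny if nothing matches. The sketch below mimics that evaluation to show how an ACL can let TCP checks pass while dropping the UDP ports Swarm uses. The exact rule that was at fault is not stated in this thread, so `EXAMPLE_RULES` is purely hypothetical, and the model ignores CIDR ranges and traffic direction for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclRule:
    number: int      # lower numbers are evaluated first
    protocol: str    # "tcp", "udp", or "all"
    port_from: int
    port_to: int
    allow: bool


def acl_allows(rules, protocol, port):
    """Mimic network ACL evaluation: first matching rule in number order wins."""
    for rule in sorted(rules, key=lambda r: r.number):
        protocol_matches = rule.protocol in ("all", protocol)
        if protocol_matches and rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False  # implicit deny when no rule matches


# Hypothetical inbound rule set that admits TCP only (an assumption for
# illustration, not the configuration from this issue).
EXAMPLE_RULES = [AclRule(100, "tcp", 0, 65535, True)]
```

With these example rules, `acl_allows(EXAMPLE_RULES, "tcp", 2377)` is true while `acl_allows(EXAMPLE_RULES, "udp", 4789)` and `acl_allows(EXAMPLE_RULES, "udp", 7946)` are false. Because ACLs are stateless, the same check has to hold for return traffic in the opposite direction as well.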

michaelbukachi commented 4 years ago

Hi @mayacher, do you mind elaborating on this issue further? I'm experiencing the same thing.