docker-archive / classicswarm

Swarm Classic: a container clustering system. Not to be confused with Docker Swarm which is at https://github.com/docker/swarmkit

Containers unable to communicate with containers of other nodes inside overlay network #2954

Closed thacoon closed 5 years ago

thacoon commented 5 years ago

I have an overlay network named web, but my containers are unable to communicate over it. I have two servers. One server, the manager node, runs a Traefik container. The other server, a worker node, runs my application, a database and an nginx container. If I inspect the network web, both servers appear as peers, but only the local containers are listed under Containers.

I am using Docker version 18.09.8, build 0dd43dd87f.

I set it up as follows:

# On manager node
$ docker swarm init --listen-addr 195.201.130.183 --advertise-addr 195.201.130.183
$ docker stack deploy -c docker-compose.yml traefik
# Create overlay network
$ docker network create --driver=overlay --attachable web
# On worker node
$ docker swarm join --token XXXX --listen-addr 195.201.130.204 --advertise-addr 195.201.130.204 195.201.130.183:2377
$ docker stack deploy -c docker-compose.yml --with-registry-auth mvp

docker network inspect web on manager node

[
    {
        "Name": "web",
        "Id": "ncny4mq03qy0wwwgb0xai9q0m",
        "Created": "2019-07-20T20:59:31.383292487+02:00",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.6.0/24",
                    "Gateway": "10.0.6.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "fd748c0b5129f84beb089df0d107668069292945b9d0d33e036c6dd5354d9407": {
                "Name": "traefik_traefik.1.b6w7xxax84b2t9qcluu14taxf",
                "EndpointID": "a2aa711be2683ff9ca4616960b364468b2cd77d3e414ac16bc04ab5ddb80ba8c",
                "MacAddress": "02:42:0a:00:06:03",
                "IPv4Address": "10.0.6.3/24",
                "IPv6Address": ""
            },
            "lb-web": {
                "Name": "web-endpoint",
                "EndpointID": "28cd2174c7bca52046b84e2aa9c189e7c68df067deb45ced9908145105d14893",
                "MacAddress": "02:42:0a:00:06:04",
                "IPv4Address": "10.0.6.4/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4103"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "4b9d7b36d2ed",
                "IP": "195.201.130.183"
            },
            {
                "Name": "d34f810f558e",
                "IP": "195.201.130.204"
            }
        ]
    }
]

docker network inspect web on worker node

[
    {
        "Name": "web",
        "Id": "ncny4mq03qy0wwwgb0xai9q0m",
        "Created": "2019-07-20T20:59:38.975594371+02:00",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.6.0/24",
                    "Gateway": "10.0.6.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "931d79c943e89d8c44bb18322135c83334c2cba63d4ffa816b9aaae0e3e62b07": {
                "Name": "mvp_nginx.1.teeczuekh5mls9a57n2p1xeoo",
                "EndpointID": "2be9c2d9cc2a5b8c79d38dec3c547d91457dca8e62a5a5cc2c9374070ce506dc",
                "MacAddress": "02:42:0a:00:06:0b",
                "IPv4Address": "10.0.6.11/24",
                "IPv6Address": ""
            },
            "cbe5f893fdcad6e6d3f6b9cc1ab03a31f5724d0b4becd89d95ca4ac513d21bb8": {
                "Name": "mvp_app.1.uxa6kxtvtz1sgw7hit6rfpsmh",
                "EndpointID": "88fb88c6a848b89601c36623ec18f533a13394113cd81fc4e7d99b0aa476f544",
                "MacAddress": "02:42:0a:00:06:06",
                "IPv4Address": "10.0.6.6/24",
                "IPv6Address": ""
            },
            "d3158d911f845ee8101e76bf01b3d4a7a5fb476c5631160329a339e1eaf43e78": {
                "Name": "mvp_postgis.1.nqnu8rly3lmpa57vxlplug58v",
                "EndpointID": "4313692dee7dca00eef355d9f4bf947c3c907710d396363db9bf75fbffe2b405",
                "MacAddress": "02:42:0a:00:06:09",
                "IPv4Address": "10.0.6.9/24",
                "IPv6Address": ""
            },
            "lb-web": {
                "Name": "web-endpoint",
                "EndpointID": "eb70cc6f338cf1ec8140568ee21e7c02fdf349fe80d87727962de32dd9d0e093",
                "MacAddress": "02:42:0a:00:06:07",
                "IPv4Address": "10.0.6.7/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4103"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "4b9d7b36d2ed",
                "IP": "195.201.130.183"
            },
            {
                "Name": "d34f810f558e",
                "IP": "195.201.130.204"
            }
        ]
    }
]
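A note on reading the two outputs above: as far as I know, the Containers map in docker network inspect only lists endpoints on the node where the command is run, so seeing only local containers there does not by itself prove the overlay is broken. Below is a minimal sketch of two checks. The helper names are hypothetical, and it assumes your Docker version supports the --verbose flag of docker network inspect, which should add a cluster-wide view of services and tasks on swarm overlay networks.

```shell
# Hypothetical helper: print "name address" for every container the LOCAL
# engine reports on the given network.
local_containers() {
    docker network inspect "$1" \
        --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{println}}{{end}}'
}

# Hypothetical helper: ask for the verbose, cluster-wide view of the network
# (assumes --verbose is supported by the installed Docker version).
cluster_view() {
    docker network inspect --verbose "$1"
}

# Usage, on either node:
#   local_containers web
#   cluster_view web
```

If the tasks from the other node show up in the verbose output, the overlay itself is probably fine and the problem is more likely in the service configuration.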

I have allowed the required ports. Output of ufw status && netstat -tulpn on the manager node:

Status: active

To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere                  
2376/tcp                   ALLOW       Anywhere                  
2377/tcp                   ALLOW       Anywhere                  
7946/tcp                   ALLOW       Anywhere                  
7946/udp                   ALLOW       Anywhere                  
4789/udp                   ALLOW       Anywhere                  
22/tcp (v6)                ALLOW       Anywhere (v6)             
2376/tcp (v6)              ALLOW       Anywhere (v6)             
2377/tcp (v6)              ALLOW       Anywhere (v6)             
7946/tcp (v6)              ALLOW       Anywhere (v6)             
7946/udp (v6)              ALLOW       Anywhere (v6)             
4789/udp (v6)              ALLOW       Anywhere (v6)

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      4117/systemd-resolv 
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1081/sshd           
tcp        0      0 195.201.130.183:2377    0.0.0.0:*               LISTEN      17789/dockerd       
tcp        0      0 195.201.130.183:7946    0.0.0.0:*               LISTEN      17789/dockerd       
tcp6       0      0 :::8080                 :::*                    LISTEN      17789/dockerd       
tcp6       0      0 :::80                   :::*                    LISTEN      17789/dockerd       
tcp6       0      0 :::22                   :::*                    LISTEN      1081/sshd           
tcp6       0      0 :::443                  :::*                    LISTEN      17789/dockerd       
udp        0      0 195.201.130.183:7946    0.0.0.0:*                           17789/dockerd       
udp        0      0 127.0.0.53:53           0.0.0.0:*                           4117/systemd-resolv 
udp        0      0 0.0.0.0:68              0.0.0.0:*                           800/dhclient        
udp        0      0 0.0.0.0:4789            0.0.0.0:*                           -                   

ufw status && netstat -tulpn on worker node:


To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere                  
2376/tcp                   ALLOW       Anywhere                  
7946/tcp                   ALLOW       Anywhere                  
7946/udp                   ALLOW       Anywhere                  
4789/udp                   ALLOW       Anywhere                  
22/tcp (v6)                ALLOW       Anywhere (v6)             
2376/tcp (v6)              ALLOW       Anywhere (v6)             
7946/tcp (v6)              ALLOW       Anywhere (v6)             
7946/udp (v6)              ALLOW       Anywhere (v6)             
4789/udp (v6)              ALLOW       Anywhere (v6)

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      11355/systemd-resol 
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1090/sshd           
tcp        0      0 195.201.130.204:7946    0.0.0.0:*               LISTEN      18127/dockerd       
tcp6       0      0 :::8080                 :::*                    LISTEN      18127/dockerd       
tcp6       0      0 :::80                   :::*                    LISTEN      18127/dockerd       
tcp6       0      0 :::22                   :::*                    LISTEN      1090/sshd           
tcp6       0      0 :::443                  :::*                    LISTEN      18127/dockerd       
udp        0      0 195.201.130.204:7946    0.0.0.0:*                           18127/dockerd       
udp        0      0 127.0.0.53:53           0.0.0.0:*                           11355/systemd-resol 
udp        0      0 0.0.0.0:68              0.0.0.0:*                           793/dhclient        
udp        0      0 0.0.0.0:4789            0.0.0.0:*                           -                   
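For reference, swarm mode needs TCP 2377 (cluster management, manager nodes), TCP and UDP 7946 (node-to-node communication) and UDP 4789 (overlay/VXLAN traffic). The following is a minimal sketch for checking ufw status output against such a list. The function name missing_ports is hypothetical, and it only looks at the IPv4 rule lines in the format shown above.

```shell
# Hypothetical helper: read `ufw status` text on stdin and print every
# port/proto given as an argument that has no ALLOW rule.
missing_ports() {
    status=$(cat)
    for port in "$@"; do
        # awk exits 0 only if a line has the port in column 1 and ALLOW in column 2
        printf '%s\n' "$status" \
            | awk -v p="$port" '$1 == p && $2 == "ALLOW" { found = 1 } END { exit !found }' \
            || printf '%s\n' "$port"
    done
}

# Usage on a manager node:
#   ufw status | missing_ports 2377/tcp 7946/tcp 7946/udp 4789/udp
# Usage on a worker node (2377/tcp is only needed on managers):
#   ufw status | missing_ports 7946/tcp 7946/udp 4789/udp
```

Empty output means every listed port has an ALLOW rule. Note that this says nothing about firewalls outside the host, such as a provider-level firewall.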

Edit:

thacoon commented 5 years ago

I checked the logs and found a network error:

level=error msg="error reading the kernel parameter net.ipv4.vs.expire_nodest_conn" error="open /proc/sys/net/ipv4/vs/expire_nodest_conn: no such file or directory"

$ journalctl -fu docker.service
-- Logs begin at Fri 2019-07-26 15:28:10 CEST. --
Jul 26 15:37:35 traefik1 dockerd[5674]: time="2019-07-26T15:37:35.056952711+02:00" level=info msg="NetworkDB stats traefik1(00c1cc886d65) - netID:i9grxo7b0xxc7dqj9h6ablmx4 leaving:true netPeers:0 entries:3 Queue qLen:0 netMsg/s:0"
Jul 26 15:37:35 traefik1 dockerd[5674]: time="2019-07-26T15:37:35.059081426+02:00" level=info msg="NetworkDB stats traefik1(00c1cc886d65) - netID:ertvdlvgi0574haq1s1g4ro7n leaving:false netPeers:2 entries:4 Queue qLen:0 netMsg/s:0"
Jul 26 15:42:35 traefik1 dockerd[5674]: time="2019-07-26T15:42:35.256324150+02:00" level=info msg="NetworkDB stats traefik1(00c1cc886d65) - netID:i9grxo7b0xxc7dqj9h6ablmx4 leaving:true netPeers:0 entries:3 Queue qLen:0 netMsg/s:0"
Jul 26 15:42:35 traefik1 dockerd[5674]: time="2019-07-26T15:42:35.257912878+02:00" level=info msg="NetworkDB stats traefik1(00c1cc886d65) - netID:ertvdlvgi0574haq1s1g4ro7n leaving:false netPeers:2 entries:4 Queue qLen:0 netMsg/s:0"
Jul 26 15:44:27 traefik1 dockerd[5674]: time="2019-07-26T15:44:27.094119252+02:00" level=info msg="initialized VXLAN UDP port to 4789 "
Jul 26 15:44:27 traefik1 dockerd[5674]: time="2019-07-26T15:44:27.415494640+02:00" level=error msg="error reading the kernel parameter net.ipv4.vs.expire_nodest_conn" error="open /proc/sys/net/ipv4/vs/expire_nodest_conn: no such file or directory"
Jul 26 15:47:35 traefik1 dockerd[5674]: time="2019-07-26T15:47:35.456526093+02:00" level=info msg="NetworkDB stats traefik1(00c1cc886d65) - netID:i9grxo7b0xxc7dqj9h6ablmx4 leaving:false netPeers:1 entries:6 Queue qLen:0 netMsg/s:0"
Jul 26 15:47:35 traefik1 dockerd[5674]: time="2019-07-26T15:47:35.458949597+02:00" level=info msg="NetworkDB stats traefik1(00c1cc886d65) - netID:ertvdlvgi0574haq1s1g4ro7n leaving:false netPeers:2 entries:6 Queue qLen:0 netMsg/s:0"
Jul 26 15:52:35 traefik1 dockerd[5674]: time="2019-07-26T15:52:35.656514539+02:00" level=info msg="NetworkDB stats traefik1(00c1cc886d65) - netID:i9grxo7b0xxc7dqj9h6ablmx4 leaving:false netPeers:1 entries:6 Queue qLen:0 netMsg/s:0"
Jul 26 15:52:35 traefik1 dockerd[5674]: time="2019-07-26T15:52:35.658840449+02:00" level=info msg="NetworkDB stats traefik1(00c1cc886d65) - netID:ertvdlvgi0574haq1s1g4ro7n leaving:false netPeers:2 entries:6 Queue qLen:0 netMsg/s:0"
Jul 26 15:57:35 traefik1 dockerd[5674]: time="2019-07-26T15:57:35.856568291+02:00" level=info msg="NetworkDB stats traefik1(00c1cc886d65) - netID:i9grxo7b0xxc7dqj9h6ablmx4 leaving:false netPeers:2 entries:16 Queue qLen:0 netMsg/s:0"
Jul 26 15:57:35 traefik1 dockerd[5674]: time="2019-07-26T15:57:35.857931138+02:00" level=info msg="NetworkDB stats traefik1(00c1cc886d65) - netID:ertvdlvgi0574haq1s1g4ro7n leaving:false netPeers:2 entries:6 Queue qLen:0 netMsg/s:0"
Jul 26 16:02:36 traefik1 dockerd[5674]: time="2019-07-26T16:02:36.057021608+02:00" level=info msg="NetworkDB stats traefik1(00c1cc886d65) - netID:i9grxo7b0xxc7dqj9h6ablmx4 leaving:false netPeers:2 entries:16 Queue qLen:0 netMsg/s:0"
Jul 26 16:02:36 traefik1 dockerd[5674]: time="2019-07-26T16:02:36.058673033+02:00" level=info msg="NetworkDB stats traefik1(00c1cc886d65) - netID:ertvdlvgi0574haq1s1g4ro7n leaving:false netPeers:2 entries:6 Queue qLen:0 netMsg/s:0"
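Most of the log above is periodic NetworkDB stats at info level, which makes the single error easy to miss. Here is a small sketch for pulling out only the error-level lines. The function name docker_log_errors is hypothetical, and it assumes the level=... format shown in the log above.

```shell
# Hypothetical helper: keep only dockerd log lines logged at error level.
# Reads log text on stdin and writes matching lines to stdout.
docker_log_errors() {
    grep 'level=error'
}

# Usage:
#   journalctl -u docker.service --no-pager | docker_log_errors
```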

However, if I run ping from a container on the worker node, I can reach the Traefik container on the manager node.

$ docker exec -it b84cbe395874 /bin/bash
$ ping entry_traefik
PING entry_traefik (10.0.0.6): 56 data bytes
64 bytes from 10.0.0.6: seq=0 ttl=64 time=1.360 ms
64 bytes from 10.0.0.6: seq=1 ttl=64 time=0.196 ms
64 bytes from 10.0.0.6: seq=2 ttl=64 time=0.183 ms

Edit:

thacoon commented 5 years ago

I have fixed it. This was NOT a docker swarm issue.

I had misconfigured my docker-compose files. When using Docker Swarm with Traefik, the Traefik labels in the docker-compose file need to be in the deploy: section, like this:

services:
    app:
        image: ...
        ....
        deploy:
            labels:
                - "traefik.docker.network=web"
                ...
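To check where the labels actually ended up, a rough sketch follows. It relies on labels under deploy: being set on the service, while labels at the service's top level in the compose file are set on its containers. The helper names are hypothetical; mvp_app is the service name that the mvp stack above produces for the app service.

```shell
# Hypothetical helper: labels set on the service itself (from deploy.labels).
service_labels() {
    docker service inspect --format '{{json .Spec.Labels}}' "$1"
}

# Hypothetical helper: labels set on the service's containers
# (from a top-level labels: key in the compose file).
container_labels() {
    docker service inspect --format '{{json .Spec.TaskTemplate.ContainerSpec.Labels}}' "$1"
}

# Usage, on a manager node:
#   service_labels mvp_app
#   container_labels mvp_app
```

If the traefik.* labels only appear in the second output, they are probably in the wrong place for Traefik running in swarm mode.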