coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/

Connectivity issues between docker containers in 1745.4.0 #2442

Open philhug opened 6 years ago

philhug commented 6 years ago

Issue Report

Bug

Since updating from 1688.5.3 to 1745.4.0, we have experienced connectivity issues between Docker containers and also from the toolbox. After rolling back to 1688.5.3 the problem disappeared.

```
[root@coreos-4-p ~]# telnet 172.17.0.3 3306
Trying 172.17.0.3...
telnet: connect to address 172.17.0.3: Connection refused
```

Excerpt from `docker inspect`:
    "NetworkSettings": {
        "Bridge": "",
        "SandboxID": "6421f0b72d7365b272a8ff29a08c539bd5446544ca173a38f8e0c0c2dae80844",
        "HairpinMode": false,
        "LinkLocalIPv6Address": "",
        "LinkLocalIPv6PrefixLen": 0,
        "Ports": {
            "3306/tcp": null
        },
        "SandboxKey": "/var/run/docker/netns/6421f0b72d73",
        "SecondaryIPAddresses": null,
        "SecondaryIPv6Addresses": null,
        "EndpointID": "f39ae5bd585eabde5888cc615465ead53d7f6ccdfbb201a950aa4c1ee85411c8",
        "Gateway": "172.17.0.1",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "IPAddress": "172.17.0.3",
        "IPPrefixLen": 16,
        "IPv6Gateway": "",
        "MacAddress": "02:42:ac:11:00:03",
        "Networks": {
            "bridge": {
                "IPAMConfig": null,
                "Links": null,
                "Aliases": null,
                "NetworkID": "6eae708ea144820580570e835d9085a5697e9f3645e507df958a35ed689e8927",
                "EndpointID": "f39ae5bd585eabde5888cc615465ead53d7f6ccdfbb201a950aa4c1ee85411c8",
                "Gateway": "172.17.0.1",
                "IPAddress": "172.17.0.3",
                "IPPrefixLen": 16,
                "IPv6Gateway": "",
                "GlobalIPv6Address": "",
                "GlobalIPv6PrefixLen": 0,
                "MacAddress": "02:42:ac:11:00:03",
                "DriverOpts": null
            }
        }
    }
```

### Container Linux Version ###

```
$ cat /etc/os-release    # back on the working version after rollback
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1688.5.3
VERSION_ID=1688.5.3
BUILD_ID=2018-04-03-0547
PRETTY_NAME="Container Linux by CoreOS 1688.5.3 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
```



### Environment ###

What hardware/cloud provider/hypervisor is being used to run Container Linux?
Exoscale

### Expected Behavior ###
Connectivity between linked containers works.

### Actual Behavior ###
Some containers are unable to connect.
Running `docker exec` into the destination container and then running `telnet 172.17.0.3 3306` there sometimes makes the container reachable again.
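For reference, that workaround amounts to roughly the following (a sketch only; the container name `dbcontainer` is a placeholder, not taken from this report):

```
# Placeholder name; substitute the actual destination container.
docker exec -it dbcontainer sh

# From inside the container, connect to its own bridge IP once:
telnet 172.17.0.3 3306

# Afterwards, connections from other containers to 172.17.0.3:3306 sometimes start working again.
```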

### Reproduction Steps ###
Unclear
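One minimal way to probe container-to-container connectivity on an affected node might look like this (purely a sketch; the `alpine` image, the port, and the container name are arbitrary choices, not from the report):

```
# Throwaway TCP listener on the default bridge network.
docker run -d --name srv alpine sh -c 'while true; do echo ok | nc -l -p 3306; done'

# Find its bridge IP.
SRV_IP=$(docker inspect -f '{{ .NetworkSettings.IPAddress }}' srv)
echo "listener at $SRV_IP"

# Try to reach it from a second container on the same bridge;
# on an affected node this would be expected to fail with a refusal or timeout.
docker run --rm alpine sh -c "echo hi | nc -w 3 $SRV_IP 3306"
```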

### Other Information ###
dghubble commented 6 years ago

Possibly related to #2443

lucab commented 6 years ago

I'm not entirely sure this is related to that MTU-from-DHCP bug; the original report seems to describe node-local connectivity problems.

@philhug can you please provide:

lucab commented 6 years ago

I think this is likely related to https://groups.google.com/forum/#!topic/coreos-user/FSqBD-R_PPI, i.e. a missing `modprobe br_netfilter`.
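For anyone checking this on their own node, the probe and workaround would look roughly like the following (a sketch assuming that thread's diagnosis applies; the sysctl key and modules-load.d path are the standard Linux ones, not taken from this issue):

```
# Check whether the bridge netfilter module is loaded.
lsmod | grep br_netfilter

# This sysctl only exists once br_netfilter is loaded; 1 means bridged traffic traverses iptables.
sysctl net.bridge.bridge-nf-call-iptables

# Load it now, and persist it across reboots.
sudo modprobe br_netfilter
echo br_netfilter | sudo tee /etc/modules-load.d/br_netfilter.conf
```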

lucazz commented 6 years ago

Hello @lucab, unfortunately that issue is still occurring.

`docker version`: https://gist.github.com/lucazz/1585a51845c1eb465827a18c6b70030e

Here's some of the context you had requested from @dghubble on his issue:

`journalctl -u systemd-networkd`: https://gist.github.com/lucazz/9f7e1feb41be26075cb9596bded4466f

`networkctl status -a`: https://gist.github.com/lucazz/477f92f2c39d2ec8bf503c9412a04f7e

lucazz commented 6 years ago

In this case, containers are able to talk to each other but not to the outside world:

```
core@ip-10-33-29-37 ~ $ docker run alpine ping -c 5 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes

--- 8.8.8.8 ping statistics ---
5 packets transmitted, 0 packets received, 100% packet loss
core@ip-10-33-29-37 ~ $
```

lucab commented 6 years ago

@lucazz from your logs it looks like your setup also involves flannel and Calico. If that is the case, it's a fairly complex setup to debug (and perhaps unrelated to the original report). A reproducer on a cleaner node would be helpful, because I suspect your issue lies either in your specific configuration (e.g. security groups, iptables, or Calico policy) or in one of those higher-level components.
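A few generic node-level checks along those lines (nothing here is specific to this cluster; the chain names are the Docker defaults, assumed rather than confirmed):

```
# Is the kernel forwarding packets between interfaces?
sysctl net.ipv4.ip_forward            # expected: 1

# Is there a MASQUERADE rule NATing container traffic to the outside?
sudo iptables -t nat -S POSTROUTING | grep MASQUERADE

# Is the FORWARD chain dropping bridged traffic (DROP policy, missing DOCKER rules)?
sudo iptables -S FORWARD
```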

lucazz commented 6 years ago

Heya @lucab,

The main reason I believe that's not the issue is that, in this same cluster and with the same Canal config, we have a couple of AWS Ubuntu Deep Learning instances that work just fine: https://gist.github.com/lucazz/606c6e3bfe8c4bfe5475e7729214884c

Ubuntu's `docker version`: https://gist.github.com/lucazz/87745dc3b0451decd14df3ef98146c8d

Ubuntu's `networkctl status -a`: https://gist.github.com/lucazz/cc1aa38386b2b9ccf23c77082f7333d2

Ubuntu's `journalctl -u systemd-networkd`: https://gist.github.com/lucazz/d19e3161b7b46080a717f7499393e6ab