coreos / bugs

Issue tracker for CoreOS Container Linux

Connectivity issues between docker containers in 1745.4.0 #2442

Open philhug opened 6 years ago

philhug commented 6 years ago

Issue Report


Since updating from 1688.5.3 to 1745.4.0 we have experienced connectivity issues between docker containers and also from the toolbox. After rolling back to 1688.5.3 the problem disappeared.

    [root@coreos-4-p ~]# telnet 3306
    telnet: connect to address Connection refused

Excerpt from docker inspect:

    "NetworkSettings": {
        "Bridge": "",
        "SandboxID": "6421f0b72d7365b272a8ff29a08c539bd5446544ca173a38f8e0c0c2dae80844",
        "HairpinMode": false,
        "LinkLocalIPv6Address": "",
        "LinkLocalIPv6PrefixLen": 0,
        "Ports": {
            "3306/tcp": null
        },
        "SandboxKey": "/var/run/docker/netns/6421f0b72d73",
        "SecondaryIPAddresses": null,
        "SecondaryIPv6Addresses": null,
        "EndpointID": "f39ae5bd585eabde5888cc615465ead53d7f6ccdfbb201a950aa4c1ee85411c8",
        "Gateway": "",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "IPAddress": "",
        "IPPrefixLen": 16,
        "IPv6Gateway": "",
        "MacAddress": "02:42:ac:11:00:03",
        "Networks": {
            "bridge": {
                "IPAMConfig": null,
                "Links": null,
                "Aliases": null,
                "NetworkID": "6eae708ea144820580570e835d9085a5697e9f3645e507df958a35ed689e8927",
                "EndpointID": "f39ae5bd585eabde5888cc615465ead53d7f6ccdfbb201a950aa4c1ee85411c8",
                "Gateway": "",
                "IPAddress": "",
                "IPPrefixLen": 16,
                "IPv6Gateway": "",
                "GlobalIPv6Address": "",
                "GlobalIPv6PrefixLen": 0,
                "MacAddress": "02:42:ac:11:00:03",
                "DriverOpts": null
            }
        }
    }

### Container Linux Version ###

(Back on the working version.)

    $ cat /etc/os-release
    NAME="Container Linux by CoreOS"
    ID=coreos
    VERSION=1688.5.3
    VERSION_ID=1688.5.3
    BUILD_ID=2018-04-03-0547
    PRETTY_NAME="Container Linux by CoreOS 1688.5.3 (Rhyolite)"
    ANSI_COLOR="38;5;75"
    HOME_URL=""
    BUG_REPORT_URL=""
    COREOS_BOARD="amd64-usr"

### Environment ###

What hardware/cloud provider/hypervisor is being used to run Container Linux?

### Expected Behavior ###
Connectivity between linked containers works.

### Actual Behavior ###
Some containers are unable to connect.
Running `docker exec` into the destination container and then running "telnet 3306" there sometimes makes the container reachable again.

### Reproduction Steps ###

### Other Information ###

dghubble commented 6 years ago

Possibly related to #2443

lucab commented 6 years ago

I'm not entirely sure this is related to that MTU-from-DHCP bug; the original report seems to describe node-local connectivity problems.

@philhug can you please provide:

lucab commented 6 years ago

I think this is likely related to the coreos-user thread !topic/coreos-user/FSqBD-R_PPI, i.e. a missing `modprobe br_netfilter`.
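
If the missing module is indeed the cause, a quick check on an affected node might look like the sketch below. It assumes a POSIX shell and that `/proc/modules` lists loaded modules one per line with the module name first; `module_loaded` is a hypothetical helper name for illustration, not part of any CoreOS tooling.

```shell
#!/bin/sh
# Sketch: report whether a kernel module appears in a /proc/modules-style listing.
# module_loaded is a hypothetical helper, not an existing CoreOS or Docker command.
# Note: a module compiled into the kernel would not show up in /proc/modules.
module_loaded() {
    # $1 = module name, $2 = listing file (defaults to /proc/modules)
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

if module_loaded br_netfilter; then
    echo "br_netfilter is loaded"
else
    echo "br_netfilter is NOT loaded; try: sudo modprobe br_netfilter"
fi
```

If the module turns out to be missing, `sudo modprobe br_netfilter` loads it for the current boot only. A `.conf` file under `/etc/modules-load.d/` that contains the module name is the usual systemd way to load it on every boot.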

lucazz commented 6 years ago

Hello @lucab,

Unfortunately that issue is still going.

docker version:

Here's some context (you've requested @dghubble on his issue):

journalctl -u systemd-networkd:

networkctl status -a:

lucazz commented 6 years ago

In this case, containers are able to talk to each other but not w/ the web:

    core@ip-10-33-29-37 ~ $ docker run alpine ping -c 5
    PING ( 56 data bytes

    --- ping statistics ---
    5 packets transmitted, 0 packets received, 100% packet loss
    core@ip-10-33-29-37 ~ $

lucab commented 6 years ago

@lucazz from your logs it looks like your setup also involves flannel and calico. If that is the case, it's a fairly complex setup to debug (and perhaps unrelated to the original report). A reproducer on a cleaner node would be helpful, because I suspect your issue is due either to your specific configuration (e.g. security groups, iptables or calico policy) or to one of those higher-level components.

lucazz commented 6 years ago

Heya @lucab,

The only reason I believe that's not the issue is that this same cluster, with the same canal configs, also has a couple of AWS Ubuntu Deep Learning instances that work just fine:

Ubuntu's docker version:

Ubuntu's networkctl status -a:

Ubuntu's journalctl -u systemd-networkd: