docker / for-linux

Docker Engine for Linux
https://docs.docker.com/engine/installation/

`docker run` kills network connection to completely independent containers #1034

Open ricklamers opened 4 years ago

ricklamers commented 4 years ago

Expected behavior

Running the docker run hello-world command should not interfere with a different already running container. Specifically another container running a simple web server (using Flask) which handles long running requests.

Actual behavior

In Firefox (77.0.1 (64-bit)), while the request is pending (in this minimal reproducible example, a simple GET request that sleeps 30 seconds before returning), the connection is dropped when docker run hello-world is executed. The request never completes. On Google Chrome (Version 83.0.4103.61 (Official Build) (64-bit)) this issue does not occur.

We ran into this bug in a much more complicated setting, but we created a minimal example to make reproducing easier.

Steps to reproduce the behavior

Build the container: docker build -t minimal-flask .

Start container with: docker run -p 80:80 minimal-flask

Use Firefox to make a request to this running container (at http://127.0.0.1).

While the request is waiting (e.g. after 5 seconds have passed), run docker run hello-world. Observe the request failing in Firefox.

We also saw this happening while performing other basic Docker operations such as: Ctrl + C'ing out of another container or stopping a different container using docker stop <id>.
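To check whether the drop also shows up outside a browser, the request can be issued from a small script while the other Docker command runs. This is only a sketch: the probe function name and the 60-second default timeout are assumptions, not part of the original report, and it says nothing about why Firefox and Chrome differ.

```python
import http.client
import time
import urllib.request


def probe(url, timeout=60.0):
    """Issue one GET and report whether it completed or was dropped."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            outcome = {"ok": True, "status": resp.status, "body": body}
    except (OSError, http.client.HTTPException) as exc:
        # URLError, connection resets and socket timeouts are all OSError
        # subclasses; HTTPException covers truncated or malformed responses.
        outcome = {"ok": False, "error": repr(exc)}
    outcome["elapsed_s"] = round(time.monotonic() - started, 1)
    return outcome
```

Calling print(probe("http://127.0.0.1/")) right before running docker run hello-world would show whether the request still returns after roughly 30 seconds or fails early.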

main.py

from flask import Flask
import time

app = Flask(__name__)

@app.route('/', methods=['GET'])
def index():
    time.sleep(30)
    return 'Hello', 200

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=80)

Dockerfile

FROM python:3

RUN pip install Flask

COPY main.py .

CMD ["python","main.py"]

Output of docker version:

Docker version 19.03.11, build 42e35e61f3

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 4
  Running: 1
  Paused: 0
  Stopped: 3
 Images: 301
 Server Version: 19.03.11
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.3.0-53-generic
 Operating System: Ubuntu 18.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 24
 Total Memory: 15.58GiB
 Name: rick-System-Product-Name
 ID: V2LS:CE6X:UPQH:KWU2:IEA3:BU4J:STCC:BXRW:J5G7:S5AJ:JKXI:QMTR
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: [redacted]
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.)

thaJeztah commented 4 years ago

Do the docker daemon logs (preferably with the daemon in "debug" mode) and/or system logs show any information about things that happen?

Does your machine happen to have NetworkManager (https://help.ubuntu.com/community/NetworkManager) installed? I know of situations where NetworkManager is quite "greedy" and attempts to try managing network interfaces that are created in other network-namespaces

ricklamers commented 4 years ago

Thanks for helping out here @thaJeztah.

I've attached the docker daemon log and /var/log/syslog that capture the moment when the connection is dropped in Firefox.

var-log-syslog.log docker-dameon-log.log

apt list network-manager shows:

rick@rick-System-Product-Name:/var/log$ apt list network-manager
Listing... Done
network-manager/bionic-updates,now 1.10.6-2ubuntu1.4 amd64 [installed,automatic]
N: There are 2 additional versions. Please use the '-a' switch to see them.

I don't recall ever interacting with/installing network-manager so it's probably what you get by default when you install Ubuntu 18.04.4 LTS.

These errors stood out to me in the logs, but I have no clue what they mean:

Jun 10 15:04:03 rick-System-Product-Name libvirtd[1510]: 2020-06-10 13:04:03.553+0000: 1861: error : virFileReadAll:1420 : Failed to open file '/sys/class/net/veth5f8b8a8/operstate': No such file or directory
Jun 10 15:04:03 rick-System-Product-Name libvirtd[1510]: 2020-06-10 13:04:03.553+0000: 1861: error : virNetDevGetLinkInfo:2530 : unable to read: /sys/class/net/veth5f8b8a8/operstate: No such file or directory

ricklamers commented 4 years ago

What puzzles me is that Firefox has this issue while Chrome doesn't in otherwise the very same situation. Which seemed to point to me that there's some sort of network disturbance that Firefox's network stack decides is enough to drop the connection while Chrome seems to kind of "ignore" the disturbance.

thaJeztah commented 4 years ago

> I don't recall ever interacting with/installing network-manager so it's probably what you get by default when you install Ubuntu 18.04.4 LTS.

I suspect it may be installed by default on "desktop" installs, where it makes more sense because it handles (e.g.) switching (WiFi) networks, which would be more "common" on a laptop than on a server.

> These errors stood out to me in the logs, but I have no clue what they mean:

Yes, I've seen such errors in previous issues where NetworkManager was running; what I suspect happens there is that NetworkManager tries to act on every network interface on the machine; containers get their own virtual interface, so when a container is started, NetworkManager tries to take control of that interface, but because it's in the container's namespace, it then fails to find it.

It's possible that because it still detected that interface, it's reconfiguring other interfaces (not sure), which could explain the networking issue.
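To see which interfaces such errors refer to, the names can be pulled out of a syslog capture and compared against the veth devices Docker creates. This is a minimal sketch: the regular expression only targets the operstate message format quoted in this thread, and the function name is made up for illustration.

```python
import re

# Matches the /sys/class/net/<iface>/operstate path in the errors quoted above.
OPERSTATE_RE = re.compile(r"/sys/class/net/(?P<iface>[^/'\s:]+)/operstate")


def missing_interfaces(lines):
    """Return distinct interface names from 'No such file' operstate errors."""
    seen = []
    for line in lines:
        if "No such file or directory" not in line:
            continue
        for match in OPERSTATE_RE.finditer(line):
            iface = match.group("iface")
            if iface not in seen:
                seen.append(iface)
    return seen
```

Feeding it the lines of /var/log/syslog (for example via open("/var/log/syslog")) lists the interfaces that disappeared before they could be read.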

It's worth checking whether (temporarily) disabling NetworkManager solves the issue. I'm not on a Linux machine with NetworkManager installed, but sudo systemctl stop network-manager may work (not sure if it would try to restart itself after that though).

> What puzzles me is that Firefox has this issue while Chrome doesn't in otherwise the very same situation. Which seemed to point to me that there's some sort of network disturbance that Firefox's network stack decides is enough to drop the connection while Chrome seems to kind of "ignore" the disturbance.

That's definitely interesting 🤔

I must admit that I'm horrible at networking, so if things get too complicated 😅. Interested to hear though if the above helps.

I recall that network-manager has a configuration option that allows excluding certain interfaces (wondering if there's a "portable" solution for that to exclude the container interfaces, and if that would help for these setups)
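For reference, the option in question is most likely unmanaged-devices in the [keyfile] section of NetworkManager's configuration, which accepts interface-name: specs with glob patterns. The sketch below only renders such a fragment as text; the docker0 and veth* patterns and the idea of placing the result in a drop-in file under /etc/NetworkManager/conf.d/ are assumptions, and whether this helps with the problem in this issue is untested.

```python
def render_unmanaged_conf(patterns):
    """Render a NetworkManager [keyfile] fragment that marks interfaces unmanaged."""
    specs = ";".join(f"interface-name:{pattern}" for pattern in patterns)
    return f"[keyfile]\nunmanaged-devices={specs}\n"


if __name__ == "__main__":
    # Hypothetical patterns covering Docker's default bridge and veth pairs.
    print(render_unmanaged_conf(["docker0", "veth*"]), end="")
```

NetworkManager would need to reload its configuration before a fragment like this takes effect.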

ricklamers commented 4 years ago

Using sudo systemctl stop NetworkManager.service (your suggestion also appeared to stop network-manager) and validating with sudo systemctl status NetworkManager.service that it's off did not result in any change of behavior in Firefox (it still drops the connection to the container); it also still works in Chrome.

ricklamers commented 4 years ago

Checking in on this issue. How should we proceed?

Could we do something on our end to prevent the problem from occurring?