Exposed ports become unresponsive after heavy load

micdah commented 7 years ago

Michael Friis directed me to submit an issue here (see issue 30400 for more)

I am experiencing an intermittent issue with Docker for Windows where all the exposed ports suddenly become unresponsive and no connection can be made to the containers. This happens when the host machine puts a lot of activity on the containers. I am running 4 containers, plus 11 services on the host machine and a handful of websites and APIs, all of which interact with the containers.

How to reproduce

As requested by Michael Friis, I have made some sample code which seems to be able to reproduce the issue. You can see and clone the code here: github.com/micdah/DockerBomb. I have also made a YouTube video where I demonstrate the issue using my sample code: youtube.com/watch?v=v5k1D60h0zE

I have described how to use the program in the readme.md file in the GitHub repo. Note that it might take anywhere from a few minutes to minutes before the issue triggers. It is somewhat random, likely because it is tightly timing-related.

The sample program creates the requested number of threads, each creating a single connection to the redis container and issuing as many commands as possible until the connection fails.
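
The approach described above can be sketched roughly as follows. This is not the DockerBomb code (which is a .NET program); it is a minimal Python illustration that assumes a Redis container published on localhost:6379 and talks to it with plain inline `PING` commands over a raw socket. The names `hammer` and `run_bombs`, the thread count and the duration are all hypothetical.

```python
import socket
import threading


def hammer(host, port, stop, counts, index):
    """Open one connection and send PING until it fails or `stop` is set."""
    sent = 0
    try:
        with socket.create_connection((host, port), timeout=5) as conn:
            while not stop.is_set():
                conn.sendall(b"PING\r\n")   # Redis inline command
                if not conn.recv(64):       # empty read: peer closed the connection
                    break
                sent += 1
    except OSError:
        pass                                # refused, reset or timed out
    counts[index] = sent


def run_bombs(host="localhost", port=6379, threads=100, seconds=10.0):
    """Start `threads` workers, let them run for `seconds`, return per-thread counts."""
    stop = threading.Event()
    counts = [0] * threads
    workers = [
        threading.Thread(target=hammer, args=(host, port, stop, counts, i))
        for i in range(threads)
    ]
    for worker in workers:
        worker.start()
    stop.wait(seconds)   # nothing sets `stop` before this, so it simply sleeps
    stop.set()
    for worker in workers:
        worker.join()
    return counts
```

Calling `run_bombs(threads=1000, seconds=300)` would approximate a long run. If the exposed port dies part-way through, workers stop early, and any new connection attempt is counted as 0.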

As demonstrated, once the issue has occurred the container becomes unresponsive on the exposed ports, although it is still running. Trying to restart the container results in an input/output error when trying to bind to the host port. In my previous issue report (30400) I also included a netstat dump showing that the restart does not fail because the port is reserved.

Expected behavior

I would expect the container to continue to be accessible via the exposed ports, as long as it is running. If some resource pool (handles, connection pool, etc.) is exhausted, I would expect the container to become responsive again when the resources become available again (for example when stopping the heavy load on the container).

Information

Diagnostic ID

This diagnostic was uploaded just after the issue occurred, reproduced as described above.

30667474-C49F-4185-B957-3A7AE1F38393/2017-01-24_21-44-30

Output of docker version

Client:
 Version:      1.13.0
 API version:  1.25
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Wed Jan 18 16:20:26 2017
 OS/Arch:      windows/amd64

Server:
 Version:      1.13.0
 API version:  1.25 (minimum version 1.12)
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Wed Jan 18 16:20:26 2017
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info

Containers: 5
 Running: 1
 Paused: 0
 Stopped: 4
Images: 6
Server Version: 1.13.0
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.4-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.837 GiB
Name: moby
ID: 5DLJ:7BM4:KTMA:L5UV:ACM5:HJQP:V2W3:ZQXJ:LUS5:XEVE:FJK2:KH5K
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 21
 Goroutines: 28
 System Time: 2017-01-24T20:49:08.8436128Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

mattjanssen commented 7 years ago

Same issue with Win 10 version 1703 build 15063.296 and Docker edge 17.05.0-ce-win11 (12053). No matter what random (unused) port combinations I used, I got the same error.

I fixed it by stopping Docker from the system tray and restarting it. After I fixed the problem I created diagnostic 950F6894-7F6D-4081-BDCE-7B35E19A391B/2017-05-30_16-55-11.

C:\Users\Matt>docker run -p "10392:13293" agaveapi/beanstalkd-console
docker: Error response from daemon: driver failed programming external connectivity on endpoint 
hopeful_mcnulty (fa135ff9192e4bd4f103e5f6128863d174b426483463d32558f332440d5865a4): 
Error starting userland proxy: 
mkdir /port/tcp:0.0.0.0:10392:tcp:172.17.0.2:13293: input/output error.
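
The failing `mkdir /port/...` path in this error appears to encode the requested port mapping (host address and port, then container address and port). A small sketch for pulling that mapping out of a pasted error is shown below. The layout is inferred only from the messages quoted in this thread, not from any documented format, and `parse_proxy_error` is a hypothetical helper.

```python
import re

# Layout inferred from messages in this thread (an assumption, not a spec):
#   mkdir /port/<proto>:<host ip>:<host port>:<proto>:<container ip>:<container port>
PORT_PATH = re.compile(
    r"mkdir /port/(?P<proto>\w+):(?P<host_ip>[\d.]+):(?P<host_port>\d+):"
    r"\w+:(?P<container_ip>[\d.]+):(?P<container_port>\d+)"
)


def parse_proxy_error(message):
    """Return the port mapping named in a userland-proxy error, or None."""
    # Console wrapping can split the message anywhere, so drop line breaks first.
    flattened = re.sub(r"[\r\n]+", "", message)
    match = PORT_PATH.search(flattened)
    if match is None:
        return None
    fields = match.groupdict()
    fields["host_port"] = int(fields["host_port"])
    fields["container_port"] = int(fields["container_port"])
    return fields
```

Applied to the error above, this would report host port 10392 mapped to 172.17.0.2:13293, which makes it easier to compare reports from different commenters.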

alarys commented 7 years ago

Hi, I'm experiencing problems similar to those above. I submitted a diagnostic. A diagnostic was uploaded with id: 7FF77AD4-6196-4C0C-BF18-962C00826605/2017-06-06_14-05-31

My Docker version is 17.03.1-ce, build c6d412e.

I tried to use the latest Edge release, but my docker configurations throw errors when I try to create containers. It seems to be complaining about local drive mappings. Not sure what is going on there.

My containers are not responding as above. As soon as a download begins, and the data transfer ramps up to 1-2Mbps, all my containers stop responding. Restarting docker gets things working again.

I have mitigated the problem somewhat by throttling bandwidth. But even throttling bandwidth to 500 kbps, the problem still resurfaces after a while. I can reliably reproduce by not throttling the bandwidth and kicking off a download.

I'm really quite disappointed with how docker on windows is handling large data throughput. And this seems like an issue that others have experienced too.

tparikka commented 7 years ago

@djs55 @jeanlaurent There hasn't been any input from the Docker team on this issue since April 21. I'm hopeful that if there hasn't been any progress that the community may be able to help by trying out builds or providing additional diagnostics. Thank you!

mikesnare commented 7 years ago

@djs55 @jeanlaurent It's now been close to 4 months with no developer response to what could be argued is a pretty serious -- crippling -- bug. Any updates? I'm using Docker for Windows to spin up a zookeeper and a couple kafka instances and it dies pretty consistently under load and then fails during restart with the same errors others are describing, forcing me to restart docker entirely.

axxag commented 7 years ago

Same issue here, uploaded diagnostic: 0CC3ABDF-040B-4BF0-9D39-B24CAE24F6ED/2017-08-10_19-47-42

Here's the interesting stuff

[19:37:29.451][VpnKit         ][Info   ] Tcp.PCB: ERROR: thread failure; terminating threads and closing connection

[19:37:29.452][VpnKit         ][Error  ] vpnkit.exe: Lwt.async failure (Invalid_argument Lwt.wakeup_result): Raised at file "format.ml", line 241, characters 41-52

[19:37:29.452][VpnKit         ][Info   ] Called from file "format.ml", line 482, characters 6-24

[19:37:29.452][VpnKit         ][Info   ] 

[19:40:29.855][VpnKit         ][Info   ] Tcp.Segment: TCP retransmission on timer seq = 531271309

[19:40:31.855][VpnKit         ][Info   ] Tcp.Segment: TCP retransmission on timer seq = 531271309

[19:40:35.855][VpnKit         ][Info   ] Tcp.Segment: TCP retransmission on timer seq = 531271309

[19:40:43.855][VpnKit         ][Info   ] Tcp.Segment: TCP retransmission on timer seq = 531271309

[19:40:59.857][VpnKit         ][Info   ] Tcp.Segment: TCP retransmission on timer seq = 531271309

[19:41:07.180][VpnKit         ][Error  ] Process died
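
For anyone sifting through their own log, the following is a minimal sketch for splitting lines of this shape into fields and measuring the time between the retransmission messages. The `[time][component][level] message` layout is assumed from the excerpt above, and `parse_log` and `retransmission_gaps` are hypothetical helper names.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt above: [HH:MM:SS.mmm][Component][Level] message
LOG_LINE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\]"
    r"\[(?P<component>[^\]]*)\]\[(?P<level>[^\]]*)\]\s?(?P<message>.*)$"
)


def parse_log(text):
    """Turn matching log lines into dicts; lines that do not match are skipped."""
    entries = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            entries.append({
                "time": datetime.strptime(match["time"], "%H:%M:%S.%f"),
                "component": match["component"].strip(),
                "level": match["level"].strip(),
                "message": match["message"].strip(),
            })
    return entries


def retransmission_gaps(entries):
    """Whole seconds between consecutive 'TCP retransmission' entries."""
    times = [e["time"] for e in entries if "TCP retransmission" in e["message"]]
    return [round((b - a).total_seconds()) for a, b in zip(times, times[1:])]
```

On the excerpt above the gaps come out as 2, 4, 8 and 16 seconds. The doubling looks like an ordinary retransmission backoff for a segment that is never acknowledged, ending with the `Process died` entry.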

jinh-dk commented 7 years ago

I have the same issue in Docker version 17.06.1-ce-win24 (13025), as well as in the last version.

When I execute docker-compose in a PowerShell console, I see: WindowsError: [Error 2] The system cannot find the file specified: u'***********************' Failed to execute script docker-compose

I have also seen this in the Docker daemon log:
[08:50:59.912][VpnKit ][Error ] vpnkit.exe: Hvsock.read: An established connection was aborted by the software in your host machine.

smellinet commented 7 years ago

Hello, I have the same issue with the latest version: Version 17.06.2-ce-win27 (13194) Channel: stable 428bd6c. After a heavy load, the container's network is broken. Stopping and starting the container doesn't solve the problem:

Error response from daemon: driver failed programming external connectivity on endpoint tapo (23bc1c5ec134f7b164eb6c35e810cd89e876d8c8da3b46db4d8685b642f8ac8d): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:5500:tcp:172.17.0.2:5500: input/output error

Diagnostic ID upload: Diagnostics successfully uploaded (C64A9176-3C73-4FBC-B4FA-D4B0017B689C/2017-09-07_10-18-23).

Naragato commented 7 years ago

I can't believe this still isn't a priority to fix. :(

mittork commented 7 years ago

@djs55, can you provide any ETA for this to be fixed? So far Docker for Windows is not usable in a productive way for us and we have to think about workarounds (like using another standalone host and configuring the Docker client to connect to it).

But I ask: how can I trust software for production that is not able to handle a bit more load in the development stage? I know it is related to VPNKit, but anyway...

tparikka commented 7 years ago

@djs55 @jeanlaurent Is there any more information available on this problem, an ETR, or even an updated priority?

TheFamilyRoom commented 6 years ago

We are experiencing the same behavior. Heavy load on a single port and the Docker bridge falls over. The containers are still running but can't be accessed. It seems like this is not a priority for anyone to address, but it is holding us up. Proposed solutions:

run on mac/linux - we will try this next
run less load? - sorta defeats the point.

Anyone else have success getting this to work on Win 10?

micdah commented 6 years ago

Yeah, I have more or less given up on running heavy loads on Docker for Windows. Interestingly, I don't seem to have the same issues now that we are moving our services over to Kubernetes running via Minikube on Windows.

Naturally this environment is just an extra stack on top of Docker, but it seems like Minikube, at least, runs "better" on Windows (using Hyper-V, but it is also possible to use VirtualBox).

SC7639 commented 6 years ago

I'm still experiencing this issue now and again. It happened today and I had to restart docker for windows for a container to use the port again.

tparikka commented 6 years ago

@djs55, @jeanlaurent, can you comment on whether or not this issue has been officially abandoned?

djs55 commented 6 years ago

@tparikka We've not abandoned the issue, but unfortunately other issues have been higher priority recently -- I apologise for the delay.

We're hoping to update the version of the Linux kernel we use to 4.14, which has a newer implementation of Hyper-V sockets which we use for exposing ports. We should be able to drop some of the workarounds for bugs in the previous version and hopefully this will make the whole system more reliable. As part of this update we'll do some general stress testing and attempt to reproduce this issue.

Thanks again for your patience.

SC7639 commented 6 years ago

Thanks for the update

Michal-Svoboda commented 6 years ago

We suffer from the same issue in our project as well. @djs55, is there any schedule for when the new version of Docker using the newer implementation of Hyper-V sockets will be available?

And what is the current status of this issue?

Thanks a lot.

vohtaski commented 6 years ago

Same problem here. Running MariaDB in a Docker container on Windows. After several thousand requests, it dies with "dial tcp 127.0.0.1:33061: getsockopt: connection refused". It would be amazing to have a fix or a workaround.

tparikka commented 6 years ago

@djs55 @jeanlaurent I wanted to check in on this since it's been about 4 months. Is there any update on this issue, and perhaps is there a separate Git issue that's been logged for the Linux kernel version update that you hope will improve stability under load so we can follow it?

sw-carlin commented 6 years ago

This problem seems to have improved in the stable channel: I'm on Docker version 18.03.1-ce and can still run docker commands when the exposed ports of my containers aren't responsive. In the previous version that was not possible.

I am also able to recover from the situation by stopping some of the containers which I guess is freeing up frozen sockets? I'm running 20 containers that compose a microservice ecosystem with lots of traffic moving between them and can trigger the situation by running any of my system integration tests. I will try running the tests from inside the container composition to see if that is a good workaround.

tparikka commented 6 years ago

@djs55, @jeanlaurent it has been over 7 months since the last update. Is there any further information on this issue?

djs55 commented 6 years ago

@tparikka sorry for the delay. There has been some progress: we've started updating the Hyper-V socket implementation used in several of the components to remove a complex (possibly buggy) workaround for bugs in old Windows builds (< 14393). Once this is done we'll update the Hyper-V socket GUIDs that we use and then we can bump the kernel version. These changes will be merged into the development branch gradually -- I'll let you know when there are interesting development builds you can test.

tg73 commented 6 years ago

I've also run into the same or possibly a related issue, in this case using Windows Containers hosted on Windows Server Core 1803. The image is based on jetbrains/teamcity-agent - so the container acts as a build agent for TeamCity. When running a build via the agent running within the container, at some arbitrary point, the container becomes unresponsive. With process isolation, RDP to the host OS also becomes unresponsive and the host eventually reboots. With hyperv isolation, the container becomes unresponsive and then stops, but the host OS stays up and responsive. Builds do sometimes complete, but more often than not they fail. TeamCity server reports a loss of connection to the build agent, and eventually the build is marked as failed.

Having invested quite a lot of time getting the image to have all the tools our builds need, it was disappointing (to say the least) that what seems to be a fundamental virtualization issue renders this approach unusable. In the end I've had to revert to individual Windows Server VMs per agent.

Unfortunately, I don't have further time to fully log this problem and try to produce a minimal test case - so my apologies for not logging a full issue report. I have attached my custom Dockerfile for interest. Just to note also that the lack of --cpus support with docker service create is also a big problem with this use case.

issue.zip

tparikka commented 5 years ago

@djs55 Has there been a release that we could play with to test for improvements?

djs55 commented 5 years ago

@tparikka: there are some changes to port forwarding and Hyper-V sockets in today's stable release candidate build: https://download.docker.com/win/stable/29211/Docker%20for%20Windows%20Installer.exe -- this build is probably worth testing. Let me know if you get a chance to try it!

tparikka commented 5 years ago

I ran my Selenium tests successfully against the release candidate build a few times and it didn't blow up. Since I first posted to this issue, though, I've migrated my test assemblies to .NET Core 2.1 and the underlying framework to .NET Standard 2.0, so my test environment isn't quite the same as when I started looking at it. I'd be interested to hear if others also see the issue resolved. @micdah, @TheFamilyRoom, @smellinet, and others who have reported the issue: any chance you could also try the new build and let us know what results you see?

JPMoresmau commented 5 years ago

I seem to have been getting this issue only recently (since an update of Docker for Windows); it never happened before. Diagnostics FBC58536-F77C-4909-9BBE-918AA324B487/20181213174347

Symptoms are the same: after a few minutes of activity, containers are not reachable from localhost.

So it looks like 29211 actually broke my setup, which was working fine before.

llenrup commented 5 years ago

I only recently started getting a related issue (see #3108). It was working before the last update.

tparikka commented 5 years ago

It took longer after the prerelease provided by @djs55, but I have run into the same issue again:

ERROR: for selenium-hub Cannot start service selenium-hub: driver failed programming external connectivity on endpoint selenium-hub (7f8d436acd812ee3d7ed9e96f1591f5b2fcb0882adb095feeb02bd4e861342ad): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:4444:tcp:172.19.0.2:4444: input/output error

ERROR: Encountered errors while bringing up the project.

Docker Engine 18.09.0 Compose: 1.23.2 Docker Desktop 2.0.0.0-win81 Windows 10 Build 1809 x64

djs55 commented 5 years ago

I've fixed a number of bugs in the port-forwarding code which should make it more stable after load. If you'd like to try an early version of them I have put links to development builds here: https://github.com/docker/for-win/issues/3257#issuecomment-461563065

Let me know if this makes things any better. Thanks for your patience with this issue!

tparikka commented 5 years ago

@djs55 I can't run a deployment build but I have switched to the edge channel on my local instance to look for updates. Which edge release do you anticipate will get the update, or can you post here when it goes live?

tparikka commented 5 years ago

I ran into the issue on this 2.0.2.1-Edge release:

Creating selenium-hub ... error

ERROR: for selenium-hub Cannot start service selenium-hub: driver failed programming external connectivity on endpoint selenium-hub (c90bbfd6b8f3f449da754229ffd9082e1d28112dc28c0c835ec70f591ae36ef3): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:4444:tcp:172.18.0.2:4444: input/output error

ERROR: for selenium-hub Cannot start service selenium-hub: driver failed programming external connectivity on endpoint selenium-hub (c90bbfd6b8f3f449da754229ffd9082e1d28112dc28c0c835ec70f591ae36ef3): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:4444:tcp:172.18.0.2:4444: input/output error

ERROR: Encountered errors while bringing up the project.

Engine 18.09.2 Compose 1.24.0-rc1 Version 2.0.2.1 31274 edge

bennettellis commented 5 years ago

Running a commercial piece of software in a container, exercising the REST API repeatedly. Really frustrating bug. I hit the API maybe 50 times and down goes the port. Recreating the docker container doesn't fix, so it's not the http server on the container causing it (e.g. rate limiting). Disabling firewall, anti-virus etc doesn't fix, so they don't seem to be an issue. Only fix is to restart docker for windows. Oddly enough, the host still reports through PortQryV2 that the localhost port is still being listened to, but any attempt to go the next network layer to communicate with the http endpoint in the container just hangs. Extremely consistent behavior. Very frustrating. I may go ahead and try #3257 related fixes in dev release to see if it helps with my issue and help get this tested and released. Thanks!

bennettellis commented 5 years ago

Just moved to edge v 2.0.3.0 (31778) and voila no more port hanging for me at least. Docker --version reports Docker version 18.09.3, build 774a1f4 for the record. Seems fixed. Thanks!!

michaeladada commented 5 years ago

Since the issue seems to be with the vpnkit, you can bypass it and connect directly to the MobyLinuxVM. Instead of "localhost" use "10.0.75.2". This is the default IP assigned to the MobyLinuxVM. You can see what the IP is by running the following commands:

docker run -it --privileged --pid=host justincormack/nsenter1 /bin/sh
ifconfig

The relevant IP is the one assigned to hvint0
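
To check whether the direct route behaves differently from the forwarded one, a quick TCP probe along these lines may help. It is only a sketch: `probe` and `compare_routes` are hypothetical names, and 10.0.75.2 is the default address mentioned above, which may differ on your machine.

```python
import socket


def probe(host, port, timeout=3.0):
    """Return 'open', 'refused', 'timeout' or 'error' for one TCP connect attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"


def compare_routes(port, hosts=("localhost", "10.0.75.2")):
    """Probe the same port via the forwarded address and via the VM address."""
    return {host: probe(host, port) for host in hosts}
```

For example, `compare_routes(6379)` probes a published Redis port both ways. Note that "open" only means something accepted the connection. As reported earlier in this thread, the host port can still be listening while requests to the container hang, so an application-level request is the more telling check.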

Koricz commented 5 years ago

Hi, I have a similar problem.

Windows 10 local installation: Apache, PHP, MySQL database, Elasticsearch, RabbitMQ
Linux Docker containers: Elasticsearch, RabbitMQ

Results when running a RabbitMQ consumer written in PHP, which should index approximately 50,000 objects / rows from MySQL into the Elasticsearch database:

1) Local installation only (without Docker) - all 50,000 objects are processed.
2) With the current stable release of Docker - approximately 400 requests are processed, then Docker must be restarted to accept any new incoming connections.
3) With Docker 2.0.4.1 (34207) edge - approximately 400 requests are processed, then the connection is reset / closed and the PHP script is terminated, but the containers still accept new incoming connections and no Docker restart is needed.

In the log for v2.0.4.1 (34207) there are messages like:

[00:12:50.383][ApiProxy       ][Info   ] time="2019-05-17T00:12:50+02:00" msg="proxy << GET /v1.25/containers/808a870d6de247ccce06f46e797e49e90bf979cd7e4aff411571636c5e3ca6a2/json (2.0005ms)\n"
[00:12:50.384][ApiProxy       ][Info   ] time="2019-05-17T00:12:50+02:00" msg="proxy << GET /v1.25/containers/8dd4675e193b48d41c8062da9cc493a87ec5c70d77654949960b4727420770eb/json (2.0034ms)\n"
[00:35:21.999][VpnKit         ][Error  ] vpnkit.exe: tcp:0.0.0.0:9200:tcp:172.19.0.3:9200 proxy failed with flow proxy a: attempted to write to a closed flow
[00:36:46.562][VpnKit         ][Error  ] vpnkit.exe: tcp:0.0.0.0:5672:tcp:172.19.0.4:5672 proxy failed with flow proxy a: attempted to write to a closed flow
[00:47:20.992][VpnKit         ][Error  ] vpnkit.exe: Socket.tcp:127.0.0.1:56170.write TCPv4: caught An established connection was aborted by the software in your host machine.
[00:47:20.992][VpnKit         ][Info   ]  returning Eof

tparikka commented 5 years ago

I encountered this defect again today:

ERROR: for selenium-hub Cannot start service selenium-hub: driver failed programming external connectivity on endpoint selenium-hub (e1d864b70783f0b77693bd56cb43ca9176531b72979993550d303f73c143571b): Error starting userland proxy:

ERROR: Encountered errors while bringing up the project.

@djs55 Can you speak to the ongoing issues folks are having?

EDIT: I'm on Docker Desktop 2.0.4.1 build 34207 edge channel, engine 19.03.0-beta3, compose 1.24.0

docker-robott commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale comment. Stale issues will be closed after an additional 30d of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.

/lifecycle stale

tparikka commented 5 years ago

/remove-lifecycle stale

docker-robott commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale comment. Stale issues will be closed after an additional 30d of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.

/lifecycle stale

tparikka commented 4 years ago

/remove-lifecycle stale

@djs55, anything here?

docker-robott commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale comment. Stale issues will be closed after an additional 30d of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.

/lifecycle stale

tparikka commented 4 years ago

Bumping this, @djs55.

/remove-lifecycle stale

djs55 commented 4 years ago

Does anyone have a current repro case that works with a recent stable version of Docker Desktop that they could share with me?

Thanks in advance!

tparikka commented 4 years ago

I had run into it a while ago with Selenium but it's a super unreliable reproduction vector. I tried taking the DockerBomb project and updating it to .NET Core 3.1 but I'm not super versed in Redis so I'm not sure I'm using it right to try and reproduce the issue. @micdah I don't suppose you'd be able to take a look? I did push my revision to the project up to github.com/tparikka/DockerBomb.

micdah commented 4 years ago

@tparikka Just merged your fork into my repo and verified that the code compiles and runs. Alas, I am no longer working on a Windows machine, so I can't say whether the program will still show the bug or not. Anyone running on Windows could try it out with a few thousand "bombs" and see if it still suddenly dies as it did over three years ago.

To run the code, in short, do:

docker-compose up -d
dotnet run --project DockerBomb

tparikka commented 4 years ago

@djs55 I just ran the updated DockerBomb app on my machine (i5-7600K OC@3.8GHz, 16 GB RAM) and despite maxing out my CPU and hitting 3000 threads connecting to a Redis container in Docker I wasn't able to reproduce the issue. It seems to be working on my end. I'm inclined to suggest we leave this issue open long enough for other more recent participants (such as @Koricz) to respond if they have a reproducible scenario for this issue, and let the stalebot close it out if no one responds.

jhnns commented 4 years ago

@tparikka What Docker engine version did you use?

tparikka commented 4 years ago

@jhnns I'm on Docker Desktop 2.2.0.4 Stable Engine 19.03.8 on Windows 10 Pro Version 1909.

jk2K commented 4 years ago

I encountered the same problem and solved it. You can try adjusting the open files and max user processes limits:

# The maximum number of open file descriptors
ulimit -Sn 65535
# The maximum number of processes available to a single user
ulimit -Su 100000
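
Before changing anything, it can be useful to see what the current soft limits are. The comment does not say where the limits were raised, so the sketch below simply assumes a Linux environment with Python available (for example inside a container); `soft_limits` and `describe` are hypothetical helper names, and the `resource` module is not available in Python on Windows.

```python
import resource  # Unix-only standard library module


def soft_limits():
    """Return the current soft limits that `ulimit -Sn` and `ulimit -Su` report."""
    nofile_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    nproc_soft, _ = resource.getrlimit(resource.RLIMIT_NPROC)
    return {"open_files": nofile_soft, "max_user_processes": nproc_soft}


def describe(value):
    """Render a limit the way a person would read it."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report():
    """Format the current soft limits, one per line."""
    return "\n".join(f"{name}: {describe(value)}" for name, value in soft_limits().items())
```

Printing `report()` before and after running the `ulimit` commands in the same shell session shows whether the new values actually took effect for child processes.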
