micdah closed this issue 4 years ago.
Same issue with Win 10 version 1703 build 15063.296 and Docker edge 17.05.0-ce-win11 (12053). No matter what random (unused) port combinations I used, I got the same error.
I fixed it by stopping Docker from the system tray and restarting it. After I fixed the problem I created diagnostic 950F6894-7F6D-4081-BDCE-7B35E19A391B/2017-05-30_16-55-11.
C:\Users\Matt>docker run -p "10392:13293" agaveapi/beanstalkd-console
docker: Error response from daemon: driver failed programming external connectivity on endpoint hopeful_mcnulty (fa135ff9192e4bd4f103e5f6128863d174b426483463d32558f332440d5865a4): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:10392:tcp:172.17.0.2:13293: input/output error.
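The path in this error appears to encode the port-forwarding rule vpnkit was asked to create: protocol, host IP and host port, then protocol, container IP and container port (compare -p "10392:13293" with tcp:0.0.0.0:10392:tcp:172.17.0.2:13293). That layout is inferred from the messages in this thread, not from vpnkit documentation. A minimal Python sketch that extracts those fields from an error or log line, which can help match a failing forward against netstat output; ForwardSpec and parse_forward_spec are hypothetical helpers, and only IPv4 addresses are handled.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed layout, inferred from the errors in this thread (IPv4 only):
#   <proto>:<host_ip>:<host_port>:<proto>:<container_ip>:<container_port>
_SPEC = re.compile(
    r"(?P<proto>tcp|udp):(?P<host_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<host_port>\d+):"
    r"(?P=proto):(?P<container_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<container_port>\d+)"
)


@dataclass(frozen=True)
class ForwardSpec:
    proto: str
    host_ip: str
    host_port: int
    container_ip: str
    container_port: int


def parse_forward_spec(line: str) -> Optional[ForwardSpec]:
    """Return the first forwarding rule found in a log or error line, or None."""
    m = _SPEC.search(line)
    if m is None:
        return None
    return ForwardSpec(
        proto=m["proto"],
        host_ip=m["host_ip"],
        host_port=int(m["host_port"]),
        container_ip=m["container_ip"],
        container_port=int(m["container_port"]),
    )


if __name__ == "__main__":
    err = ("Error starting userland proxy: "
           "mkdir /port/tcp:0.0.0.0:10392:tcp:172.17.0.2:13293: input/output error")
    print(parse_forward_spec(err))
```

The same pattern also matches the later vpnkit log lines of the form "tcp:0.0.0.0:9200:tcp:172.19.0.3:9200 proxy failed".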
Hi, I'm experiencing problems similar to those above. I uploaded a diagnostic with ID 7FF77AD4-6196-4C0C-BF18-962C00826605/2017-06-06_14-05-31.
My Docker version is 17.03.1-ce, build c6d412e.
I tried to use the latest Edge release, but my docker configurations throw errors when I try to create containers. It seems to be complaining about local drive mappings. Not sure what is going on there.
My containers stop responding, as described above. As soon as a download begins and the data transfer ramps up to 1-2 Mbps, all my containers stop responding. Restarting Docker gets things working again.
I have mitigated the problem somewhat by throttling bandwidth, but even with bandwidth throttled to 500 kbps the problem still resurfaces after a while. I can reliably reproduce it by leaving the bandwidth unthrottled and kicking off a download.
I'm really quite disappointed with how docker on windows is handling large data throughput. And this seems like an issue that others have experienced too.
@djs55 @jeanlaurent There hasn't been any input from the Docker team on this issue since April 21. If there hasn't been any progress, I'm hopeful the community may be able to help by trying out builds or providing additional diagnostics. Thank you!
@djs55 @jeanlaurent It's now been close to 4 months with no developer response to what could be argued is a pretty serious -- crippling -- bug. Any updates? I'm using Docker for Windows to spin up ZooKeeper and a couple of Kafka instances, and it dies pretty consistently under load and then fails during restart with the same errors others are describing, forcing me to restart Docker entirely.
Same issue here, uploaded diagnostic: 0CC3ABDF-040B-4BF0-9D39-B24CAE24F6ED/2017-08-10_19-47-42
Here's the interesting stuff
[19:37:29.451][VpnKit ][Info ] Tcp.PCB: ERROR: thread failure; terminating threads and closing connection
[19:37:29.452][VpnKit ][Error ] vpnkit.exe: Lwt.async failure (Invalid_argument Lwt.wakeup_result): Raised at file "format.ml", line 241, characters 41-52
[19:37:29.452][VpnKit ][Info ] Called from file "format.ml", line 482, characters 6-24
[19:37:29.452][VpnKit ][Info ]
[19:40:29.855][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 531271309
[19:40:31.855][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 531271309
[19:40:35.855][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 531271309
[19:40:43.855][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 531271309
[19:40:59.857][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 531271309
[19:41:07.180][VpnKit ][Error ] Process died
I have the same issue in Docker version 17.06.1-ce-win24 (13025), as well as in the last version.
When I run docker-compose in a PowerShell console, I see:
WindowsError: [Error 2] The system cannot find the file specified: u'***********************' Failed to execute script docker-compose
I have also seen this in the Docker daemon log:
[08:50:59.912][VpnKit ][Error ] vpnkit.exe: Hvsock.read: An established connection was aborted by the software in your host machine.
Hello, I have the same issue with the latest version: Version 17.06.2-ce-win27 (13194), Channel: stable, 428bd6c. After a heavy load, the container's network is broken, and stopping/starting the container doesn't solve the problem:
Error response from daemon: driver failed programming external connectivity on endpoint tapo (23bc1c5ec134f7b164eb6c35e810cd89e876d8c8da3b46db4d8685b642f8ac8d): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:5500:tcp:172.17.0.2:5500: input/output error
Diagnostics successfully uploaded with ID C64A9176-3C73-4FBC-B4FA-D4B0017B689C/2017-09-07_10-18-23.
I can't believe this still isn't a priority to fix. :(
@djs55, can you provide any ETA for a fix? So far Docker for Windows is not usable in a productive way for us, and we have to think about workarounds (like using a separate standalone host and configuring the Docker client to connect to it).
But I have to ask: how can I trust software in production that can't handle a bit more load during development? I know it is related to vpnkit, but still...
@djs55 @jeanlaurent Is there any more information available on this problem, an ETR, or even an updated priority?
We are experiencing the same behavior: heavy load on a single port and the Docker bridge falls over. The containers are still running but can't be accessed. It seems like this is not a priority for anyone to address, but it is holding us up. Proposed solutions:
- run on Mac/Linux: we will try this next
- run less load? That sort of defeats the point.
Has anyone else had success getting this to work on Windows 10?
Yeah, I have more or less given up on running heavy loads on Docker for Windows. Interestingly, I don't seem to have the same issues now that we are moving our services over onto Kubernetes running via minikube on Windows.
Naturally this environment is just an extra stack on top of Docker, but it seems like Minikube, at least, runs "better" on Windows (using Hyper-V, though it is also possible to use VirtualBox).
I'm still experiencing this issue now and again. It happened today and I had to restart docker for windows for a container to use the port again.
@djs55, @jeanlaurent, can you comment on whether this issue has been officially abandoned?
@tparikka We've not abandoned the issue, but unfortunately other issues have been higher priority recently -- I apologise for the delay.
We're hoping to update the version of the Linux kernel we use to 4.14, which has a newer implementation of Hyper-V sockets which we use for exposing ports. We should be able to drop some of the workarounds for bugs in the previous version and hopefully this will make the whole system more reliable. As part of this update we'll do some general stress testing and attempt to reproduce this issue.
Thanks again for your patience.
Thanks for the update
We suffer from the same issue in our project as well. @djs55, is there any schedule for when a new version of Docker using the newer implementation of Hyper-V sockets will be available?
And what is the current status of this issue?
Thanks a lot.
Same problem here. I'm running MariaDB in a Docker container on Windows. After several thousand requests, it dies with "dial tcp 127.0.0.1:33061: getsockopt: connection refused". It would be amazing to have a fix or a workaround.
@djs55 @jeanlaurent I wanted to check in on this since it's been about 4 months. Is there any update on this issue, and perhaps is there a separate Git issue that's been logged for the Linux kernel version update that you hope will improve stability under load so we can follow it?
This problem seems to have improved in the stable channel: I'm on Docker version 18.03.1-ce and can still run docker commands when my containers' exposed ports aren't responsive, which was not possible in the previous version.
I am also able to recover from the situation by stopping some of the containers which I guess is freeing up frozen sockets? I'm running 20 containers that compose a microservice ecosystem with lots of traffic moving between them and can trigger the situation by running any of my system integration tests. I will try running the tests from inside the container composition to see if that is a good workaround.
@djs55, @jeanlaurent it has been over 7 months since the last update. Is there any further information on this issue?
@tparikka sorry for the delay. There has been some progress: we've started updating the Hyper-V socket implementation used in several of the components to remove a complex (possibly buggy) workaround for bugs in old Windows builds (< 14393). Once this is done we'll update the Hyper-V socket GUIDs that we use and then we can bump the kernel version. These changes will be merged into the development branch gradually -- I'll let you know when there are interesting development builds you can test.
I've also run into the same or possibly a related issue, in this case using Windows Containers hosted on Windows Server Core 1803. The image is based on jetbrains/teamcity-agent, so the container acts as a build agent for TeamCity. When running a build via the agent running within the container, at some arbitrary point, the container becomes unresponsive. With process isolation, RDP to the host OS also becomes unresponsive and the host eventually reboots. With hyperv isolation, the container becomes unresponsive and then stops, but the host OS stays up and responsive. Builds do sometimes complete, but more often than not they fail. TeamCity server reports a loss of connection to the build agent, and eventually the build is marked as failed.
Having invested quite a lot of time getting the image to have all the tools our builds need, it was disappointing (to say the least) that what seems to be a fundamental virtualization issue renders this approach unusable. In the end I've had to revert to individual Windows Server VMs per agent.
Unfortunately, I don't have more time to fully log this problem and try to produce a minimal test case, so my apologies for not logging a full issue report. I have attached my custom Dockerfile for interest. Also note that the lack of --cpus support with docker service create is a big problem for this use case.
@djs55 Has there been a release that we could play with to test for improvements?
@tparikka: there are some changes to port forwarding and Hyper-V sockets in today's stable release candidate build: https://download.docker.com/win/stable/29211/Docker%20for%20Windows%20Installer.exe -- this build is probably worth testing. Let me know if you get a chance to try it!
I ran my Selenium tests successfully against the release candidate build a few times and it didn't blow up. Since I first posted to this issue though I've migrated my test assemblies to .NET Core 2.1 and the underlying framework to .NET Standard 2.0, so my test environment isn't quite the same as when I started looking at it. I'd be interested to hear if others also see the issue resolved - @micdah, @TheFamilyRoom, @smellinet, others who have reported the issue any chance you could also try the new build and let us know what results you see?
I've only started getting this issue recently (after an update of Docker for Windows); it never happened before. Diagnostics: FBC58536-F77C-4909-9BBE-918AA324B487/20181213174347
Symptoms are the same: after a few minutes of activity containers are not reachable from localhost.
So it looks like 29211 actually broke my setup, which was working fine before.
I only recently started getting a related issue (see #3108). It was working before the last update.
It took longer after the prerelease provided by @djs55, but I have run into the same issue again:
ERROR: for selenium-hub Cannot start service selenium-hub: driver failed programming external connectivity on endpoint selenium-hub (7f8d436acd812ee3d7ed9e96f1591f5b2fcb0882adb095feeb02bd4e861342ad): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:4444:tcp:172.19.0.2:4444: input/output error ERROR: Encountered errors while bringing up the project.
Docker Engine 18.09.0 Compose: 1.23.2 Docker Desktop 2.0.0.0-win81 Windows 10 Build 1809 x64
I've fixed a number of bugs in the port-forwarding code which should make it more stable after load. If you'd like to try an early version of them I have put links to development builds here: https://github.com/docker/for-win/issues/3257#issuecomment-461563065
Let me know if this makes things any better. Thanks for your patience with this issue!
@djs55 I can't run a deployment build but I have switched to the edge channel on my local instance to look for updates. Which edge release do you anticipate will get the update, or can you post here when it goes live?
I ran into the issue on this 2.0.2.1-Edge release:
Creating selenium-hub ... error
ERROR: for selenium-hub Cannot start service selenium-hub: driver failed programming external connectivity on endpoint selenium-hub (c90bbfd6b8f3f449da754229ffd9082e1d28112dc28c0c835ec70f591ae36ef3): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:4444:tcp:172.18.0.2:4444: input/output error
ERROR: for selenium-hub Cannot start service selenium-hub: driver failed programming external connectivity on endpoint selenium-hub (c90bbfd6b8f3f449da754229ffd9082e1d28112dc28c0c835ec70f591ae36ef3): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:4444:tcp:172.18.0.2:4444: input/output error ERROR: Encountered errors while bringing up the project.
Engine 18.09.2 Compose 1.24.0-rc1 Version 2.0.2.1 31274 edge
I'm running a commercial piece of software in a container and exercising its REST API repeatedly. Really frustrating bug: I hit the API maybe 50 times and down goes the port. Recreating the Docker container doesn't fix it, so it's not the HTTP server in the container causing it (e.g. rate limiting). Disabling the firewall, antivirus, etc. doesn't fix it either, so they don't seem to be the issue. The only fix is to restart Docker for Windows. Oddly enough, PortQryV2 on the host still reports that the localhost port is being listened on, but any attempt to go to the next network layer and communicate with the HTTP endpoint in the container just hangs. Extremely consistent behavior. Very frustrating. I may go ahead and try the #3257-related fixes in the dev release to see if they help with my issue, and help get them tested and released. Thanks!
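The symptom described here (the host port still shows as listening, but requests to the container hang) can be told apart from a plain "connection refused" with a small probe: connect with a timeout, send a request, and check whether any bytes come back. A minimal Python sketch, assuming an HTTP-style endpoint; the result labels, the HEAD request payload, the timeout and the port in the usage line are this sketch's own choices, and "no-reply" only means nothing arrived within the timeout.

```python
import socket


def probe(host: str, port: int, payload: bytes = b"HEAD / HTTP/1.0\r\n\r\n",
          timeout: float = 3.0) -> str:
    """Classify a TCP endpoint. The labels are this sketch's own:
    'refused', 'connect-timeout', 'error', 'no-reply', 'closed' or 'ok'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"          # nothing is listening on the host port
    except socket.timeout:
        return "connect-timeout"  # the connection attempt was never answered
    except OSError:
        return "error"            # e.g. host or network unreachable
    with sock:
        try:
            sock.sendall(payload)
            data = sock.recv(1)   # wait for the first byte of any reply
        except socket.timeout:
            return "no-reply"     # port accepted the connection, but nothing came back
        except ConnectionError:
            return "closed"       # connection reset or broken pipe
    return "ok" if data else "closed"  # b"" means the peer closed the connection


if __name__ == "__main__":
    # Hypothetical target: use the host port your container publishes.
    print(probe("localhost", 8080))
```

Running it in a loop while the API is being exercised would show whether the port moves from "ok" to "no-reply" (accepted but hanging, as reported here) rather than to "refused".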
Just moved to edge v2.0.3.0 (31778) and voila, no more port hanging, for me at least. For the record, docker --version reports Docker version 18.09.3, build 774a1f4. Seems fixed. Thanks!!
Since the issue seems to be with vpnkit, you can bypass it and connect directly to the MobyLinuxVM: instead of "localhost", use "10.0.75.2", which is the default IP assigned to the MobyLinuxVM. You can check the IP by running the following commands:
docker run -it --privileged --pid=host justincormack/nsenter1 /bin/sh
ifconfig
The relevant IP is the one assigned to hvint0
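To see whether a stuck port only fails on the vpnkit (localhost) path, the same published port can be tried through both addresses. A minimal Python sketch, assuming 10.0.75.2 is the VM address as described above (confirm it via hvint0) and that the published port is reachable on that address; answers and compare_paths are hypothetical helpers, and the default PING payload assumes a Redis container, which answers that inline command.

```python
import socket
from typing import Dict, Iterable


def answers(host: str, port: int, payload: bytes, timeout: float) -> bool:
    """True if host:port accepts a TCP connection and sends back at least one byte."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(payload)
            return sock.recv(1) != b""
    except OSError:  # refused, timed out, reset, unreachable, ...
        return False


def compare_paths(port: int,
                  hosts: Iterable[str] = ("localhost", "10.0.75.2"),
                  payload: bytes = b"PING\r\n",
                  timeout: float = 2.0) -> Dict[str, bool]:
    """Try the same published port via each address and report which ones answer."""
    return {host: answers(host, port, payload, timeout) for host in hosts}


if __name__ == "__main__":
    # 10.0.75.2 is the default MobyLinuxVM address mentioned above; check hvint0 if unsure.
    # Port 6379 is an assumption (a Redis container).
    print(compare_paths(6379))
```

A result where "localhost" is False but the VM address is True would be consistent with the problem sitting in the forwarding layer rather than in the container.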
Hi, I have a similar problem.
Windows 10 local installation: Apache, PHP, MySQL database, Elasticsearch, RabbitMQ
Linux Docker containers: Elasticsearch, RabbitMQ
Results when running a RabbitMQ consumer written in PHP that should index approximately 50,000 objects/rows from MySQL into Elasticsearch:
1) Local installation only (without Docker): all 50,000 objects are processed.
2) With the current stable Docker release: approximately 400 requests are processed, then Docker must be restarted to accept any new incoming connections.
3) With Docker 2.0.4.1 (34207) edge: approximately 400 requests are processed, then the connection is reset/closed and the PHP script terminates, but the containers still accept new incoming connections and no Docker restart is needed.
In the v2.0.4.1 (34207) log there are messages like:
[00:12:50.383][ApiProxy ][Info ] time="2019-05-17T00:12:50+02:00" msg="proxy << GET /v1.25/containers/808a870d6de247ccce06f46e797e49e90bf979cd7e4aff411571636c5e3ca6a2/json (2.0005ms)\n"
[00:12:50.384][ApiProxy ][Info ] time="2019-05-17T00:12:50+02:00" msg="proxy << GET /v1.25/containers/8dd4675e193b48d41c8062da9cc493a87ec5c70d77654949960b4727420770eb/json (2.0034ms)\n"
[00:35:21.999][VpnKit ][Error ] vpnkit.exe: tcp:0.0.0.0:9200:tcp:172.19.0.3:9200 proxy failed with flow proxy a: attempted to write to a closed flow
[00:36:46.562][VpnKit ][Error ] vpnkit.exe: tcp:0.0.0.0:5672:tcp:172.19.0.4:5672 proxy failed with flow proxy a: attempted to write to a closed flow
[00:47:20.992][VpnKit ][Error ] vpnkit.exe: Socket.tcp:127.0.0.1:56170.write TCPv4: caught An established connection was aborted by the software in your host machine.
[00:47:20.992][VpnKit ][Info ] returning Eof
I encountered this defect again today: ERROR: for selenium-hub Cannot start service selenium-hub: driver failed programming external connectivity on endpoint selenium-hub (e1d864b70783f0b77693bd56cb43ca9176531b72979993550d303f73c143571b): Error starting userland proxy: ERROR: Encountered errors while bringing up the project.
@djs55 Can you speak to the ongoing issues folks are having?
EDIT: I'm on docker Desktop 2.0.4.1 build 34207 edge channel, engine 19.03.0-beta3, compose 1.24.0
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale comment.
Stale issues will be closed after an additional 30d of inactivity.
Prevent issues from auto-closing with an /lifecycle frozen comment.
If this issue is safe to close now please do so.
Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.
/lifecycle stale
/remove-lifecycle stale
/remove-lifecycle stale
@djs55, anything here?
Bumping this, @djs55.
/remove-lifecycle stale
Does anyone have a current repro case that works with a recent stable version of Docker Desktop that they could share with me?
Thanks in advance!
I had run into it a while ago with Selenium but it's a super unreliable reproduction vector. I tried taking the DockerBomb project and updating it to .NET Core 3.1 but I'm not super versed in Redis so I'm not sure I'm using it right to try and reproduce the issue. @micdah I don't suppose you'd be able to take a look? I did push my revision to the project up to github.com/tparikka/DockerBomb.
@tparikka Just merged your fork into my repo and verified that the code compiles and runs. Alas I am not anymore working from a Windows machine so I can’t say whether the program will still show the bug or not - but for anyone running on windows, they could try it out with a few thousand “bombs” and see if it still suddenly dies as it did over three years ago.
To run the code, in short, do:
docker-compose up -d
dotnet run --project DockerBomb
@djs55 I just ran the updated DockerBomb app on my machine (i5-7600K OC@3.8GHz, 16 GB RAM) and despite maxing out my CPU and hitting 3000 threads connecting to a Redis container in Docker I wasn't able to reproduce the issue. It seems to be working on my end. I'm inclined to suggest we leave this issue open long enough for other more recent participants (such as @Koricz) to respond if they have a reproducible scenario for this issue, and let the stalebot close it out if no one responds.
@tparikka What Docker engine version did you use?
@jhnns I'm on Docker Desktop 2.2.0.4 Stable Engine 19.03.8 on Windows 10 Pro Version 1909.
I encountered the same problem and solved it; you can try adjusting the open files and max user processes limits:
# The maximum number of open file descriptors
ulimit -Sn 65535
# The maximum number of processes available to a single user
ulimit -Su 100000
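The same limits can be inspected and raised from inside a process with Python's resource module (Unix only), which is handy for checking what a load-generating client or service actually got. A minimal sketch: RLIMIT_NOFILE corresponds to ulimit -n and RLIMIT_NPROC to ulimit -u, and an unprivileged process can only raise its soft limit up to the hard limit, so the requested value is clamped. raise_soft_limit is a hypothetical helper; whether these limits were the cause in this thread is the commenter's report, not something the sketch establishes.

```python
import resource


def raise_soft_limit(which: int, wanted: int) -> int:
    """Raise the soft limit for `which` towards `wanted`, never above the hard
    limit and never lowering it; return the soft limit now in effect."""
    soft, hard = resource.getrlimit(which)
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft  # already at least as high as requested
    resource.setrlimit(which, (target, hard))
    return resource.getrlimit(which)[0]


if __name__ == "__main__":
    # Mirrors "ulimit -Sn 65535" and "ulimit -Su 100000" from the comment above.
    print("open files:", raise_soft_limit(resource.RLIMIT_NOFILE, 65535))
    print("user processes:", raise_soft_limit(resource.RLIMIT_NPROC, 100000))
```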
Michael Friis directed me to submit an issue here (see issue 30400 for more)
I am experiencing an intermittent issue with Docker for Windows where suddenly all the exposed ports become unresponsive and no connection can be made to the containers. This happens when the containers are put under heavy load from the host machine. I am running 4 containers and 11 services on the host machine, as well as a handful of websites and APIs, all of which interact with the containers.
How to reproduce
As requested by Michael Friis, I have made some sample code which seems to be able to reproduce the issue. You can see and clone the code here github.com/micdah/DockerBomb. I have also made a YouTube video where I demonstrate the issue using my sample code youtube.com/watch?v=v5k1D60h0zE
I have described how to use the program in the readme.md file in the GitHub repo. Note that it can take a few minutes or more before the issue triggers; the timing is somewhat random, likely because the issue is timing-sensitive.
The sample program creates the requested number of threads, each creating a single connection to the redis container and issuing as many commands as possible until the connection fails.
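The load pattern described above (N threads, one connection each, commands in a tight loop until the connection fails) can be sketched with the Python standard library. This is not the DockerBomb code, which is .NET; it sends Redis inline PING commands and counts "+PONG" replies, and the host, port, thread count and the max_commands cap (added so a run terminates) are assumptions. bomb and bomb_worker are hypothetical names.

```python
import socket
import threading
from typing import List


def bomb_worker(host: str, port: int, max_commands: int, timeout: float) -> int:
    """Open one connection and send PINGs until it fails; return replies received."""
    done = 0
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock, \
                sock.makefile("rb") as reader:
            while done < max_commands:
                sock.sendall(b"PING\r\n")  # Redis inline command
                if reader.readline() != b"+PONG\r\n":
                    break  # unexpected reply or connection closed
                done += 1
    except OSError:
        pass  # refused, reset or timed out: this worker stops
    return done


def bomb(host: str = "localhost", port: int = 6379, threads: int = 100,
         max_commands: int = 10_000, timeout: float = 5.0) -> List[int]:
    """Run `threads` workers concurrently and return each one's reply count."""
    results = [0] * threads

    def run(i: int) -> None:
        results[i] = bomb_worker(host, port, max_commands, timeout)

    workers = [threading.Thread(target=run, args=(i,)) for i in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    counts = bomb()
    print(f"{sum(counts)} replies, {counts.count(0)} workers never got a reply")
```

Workers whose counts fall well short of max_commands indicate connections that failed mid-run, which is the point at which the exposed port would be checked.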
As demonstrated, once the issue has occurred the container becomes unresponsive on the exposed ports, although it is still running. Trying to restart the container results in an input/output error when binding to the host port. In my previous issue report (30400) I also included a netstat dump showing that the restart does not fail because the port is already reserved.
Expected behavior
I would expect the container to continue to be accessible via the exposed ports, as long as it is running. If some resource pool (handles, connection pool, etc.) is exhausted, I would expect the container to become responsive again when the resources become available again (for example when stopping the heavy load on the container).
Information
Diagnostic ID: this diagnostic was uploaded just after the issue occurred, reproduced as described above.
Output of docker version
Output of docker info