docker / for-win

Bug reports for Docker Desktop for Windows
https://www.docker.com/products/docker#/windows

Docker network connection time outs to host over time #8861

Open rg9400 opened 3 years ago

rg9400 commented 3 years ago

Expected behavior

I would expect services running inside Docker containers on the WSL 2 backend to be able to communicate reliably with applications running on the host, even with frequent polling.

Actual behavior

Due to https://github.com/docker/for-win/issues/8590, I have to run some applications that require high download speeds on the host. I have multiple applications inside Docker containers, running inside a Docker bridge network, that poll this host application every few seconds. When WSL is first launched, the containers communicate with the host reliably, but the connection deteriorates over time, and after 1-2 days I see frequent "connection timed out" responses from the application running on the host. Running wsl --shutdown and restarting the Docker daemon fixes the issue temporarily. Moving applications out of Docker and onto the host fixes their communication issues as well. It may be related to the overall network issues linked above.

To be clear, it can still connect. It just starts timing out more and more often the longer the network/containers have been up.

Information

I have had this problem ever since starting to use Docker for Windows with the WSL2 backend.

Steps to reproduce the behavior

  1. Run an application on the Windows host. I tried with NZBGet (host IP: 192.168.1.2)
  2. Poll this application from within a Docker container inside a Docker bridge network living within WSL2. I polled 192.168.1.2:6789 every few seconds (a minimal polling sketch follows these steps)
  3. Check back in a day to see if the connection is timing out more frequently.
  4. Restart WSL and the Docker daemon, and notice that the connection is suddenly more reliable, though it will begin to deteriorate over time again.
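
If it helps anyone reproduce this, here is a minimal sketch of the polling loop described in step 2 (assumptions: it runs inside any container on the bridge network, and 192.168.1.2:6789 is the host application from step 1). It just opens a TCP connection on a fixed interval and counts how often that times out:

```python
# Minimal polling sketch: open a TCP connection to the host application every
# few seconds and count how often the connect times out or errors.
import socket
import time

HOST, PORT = "192.168.1.2", 6789  # host application from step 1
INTERVAL = 5                      # poll every few seconds, as in step 2
TIMEOUT = 10                      # seconds before we treat the connect as stuck

attempts = failures = 0
while True:
    attempts += 1
    start = time.time()
    try:
        with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
            pass  # connection established; good enough for a health poll
    except OSError:
        failures += 1
        print(f"{time.strftime('%H:%M:%S')} connect failed after "
              f"{time.time() - start:.1f}s ({failures}/{attempts} attempts failed)")
    time.sleep(INTERVAL)
```

On a healthy setup the failure count stays near zero; after a day or two of uptime it starts climbing.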
gregfrog commented 1 year ago

Thanks for the update, and I will file a support request. Apart from anything else, their answer ignores that this is an issue for users who aren't on Windows. Just restarting Docker Desktop all the time isn't an acceptable workaround IMO.

I suspect I am running into this at the moment. If having to restart the VM that Docker runs in, rebooting in essence, is not a blocker, then what is? Hardware damage?

tristanbrown commented 1 year ago

This is absolutely a blocker for me, as I cannot run scheduled tasks reliably.

roele commented 1 year ago

The following workaround resolved the issue for me https://emerle.dev/2022/05/06/the-nasty-gotcha-in-docker/
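
For reference, the change that post describes (as I understand it) is disabling vpnkit's idle port timeout: set vpnKitMaxPortIdleTime to 0 in Docker Desktop's settings.json and restart Docker Desktop. A rough sketch, assuming the default settings.json locations; you can just as easily edit the file by hand:

```python
# Sketch of the settings.json tweak from the linked post (paths are the usual
# defaults and may differ on your install; the key is vpnKitMaxPortIdleTime,
# and 0 disables the idle timeout that otherwise drops quiet forwarded ports).
import json
import os
from pathlib import Path

# Windows default: %APPDATA%\Docker\settings.json
# macOS default:   ~/Library/Group Containers/group.com.docker/settings.json
settings = Path(os.environ["APPDATA"]) / "Docker" / "settings.json"

data = json.loads(settings.read_text())
data["vpnKitMaxPortIdleTime"] = 0  # 300 seconds by default, I believe
settings.write_text(json.dumps(data, indent=2))
print(f"Updated {settings}; restart Docker Desktop for the change to take effect")
```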

acedanger commented 1 year ago

The following workaround resolved the issue for me https://emerle.dev/2022/05/06/the-nasty-gotcha-in-docker/

Adding an archive in case the post or site goes down.

https://archive.ph/fk6dC

nk9 commented 1 year ago

While this is useful information, I am not sure that it's actually related to this bug. The error described in the post is "Connection reset by peer." However, the problem in this issue is "Connection timed out." The exact error may differ depending on which software you're using, but the key thing is that you send packets that simply never arrive. The connection isn't reset; it just stops moving data and effectively becomes /dev/null.

There are reproduction steps here, and I'm happy to be proven wrong. If someone can run the Python reproduction above and confirm that the problem doesn't occur on recent versions of Docker Desktop with the idle time set to 0, then I'll stand corrected. But @rg9400 spoke with Docker themselves, who acknowledged the problem and said they didn't have a fix. If the solution was as easy as changing vpnKitMaxPortIdleTime, surely they would have mentioned that.

If you would like changes in the behavior of vpnKitMaxPortIdleTime, I suggest you open a different issue.

robertnisipeanu commented 1 year ago

I also replied a few months ago with that fix, and my problem was a connection timeout for an nginx reverse proxy and the PING command, not a connection reset.

tristanbrown commented 1 year ago

I'm thinking this is a port saturation issue, similar to what's described here. I recently restarted my Docker service, but once the problem crops up again, I'll try going through some of these troubleshooting steps.
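
If anyone else wants to check the port-saturation theory (this is my assumption, not a confirmed diagnosis), here's a rough sketch: count TCP sockets by state inside the WSL distro or the affected container, and see whether TIME_WAIT or SYN_SENT entries pile up around the time the timeouts start.

```python
# Rough check for ephemeral-port saturation: tally TCP socket states from
# /proc/net/tcp{,6}. A large pile of TIME_WAIT or SYN_SENT sockets when the
# timeouts appear would support the port-exhaustion theory.
from collections import Counter
from pathlib import Path

# Subset of state codes from the kernel's tcp_states.h
STATES = {"01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
          "06": "TIME_WAIT", "08": "CLOSE_WAIT", "0A": "LISTEN"}

counts = Counter()
for proc_file in ("/proc/net/tcp", "/proc/net/tcp6"):
    path = Path(proc_file)
    if not path.exists():
        continue
    for line in path.read_text().splitlines()[1:]:  # skip the header row
        state_code = line.split()[3]
        counts[STATES.get(state_code, state_code)] += 1

for state, count in counts.most_common():
    print(f"{state:12} {count}")
```

ss or netstat give the same information; this just avoids needing them installed in a minimal container.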

BenjaminPelletier commented 1 year ago

I'm about 90% sure this issue applies to me as well, but it's devilishly difficult to tell for sure. In the observations below, I'll refer to a reproduction tool that I wrote:

  1. The issue appears to happen about once every 10^1 continuous integration invocations on a project I work on, and each continuous integration run probably has 10^3-10^4 HTTP requests sent between containers on the same GitHub Actions Linux cloud VM
  2. The issue also happens on my development machine, a laptop with MacOS Ventura 13.2.1
  3. All requests I have observed this issue with have been addressed to host.docker.internal, perhaps mainly because nearly all of my requests are addressed there; while troubleshooting, I was unable to reproduce it when sending requests to an IP address (using Docker's default bridge network) or to a service name (using a custom bridge network created for the purpose) -- see the reproduction repo for more notes.
  4. The rate of occurrence varies a lot, and not according to any pattern I've been able to identify. The past week, I've had a connection timeout within 10^1-10^2 requests on my development machine with that rate persisting through a laptop reboot. After creating a Docker network to (unsuccessfully) attempt reproduction with containers communicating through that network, not only did the issue not occur using the custom bridge network, but the issue also vanished entirely -- my 100% reliable method of reproduction went to 0%.
  5. The issue does not depend on long handlers; I could reliably (at a ~10% rate) reproduce the issue sending queries to an unconfigured nginx container
  6. The issue does not depend on long timeouts; my simple reproduction used 5-second timeouts
  7. The issue does not depend on a long-running container; I could rm -f the client+server containers, start a new client container with a slightly different image, and have the issue reproducing within the first 100 requests at one time on my laptop
  8. The issue does not depend on external network traversal; all my observations have been for requests between containers on the same system using host.docker.internal (a rough sketch of this request pattern follows the list).
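
For concreteness, here is a rough sketch of that request pattern, not the actual reproduction tool (the port is an assumption: an unconfigured nginx container publishing port 8080 on the host, with this loop running in a second container on the same machine). It hammers host.docker.internal with short-timeout requests and reports any that hang rather than fail fast:

```python
# Send many short-timeout requests to host.docker.internal and report any
# that time out or error instead of completing quickly.
import time
import urllib.error
import urllib.request

URL = "http://host.docker.internal:8080/"  # hypothetical published nginx port
TIMEOUT = 5                                # short timeout, per observation 6

for i in range(1, 1001):
    start = time.time()
    try:
        urllib.request.urlopen(URL, timeout=TIMEOUT).read()
    except urllib.error.HTTPError:
        pass  # an nginx error page still means the connection itself worked
    except Exception as exc:
        print(f"request {i}: {type(exc).__name__} after {time.time() - start:.1f}s")
```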
mirrorspock commented 1 year ago

We are running Docker version 20.10.22, build 3a2c30b, on Ubuntu 22.04.2 LTS and are experiencing the same issue.

We are running a Node-RED flow that queries an MSSQL server every 5 minutes; at random, the connection to the SQL server hits a 30000 ms timeout, while the next attempt succeeds.

tutcugil commented 1 year ago

We are experiencing the same issue: roughly every 10 minutes, SQL queries from our containers get slower, and then the problem resolves until the next 10-minute period.

Docker Desktop v4.17.0, Windows Server 2022, WSL2 1.0.3.0 backend

Is there any update on this?

rhux commented 1 year ago

The following workaround resolved the issue for me https://emerle.dev/2022/05/06/the-nasty-gotcha-in-docker/

I had also been experiencing this for several months. Doing this workaround appears to have fixed the issue.

ganeshkrishnan1 commented 1 year ago

Got this issue on Windows 11 with WSL and Docker version 23.0.3, build 3e7cbfd.

We are running a server, so this error becomes untenable.

nk9 commented 1 year ago

Please note that an experimental build of vpnkit has been released in this parallel issue, which attempts to resolve what may be the underlying problem here. Users experiencing this should install the experimental builds if possible and report back to @djs55 in the vpnkit issue on whether the problem is resolved and whether you notice any side effects.

rg9400 commented 1 year ago

Per my testing of the experimental build, the issue is significantly improved but not resolved. There are still timeouts, just a lot less. When running thousands of curls, I still notice stuck handshakes that don't instantly close but take a minute or two to resolve. The difference is that most such instances do clear out before the timeout.

I just wanted to confirm that connections still get stuck, even though the overall symptoms are much better.
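
For reference, this is roughly the kind of check I mean, not my exact script (the host and port are the ones from my original steps): time each connection attempt and flag any handshake that takes more than a few seconds.

```python
# Time each TCP connect to the host application and flag slow handshakes.
# Healthy connects finish in milliseconds; "stuck" ones sit in the handshake
# for a minute or two before completing or erroring out.
import socket
import time

HOST, PORT = "192.168.1.2", 6789  # host application from the original steps
SLOW = 5.0                        # flag anything slower than this (seconds)

for i in range(1, 5001):
    start = time.time()
    try:
        with socket.create_connection((HOST, PORT), timeout=180):
            elapsed = time.time() - start
    except OSError:
        print(f"conn {i}: failed after {time.time() - start:.1f}s")
        continue
    if elapsed > SLOW:
        print(f"conn {i}: handshake took {elapsed:.1f}s")
```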

Junto026 commented 7 months ago

I believe I am facing this same problem on MacOS Sonoma 14.1.1, running Docker Desktop for Mac (Apple Silicon) 4.25.2.

I want to try downgrading to 4.5.0 (it's insane that this issue has been going on that long). Does anybody have an install file? The oldest version available here is 4.9.1.

EDIT: Docker Desktop for MacOS (Apple Silicon) can be downloaded here.

EDIT2: Confirmed, downgrading fixed the issue. I’ve been running with stable connections for weeks now.

sorcer1122 commented 1 week ago

Facing the same issue on Debian 12. Checked the ufw logs and whitelisted the container's IP address with sudo ufw allow from 172.17.0.2; this fixed it.