docker / for-win

Bug reports for Docker Desktop for Windows
https://www.docker.com/products/docker#/windows

Docker network connection time outs to host over time #8861

Open rg9400 opened 3 years ago

rg9400 commented 3 years ago

Expected behavior

I would expect services running inside Docker containers in a WSL backend to be able to reliably communicate with applications running on the host, even with frequent polling

Actual behavior

Due to https://github.com/docker/for-win/issues/8590, I have to run some applications that require high download speeds on the host. I have multiple applications inside Docker containers running inside a Docker bridge network that poll this application every few seconds. When launching WSL, the applications are able to communicate reliably, but this connection deteriorates over time, and after 1-2 days, I notice frequent connection timed out responses from the application running on the host. Running wsl --shutdown and restarting the Docker daemon fixes the issue temporarily. Shifting applications out of Docker and onto the host fixes their communication issues as well. It may be related to the overall network issues linked above.

To be clear, it can still connect. It just starts timing out more and more often the longer the network/containers have been up.

Information

I have had this problem ever since starting to use Docker for Windows with the WSL2 backend.

Steps to reproduce the behavior

  1. Run an application on the Windows host. I tried with NZBGet (host ip: 192.168.1.2)
  2. Poll this application from within a Docker container inside a Docker bridge network living within WSL2. I polled 192.168.1.2:6789 every few seconds (a rough sketch of this polling loop follows the list)
  3. Check back in a day to see if the connection is timing out more frequently.
  4. Restart WSL/Docker daemon, notice that the connection is suddenly more reliable though it will begin to deteriorate over time again
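
For illustration, a rough sketch of step 2 (host IP, port, and polling interval as above; curlimages/curl is just a convenient image that ships curl and a shell):

docker network create polltest
docker run --rm --network polltest --entrypoint sh curlimages/curl -c \
  'while true; do
     # any numeric HTTP status (even 401) means the host answered;
     # a curl failure plus TIMEOUT means the connection hung or timed out
     curl -s -o /dev/null --max-time 5 -w "%{http_code}\n" http://192.168.1.2:6789/ || echo TIMEOUT
     sleep 5
   done'

Left running for a day or two, the share of TIMEOUT lines should climb if the deterioration described above is happening.
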
rg9400 commented 3 years ago

This seems to improve if you use the recommended host.docker.internal option instead of using the IP of the host machine directly
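
For example, from inside a container the poll above would become something like (port taken from the repro steps):

curl http://host.docker.internal:6789/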

rg9400 commented 3 years ago

Further update on this. While the above does delay the onset of the deterioration, it still eventually happens. After 4-5 days, timeouts start occurring with increasing frequency, eventually reaching the point where a timeout happens every few calls, requiring a full restart of WSL and Docker to get things working again.

markoueis commented 3 years ago

We have the same issue

  1. Using 2.4.0.0
  2. We use host.docker.internal

We have a service running on the host.

If I try to hit host.docker.internal from within a Linux container, I can always get it to trip up eventually after, say, 5000 curl requests to http://host.docker.internal/service (it times out for one request)

If I try http://host.docker.internal/service from the host, it works flawlessly even after 10000 curl requests

Sometimes, intermittently, and we can't figure out why, it starts to fail much more frequently (maybe every 100 curl requests)

Something is up with the networking...

Here is a very simple test to show what's going on (animated screen capture attached: ezgif-3-7115a7f3b7ab).
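
A sketch of that kind of test from a Linux/macOS shell on the host, with "service" as a placeholder path:

docker run --rm --entrypoint sh curlimages/curl -c \
  'i=0
   while [ "$i" -lt 5000 ]; do
     # --max-time makes the occasional stuck handshake show up as a failure
     curl -s -o /dev/null --max-time 5 http://host.docker.internal/service || echo "timeout at request $i"
     i=$((i+1))
   done'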

markoueis commented 3 years ago

In my limited testing, I created a loopback adapter with the IP 10.0.75.2 and used that instead of host.docker.internal. It's much more reliable. It's an ugly workaround, but it might at least help show where the issue lies.
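
For anyone trying to reproduce this on Windows: assuming a loopback adapter (e.g. the Microsoft KM-TEST Loopback Adapter) is already installed via Device Manager, assigning the IP looks roughly like this from an elevated prompt ("Loopback" is a placeholder for whatever the adapter is named on your machine):

netsh interface ipv4 add address "Loopback" 10.0.75.2 255.255.255.0

The host service then needs to be reachable on that address for the containers to use it.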

markoueis commented 3 years ago

Hey guys, this is still happening pretty consistently. Is anyone looking at the reliability/performance of these things? Is this the wrong place to post this?

rg9400 commented 3 years ago

I was able to send this via their support and have them reproduce the issue. They diagnosed the cause but said it would involve some major refactoring, so they didn't have a target fix date. Below is the issue as described by them:

I can reproduce the bug now. If I query the vpnkit diagnostics with this program https://github.com/djs55/bug-repros/tree/main/tools/vpnkit-diagnostics while the connection is stuck then I observe: (for my particular repro the port number was 51580. I discovered this using wireshark to explore the trace)

$ tcpdump -r capture\\all.pcap port 51580
15:57:03.021934 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195077730 ecr 0,nop,wscale 7], length 0
15:57:04.064094 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195078771 ecr 0,nop,wscale 7], length 0
15:57:06.111633 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195080819 ecr 0,nop,wscale 7], length 0
15:57:10.143908 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195084851 ecr 0,nop,wscale 7], length 0
15:57:18.464142 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195093171 ecr 0,nop,wscale 7], length 0
15:57:34.848536 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195109555 ecr 0,nop,wscale 7], length 0
15:58:07.103411 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195141811 ecr 0,nop,wscale 7], length 0

which is a stuck TCP handshake from the Linux point of view. The same thing is probably visible in a live trace from docker run -it --privileged --net=host djs55/tcpdump -n -i eth0.

Using sysinternals process explorer to examine the vpnkit.exe process, I only see 1 TCP connection at a time (although a larger than ideal number of UDP connections which are DNS-related I think). There's no sign of a resource leak.

When this manifests I can still establish other TCP connections and run the test again -- the impact seems limited to the 1 handshake failure.

The vpnkit diagnostics has a single TCP flow registered:

> cat .\flows
TCP 192.168.65.3:51580 > 192.168.65.2:6789 socket = open last_active_time = 1605023899.0

which means that vpnkit itself thinks the flow is connected, although the handshake never completed.
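
For convenience, the live-trace command mentioned above, ready to run as-is (it shows the container-side view of eth0 inside the Docker VM):

docker run -it --privileged --net=host djs55/tcpdump -n -i eth0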

markoueis commented 3 years ago

Woah, thanks for this update @rg9400. Glad you got it on their radar. So your workaround is to restart Docker and run wsl --shutdown? I've been trying to use another IP (a loopback adapter) instead of host.docker.internal or whatever host.docker.internal points to, but I'm not 100% sure that solves the problem permanently. Maybe it's just a new IP, so it will work for a little while and then deteriorate again over time. Based on your explanation of the root cause, that might indeed be the case.

rg9400 commented 3 years ago

Yeah, for now I am just living with it and restarting WSL/Docker every now and then when the connection timeouts become too frequent and unbearable.

markoueis commented 3 years ago

What can we do to get this worked on? Is there work happening on it, or a ticket we can follow? This still bugs us quite consistently.

markoueis commented 3 years ago

I want to keep this thread alive, as this is a massive pain for folks, especially because they don't know it's happening. This needs to become more reliable.

Here is a newer diagnostic id: F4D29FA0-6778-40B8-B312-BADEA278BB3B/20210521171355

Also discovered that just killing vpnkit.exe in Task Manager alleviates the problem. It restarts almost instantly, and connections resume much better without having to restart containers or anything. But the problem eventually reoccurs.
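
For reference, killing it from a command prompt amounts to the following (Docker Desktop restarts vpnkit.exe on its own almost immediately):

taskkill /IM vpnkit.exe /F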

stormmuller commented 2 years ago

We have about 15 services in our docker-compose file and all of them do an npm install. A cacheless build is impossible because it tries to build all the services at once, and the npm install steps time out because trying to download that many packages just kills the bandwidth.

I'm not using the --parallel flag, and I've set the following environment variables:

But none of this seems to change the behavior.

bradleyayers commented 2 years ago

This happens on macOS too, in fact quite reliably after ~7 minutes and ~13,000 requests of hitting an HTTP server:

Server:

$ python3 -mhttp.server 8015

Client (siege):

$ cat <<EOF > siegerc
timeout = 1
failures = 1
EOF
$ docker run --rm -v $(pwd)/siegerc:/tmp/siegerc -t funkygibbon/siege --rc=/tmp/siegerc -t2000s -c2 -d0.1 http://host.docker.internal:8015/api/foo

Output:

New configuration template added to /root/.siege
Run siege -C to view the current settings in that file
** SIEGE 4.0.4
** Preparing 2 concurrent users for battle.
The server is now under siege...[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(1) sock.c:240: Connection timed out
siege aborted due to excessive socket failure; you
can change the failure threshold in $HOME/.siegerc

Transactions:              13949 hits
Availability:              99.99 %
Elapsed time:             378.89 secs
Data transferred:           6.24 MB
Response time:              0.00 secs
Transaction rate:          36.82 trans/sec
Throughput:             0.02 MB/sec
Concurrency:                0.10
Successful transactions:           0
Failed transactions:               1
Longest transaction:            0.05
Shortest transaction:           0.00

What's interesting is that it gets progressively worse from there: the timeouts happen more and more frequently. Restarting the HTTP server doesn't help, but restarting it on another port does (e.g. from 8019 -> 8020). From there you get another 7 minutes of 100% success before it starts degrading again.

I tried adding an IP alias to my loopback adapter and hitting that instead of host.docker.internal but it had the same behavior (i.e. degraded after 7 minutes). The same goes for using the IP (192.168.65.2) and skipping the DNS resolution.
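
For reference, the loopback-alias attempt was presumably something along these lines (the alias IP is arbitrary; the port matches the example above):

sudo ifconfig lo0 alias 10.254.254.254
curl http://10.254.254.254:8015/api/foo   # from inside the container, instead of host.docker.internal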

rg9400 commented 2 years ago

This issue remains unresolved. The devs indicated it required major rework, but I haven't heard back from them in 6 months on the progress.

docker-robott commented 2 years ago

Issues go stale after 90 days of inactivity. Mark the issue as fresh with /remove-lifecycle stale comment. Stale issues will be closed after an additional 30 days of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows. /lifecycle stale

rg9400 commented 2 years ago

/remove-lifecycle stale

zadirion commented 2 years ago

I am also affected by this issue. I thought at one point it was because of TCP keepalive on sockets, and the sockets not being closed as fast as they are opened, thus exhausting the maximum number of available sockets. But the problem doesn't go away even if my containers stop opening connections for a while; only a restart of Docker and WSL seems to fix this. This issue should be high priority...
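
A quick, hedged way to sanity-check the socket-exhaustion theory from inside the Docker VM's network namespace (uses busybox netstat from the alpine image; --net=host here attaches to the VM, not the Windows host):

docker run --rm --net=host alpine sh -c 'netstat -tan | grep -c ESTABLISHED'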

artzwinger commented 2 years ago

I cannot connect from a container to a host port, even using telnet. The network mode is bridge, which is the default, but "host" mode doesn't work either.

I tried to guess the host IP, and I also tried this: extra_hosts:

A telnet connection from the host machine to the same host port works fine.

It worked fine in previous Docker versions! It seems to have broken with some update, maybe from 2021-2022.

artzwinger commented 2 years ago

Update: it was my Ubuntu UFW that was blocking containers from connecting to host ports.

raarts commented 2 years ago

Having this exact problem on MacOS. Restarting Docker fixes the problem (for a while).

bernhof commented 2 years ago

We have reports of this occurring across teams on Windows and macOS as well. We have no reports of this issue occurring on Linux.

Someone noticed that on macOS, simply waiting ~15mins often alleviates the problem.

metacity commented 2 years ago

We're also experiencing this (using host.docker.internal) on Docker Desktop for Windows. Strangely enough, Docker versions up to 4.5.1 seem to work fine, but versions 4.6.x and 4.7.x instantly bring up the problem. Connections work for some time, but then the timeouts start. All checks of "C:\Program Files\Docker\Docker\resources\com.docker.diagnose.exe" pass.

RomanShumkov commented 2 years ago

I'm experiencing the same problem, with an increasing number of timeouts over time while using host.docker.internal.

stamosv commented 2 years ago

I'm also experiencing the same problem. Downgrading to 4.5.1 looks like it solves the issue.

levimatheri commented 2 years ago

Any update on this issue? I'm experiencing the same. Restarting the container does not fix it. Only restarting the daemon/host resolves it.

bernhof commented 2 years ago

We seem to have resolved the issue on Windows (but not Mac).

We previously had the following configuration in our compose file to allow containers to reach the host using "host.docker.internal" on Windows, Mac and Linux hosts:

extra_hosts:
- "host.docker.internal:host-gateway"

Removing this configuration resolved the timeout issue on Windows (but can obviously cause other problems). Mac users still have timeout issues, though.
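
One hedged way to reconcile this, assuming Docker Desktop already provides host.docker.internal on Windows/macOS and the mapping is only needed for Linux engines: keep the entry out of the base compose file and layer it in from a Linux-only override file (the file name here is arbitrary):

# docker-compose.linux.yml contains only the extra_hosts entry quoted above
docker compose -f docker-compose.yml -f docker-compose.linux.yml up -d   # on Linux hosts
docker compose -f docker-compose.yml up -d                               # on Windows/macOS hosts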

raeganbarker commented 2 years ago

We are encountering this issue on macOS 12.0. We determined that our developers using Docker Desktop 4.3.0 have not encountered the issue, so we are currently testing a downgrade of Docker Desktop to 4.3.0, which seems to have resolved the problem so far. We have not yet tested going all the way back up to 4.5.1 as noted earlier in this thread. We also have not yet observed this issue in Docker on our x86 Ubuntu environments.

jrpope2014 commented 2 years ago

I was able to send this via their support and have them reproduce the issue. They diagnosed the cause, but said it would involve some major refactoring, so they didn't have a target fix date. Below is the issue as mentioned by them

I can reproduce the bug now. If I query the vpnkit diagnostics with this program https://github.com/djs55/bug-repros/tree/main/tools/vpnkit-diagnostics while the connection is stuck then I observe: (for my particular repro the port number was 51580. I discovered this using wireshark to explore the trace)

$ tcpdump -r capture\\all.pcap port 51580
15:57:03.021934 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195077730 ecr 0,nop,wscale 7], length 0
15:57:04.064094 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195078771 ecr 0,nop,wscale 7], length 0
15:57:06.111633 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195080819 ecr 0,nop,wscale 7], length 0
15:57:10.143908 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195084851 ecr 0,nop,wscale 7], length 0
15:57:18.464142 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195093171 ecr 0,nop,wscale 7], length 0
15:57:34.848536 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195109555 ecr 0,nop,wscale 7], length 0
15:58:07.103411 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195141811 ecr 0,nop,wscale 7], length 0

which is a stuck TCP handshake from the Linux point of view. The same thing is probably visible in a live trace from docker run -it --privileged --net=host djs55/tcpdump -n -i eth0.

Using sysinternals process explorer to examine the vpnkit.exe process, I only see 1 TCP connection at a time (although a larger than ideal number of UDP connections which are DNS-related I think). There's no sign of a resource leak.

When this manifests I can still establish other TCP connections and run the test again -- the impact seems limited to the 1 handshake failure.

The vpnkit diagnostics has a single TCP flow registered:

> cat .\flows
TCP 192.168.65.3:51580 > 192.168.65.2:6789 socket = open last_active_time = 1605023899.0

which means that vpnkit itself thinks the flow is connected, although the handshake never completed.

@rg9400 this was SUPER helpful... I started running into the same issue. I use dockerized Jupyter on Docker for Windows for a significant amount of my day-to-day work and have been getting CONSTANT timeout errors when I run notebooks from the beginning. I was also restarting Docker a ton, but after finding this comment, I found a way to pretty consistently "unstick" things (though it's definitely still annoying):

C:\Users\blah\blah> tasklist | findstr vpnkit.exe
C:\Users\blah\blah> taskkill /F /pid <pid of vpnkit>

And then I give it just a sec and when the cell tries again to reestablish connections, it's good.

I don't really think this is a viable solution for folks running processes that are constantly establishing connections, but it works for me currently for Jupyter (once I get data, I don't really need more connections for the notebook though).

Any update from folks working on this issue? I found something that sounds similar from 2019, but from searching the issues I can't find any sign that anyone is making an effort to resolve it on the vpnkit side.

hakey1408 commented 2 years ago

Having this exact problem on MacOS. Restarting Docker fixes the problem (for a while).

I'm having the same issue, running Docker 4.9.1 on Mac, and I'm facing it very often. After restarting Docker it works again, but that is not a long-term solution, as you mentioned...

razorman8669 commented 2 years ago

I'm also affected by this issue. It seems to hang after just a few minutes and causes network timeouts. Restarting Docker fixes it for a few more minutes. It's practically unusable...

rossinineto commented 2 years ago

I tried to use the internal IP of the Docker host (instead of "host.docker.internal"), but the problem still occurs: in a few minutes, the network connection timeouts start again. Just stopping and starting the container doesn't fix the issue; only recreating the container does. I'm working with Docker Desktop for Windows, v4.9.1, updated today!

rossinineto commented 2 years ago

Just run the command

taskkill /im vpnkit.exe /f

and the connection with the Docker host is fixed for a few more minutes.

razorman8669 commented 2 years ago

I can confirm that downgrading Docker Desktop for Mac to version 4.5.0 "fixes" the problem, and I no longer have connection issues. I tested for several days without any problems, then upgraded back to the latest version and started getting timeouts and connection failures almost immediately again.

ghost commented 2 years ago

+1

I’ve had this issue since 4.5.0 as well. I’m pinned to that release until it’s fixed.

rossinineto commented 2 years ago

As reported in some comments here: since I had Docker 4.9.1 installed, I downgraded Docker for Windows to 4.5.1, and the connection timeout doesn't occur anymore.

Spenhouet commented 2 years ago

While the issue title only refers to connections to the host, it also happens with any other address (internal or external).

rossinineto commented 2 years ago

In the 4.10 release notes I didn't see any mention of this issue. Did anybody check whether 4.10 still has this issue?

ChinhFong commented 1 year ago

I'm currently working with Laravel, so in order to migrate the database I have to use 127.0.0.1, but for the connection to work I have to use host.docker.internal:

DB_HOST=127.0.0.1

DB_HOST=host.docker.internal

Windows 10, Docker Desktop 4.10.1 (82475)

kevinmcmurtrie commented 1 year ago

We have reports of this occurring across teams on Windows and macOS as well. We have no reports of this issue occurring on Linux.

Someone noticed that on macOS, simply waiting ~15mins often alleviates the problem.

As of Docker version 20.10.17, build 100c701 on Ubuntu 22.04, it happens on Linux.

I'm using Docker to run Kiwix scrapers. They work for a while, then start hitting timeouts. If the scraped host has multiple IP addresses, the scraper can round-robin between them and keep running. If the host has only one, the task is likely to fail from too many timeouts.

Mathias-S commented 1 year ago

Everyone on my team with a Mac (we only have a mix of Mac and Linux) is also experiencing this problem. It manifests after 1-3 days without restarting, but during that timeframe we probably make on the order of 10,000 outbound HTTP calls, similar to what was mentioned in https://github.com/docker/for-win/issues/8861#issuecomment-903421446.

This seems like a very serious issue if it happens so consistently.

Our solution was to stop using Docker for certain containers with high traffic volume on (non-Linux) developer machines.

jdeitrick80 commented 1 year ago

I have also seen this issue in versions >4.5.1, including the latest version, but I have found that it can also be triggered with low amounts of traffic. The following is how I have been able to reproduce the issue.

docker run --name session-test -it -v /mnt/c/Users/jde/sessions/test:/test python:buster bash
root@17c8b33e70e6:/# pip install --quiet requests
root@17c8b33e70e6:/# cd test/
root@17c8b33e70e6:/test# python sessions.py
10:52:18: request 1
Request complete, sleep 30
10:52:50: request 2
Request complete, sleep 120
10:54:51: request 3
Request complete, sleep 420
11:01:51: request 4
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

sessions.py

import requests
from datetime import datetime
import time

s = requests.Session()

steps = [30, 120, 420, 420]
step = 1
for i in steps:
    print(datetime.now().strftime("%H:%M:%S") + ": request " + str(step))
    r3 = s.get('https://wttr.in')
    print("Request complete, sleep " + str(i))
    step+=1
    time.sleep(i)

As has been mentioned before, if I look at a trace from the container's point of view I only see TCP SYNs being sent out during the 4th attempt, after waiting 420s since the last request. Also, if I kill vpnkit while it is still trying the 4th attempt, then when vpnkit starts back up the 4th request is able to complete successfully.

Some things I have noticed that I do not think were previously mentioned: if I look at a trace from the host, I see the TCP SYNs going out and TCP SYN-ACKs coming back from the server, but these are not passed on to the container. If I start up another container while the first is unsuccessfully making the 4th attempt, it also is not able to reach the same destination, but it is able to reach other destinations.

docker run -it python:buster bash
root@14437db6e250:/# curl https://wttr.in
curl: (7) Failed to connect to wttr.in port 443: Connection timed out
root@14437db6e250:/# curl https://google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="https://www.google.com/">here</A>.
</BODY></HTML>
root@14437db6e250:/# curl https://wttr.in
curl: (7) Failed to connect to wttr.in port 443: Connection timed out

The cause of the issue seems to have something to do with using sessions and having a client-side keep-alive interval of >=60s. If I change to a 30s client keep-alive interval, I do not run into the issue.

docker run --name session-test -it -v /mnt/c/Users/jde/sessions/test:/test python:buster bash
root@425db0e6590a:/# pip install --quiet requests
root@425db0e6590a:/# cd test/
root@425db0e6590a:/test# python sessions-ka30.py
11:41:35: request 1
Request complete, sleep 30
11:42:06: request 2
Request complete, sleep 120
11:44:06: request 3
Request complete, sleep 420
11:51:06: request 4
Request complete, sleep 420
root@425db0e6590a:/test#

sessions-ka30.py

import requests
from datetime import datetime
import time

import socket
from requests.adapters import HTTPAdapter

class HTTPAdapterWithSocketOptions(HTTPAdapter):
    def __init__(self, *args, **kwargs):
        self.socket_options = kwargs.pop("socket_options", None)
        super(HTTPAdapterWithSocketOptions, self).__init__(*args, **kwargs)

    def init_poolmanager(self, *args, **kwargs):
        if self.socket_options is not None:
            kwargs["socket_options"] = self.socket_options
        super(HTTPAdapterWithSocketOptions, self).init_poolmanager(*args, **kwargs)

KEEPALIVE_INTERVAL = 30
adapter = HTTPAdapterWithSocketOptions(socket_options=[
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPALIVE_INTERVAL),
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, KEEPALIVE_INTERVAL)])
s = requests.Session()
s.mount("http://", adapter)
s.mount("https://", adapter)

steps = [30, 120, 420, 420]
step = 1
for i in steps:
    print(datetime.now().strftime("%H:%M:%S") + ": request " + str(step))
    r3 = s.get('https://wttr.in')
    print("Request complete, sleep " + str(i))
    step+=1
    time.sleep(i)

I hope this information helps in resolving the issue or provides a workaround for others experiencing it.

I have also added this information to https://github.com/moby/vpnkit/issues/587

rossinineto commented 1 year ago

v4.5.1 works fine. Then I removed Docker Desktop for Windows, cleaned up all the leftover junk files, and reinstalled v4.12.0, and the connection problem started again.

rossinineto commented 1 year ago

Randomly, the connection to the Docker host stops working.

(screenshot attached: Captura de tela 2022-09-12 073428)

TomasAndersen commented 1 year ago

I am using Docker Desktop 4.11.1 (84025) on a Mac. I am running 4 Python processes in separate containers, and when the processes have run for a day or two I start to get timeouts. Our connection timeout is set to 300 seconds, so the delay (if there is any network traffic at all) is significant. Connections to applications on localhost seem to work OK, as we have some internal traffic which is not affected.

A restart of the Docker Desktop application solves the problem and the connection is back up again. There are no network issues on the host machine, as I can reach the URL we try to reach via the containers the whole time.

We have added the following extra_hosts entry in our docker-compose file:

nk9 commented 1 year ago

My team is using Docker Desktop on both Macs and Windows in multiple countries, and I believe we have been seeing this for many months. It manifests as connections to two AWS hosts hanging after a while: ssm.us-east-2.amazonaws.com and cognito-idp.us-east-2.amazonaws.com. Other hosts work fine, and the two hosts are reachable from outside Docker. After 5–10 minutes, connections stop hanging. But the longer the Docker process has been running, the more frequently the connection failures occur.

Restarting the container doesn't fix the problem. Restarting the Docker process stops it from happening for a while.

I was on a series of recent versions of Docker Desktop for Mac, most recently 4.13.0 I think, and I was seeing the problem reliably. After finding this issue, I've gone back to 4.5.0. Things have been fine for a day, so I'm hoping that's a workaround while we wait for a fix.

I am concerned that this issue doesn't have the priority it deserves. I'm not doing anything particularly intensive with my container, just running a web app. We don't even hit the AWS servers that often. I think many people are probably seeing this and blaming it on ISP/network issues or hosts being down, when in fact the problem is Docker itself. It took me months of on-and-off debugging (because the problem is so intermittent) to finally look for a GitHub issue here; I thought it was something to do with the way I was calling the AWS hosts. I get that the solution is hard, but I'd like to see Docker at least announce that they have a plan to address this after 2 years of steadily increasing impact.

rg9400 commented 1 year ago

I think the only way to increase visibility is to submit official bug reports to the team linking to this issue. My official ticket with them (where the issue was discovered) keeps asking me to confirm it is still an issue to prevent auto-closure, and I have not been able to get an update from the team beyond one from a few months ago where they asked if this had resolved itself (to which I replied it had not).

robertnisipeanu commented 1 year ago

This issue was fixed for me on macOS after editing ~/Library/Group\ Containers/group.com.docker/settings.json and setting vpnKitMaxPortIdleTime from 300 to 0 (a Docker Desktop restart is required afterwards). I changed this over a week ago, and so far I have not encountered the issue again.

I don't know how this can be changed on Windows, though.
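
A minimal sketch of that edit on macOS, assuming jq is installed and Docker Desktop has been quit first (restart it afterwards, as noted above):

SETTINGS=~/Library/Group\ Containers/group.com.docker/settings.json
cp "$SETTINGS" "$SETTINGS.bak"                          # keep a backup
jq '.vpnKitMaxPortIdleTime = 0' "$SETTINGS.bak" > "$SETTINGS"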

rg9400 commented 1 year ago

The reason I mentioned contacting Docker is that this was my latest communication from them, which you can see is pretty frustrating:

I did some investigation on this one.

We have parsed through the thread and users seem to be able to address the issue by restarting WSL, so it’s not a blocker. A ticket has been created from the begining (2 years ago) but as it is not a blocker, we don’t see us getting to this over the course of the next few months. Once resolved we will update the issue on github.

Sorry for the incovenience.

Doscker Support

nk9 commented 1 year ago

Thanks for the update, and I will file a support request. Apart from anything else, their answer ignores that this is an issue for users who aren't on Windows. Just restarting Docker Desktop all the time isn't an acceptable workaround IMO.

levimatheri commented 1 year ago

Sorry for the incovenience. Doscker Support

Lol they can't even spell their own product sigh...

Spenhouet commented 1 year ago

Just a reminder that this is actually an issue with vpnkit, and it also affects Mac. That might partially be the reason why nobody on the Docker for Windows team feels responsible for fixing this.

We need to actively raise this on the vpnkit project and not here!

e.g. you can voice your support here: https://github.com/moby/vpnkit/issues/587