Network Errors, Connection Resets, Requiring Docker Restart

micchickenburger commented 5 years ago

[x] I have tried with the latest version of my channel (Stable or Edge)
[x] I have uploaded Diagnostics
Diagnostics ID: 70614E8F-4B41-4857-9044-8A1A2C1A5FFC/20190108175851

Expected behavior

Networking between my Mac OS host and running docker containers should not sporadically stop working.

Actual behavior

I'm experiencing connection resets and other network errors between my Mac host and all running Docker containers. Restarting the containers does not resolve the problem; I have to restart Docker. I've experienced this twice today, but had never experienced it before.

Information

macOS Version: macOS Mojave 10.14.2

Diagnostic logs

Steps to reproduce the behavior

It's hard to reproduce. Connectivity just ceases seemingly randomly, requiring me to restart docker. This is occurring on a fresh docker install as well.

micchickenburger commented 5 years ago

When the issue occurs, this is the error I receive when running mongo on my host to connect to the mongo server in my mongo Docker container.

$ mongo
MongoDB shell version v4.0.2
connecting to: mongodb://127.0.0.1:27017
2019-01-08T12:18:13.494-0600 E QUERY    [js] Error: network error while attempting to run command 'isMaster' on host '127.0.0.1:27017'  :
connect@src/mongo/shell/mongo.js:257:13
@(connect):1:6
exception: connect failed

This is the issue I experienced trying to create an SQS queue on localstack:localstack running in a container:

Starting Localstack...
Started Localstack with hash 8a42a0e43174653894b4e1985d7d835f48b8da6ab46e7ecef57f66edaff4176d
Creating queue jobs at SQS endpoint http://localhost:4576/queue/jobs
An error occurred (502) when calling the CreateQueue operation (reached max retries: 4): Bad Gateway

Image versions:

$ docker image ls
REPOSITORY              TAG                 IMAGE ID            CREATED             SIZE
localstack/localstack   latest              c583aaf39486        4 days ago          1.02GB
mongo                   latest              7177e01e8c01        10 days ago         393MB

piniondna commented 5 years ago

I am also experiencing this issue. Same exact Docker for Mac version. The docker compose containers will start up fine after a system reboot or docker restart, but the network mapping to the host will stop responding after a (seemingly) random amount of time. I can still bash into the containers, and they access each other through curl, so the docker sub-network seems to be fine, but there is no system host mapping. I get timeouts trying to access them from a browser.

I've allocated 16gb to docker and 6 cpus, so I don't think it's a resource issue. The stack uses roughly 6gb of memory.
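
The symptom described above (published ports stop answering from the host while the containers themselves stay healthy) can be checked from the host with a timeout-bound TCP probe. This is only a minimal sketch and not something from the original report; the function name `port_is_reachable` and the example port are illustrative assumptions.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 27017 is the MongoDB port mentioned earlier in this thread; adjust as needed.
    print(port_is_reachable("127.0.0.1", 27017))
```

A successful connect only shows that the port accepts connections; it does not prove that the service behind it answers correctly, so treat a True result as a necessary rather than sufficient signal.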

lduchesne commented 5 years ago

I too had this problem and downgrading to Docker Community Edition 18.06.0 seemed to have fixed it. It might be related to https://github.com/docker/for-mac/issues/3360

schmurgon commented 5 years ago

Same problem here on all versions higher than 18.06.1-ce-mac73 (26764). For me, the network connections drop out when using the JVM debugger after about 60 seconds.

onmomo commented 5 years ago

Same problem here. Diagnostics ID: 2D80AB9C-9218-4705-933D-0B9B7525F15A/20190104105334. Might be related to #3417.

ivester commented 5 years ago

Same problem here

retoheusser commented 5 years ago

I experience this too...

jeff-cook commented 5 years ago

I experience this as well. Same version.

$ docker version
Client: Docker Engine - Community
 Version:           18.09.0
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        4d60db4
 Built:             Wed Nov 7 00:47:43 2018
 OS/Arch:           darwin/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       4d60db4
  Built:            Wed Nov 7 00:55:00 2018
  OS/Arch:          linux/amd64
  Experimental:     true

jeff-cook commented 5 years ago

I have noticed that when it happens the CPU spikes. I have not figured out if the spike is a cause or a symptom.

jeff-cook commented 5 years ago

It definitely seems load based. I was able to make it a day and overnight without network loss while running only one small web service. It is the first time in a week I have been able to run longer than an hour or two.

josh-h commented 5 years ago

@jeff-cook I'm not certain it is load related. I experienced similar symptoms with UDP packets failing to be bridged into the container. In each of the several cases the container ran for 24 hours or so and when returning to the office in the morning the networking to the container was dead. During the day the host would have constant use with periods of the CPU being driven hard. So, I suspect it is not load, but some other condition that triggers the bug.

My environment is a mac server with a static IP assigned, so it is not related to sleep or flaky wifi connections. Reverting to 18.06.1 seems to be a valid workaround, so far.

jeff-cook commented 5 years ago

It happened (76297123-260B-45B6-872E-9DE74FB5F950/20190111200619) even after a downgrade to Version 2.0.0.0-mac78 (28905) c404a62c3f

josh-h commented 5 years ago

This issue occurred again, after 2 days of uptime, running Server Version: 18.06.1-ce.

I've determined that ingress network connections succeed (ie. I see UDP network packets arrive in the container). However, outbound connections from the container fail (ie. ping google.com doesn't receive any responses). Restarting the engine resolved the outbound connection issue.

Tobsucht commented 5 years ago

Same here. As @josh-h already mentioned, I don't think it is load related.

macOS Mojave Version 10.14

chenxushuo commented 5 years ago

Same problem here! Every time I restart the Docker service, I get this error again after about 30 minutes.

VillanCh commented 5 years ago

Me too. I am using RabbitMQ and Postgres for Django app development. After publishing or receiving some data, networking to my localhost fails: I cannot connect to RabbitMQ or Postgres from the host. I can still run docker-compose exec postgres bash to enter the container, and from inside the container I can connect to both the database and the message queue.

Besides, I cannot ping Postgres or RabbitMQ.

After restarting Docker for Mac, everything recovers, but a few moments later it happens again and again.

I have to stop coding to restart Docker. What a mess.

MarounMaroun commented 5 years ago

Same problem here. Does anyone have a solution already?

elhay-av commented 5 years ago

Same problem here. Does anyone have a solution already?

https://github.com/docker/for-mac/issues/3448#issuecomment-452490002

jezao commented 5 years ago

Same problem here

MarounMaroun commented 5 years ago

@jezao Downgrading to 18.06.0-ce-mac70 2018-07-25 solved my problem. Download here: https://download.docker.com/mac/stable/26399/Docker.dmg

daviyang35 commented 5 years ago

Same problem.

macOS 10.14.4 Docker Desktop 2.0.0.3(31259)

docker engine

Client: Docker Engine - Community
 Version:           18.09.2
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        6247962
 Built:             Sun Feb 10 04:12:39 2019
 OS/Arch:           darwin/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.2
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       6247962
  Built:            Sun Feb 10 04:13:06 2019
  OS/Arch:          linux/amd64
  Experimental:     false

docker-compose

docker-compose version 1.23.2, build 1110ad01
docker-py version: 3.6.0
CPython version: 3.6.6
OpenSSL version: OpenSSL 1.1.0h  27 Mar 2018

MarounMaroun commented 5 years ago

@daviyang35 Did you try to downgrade to 18.06.0-ce-mac70 2018-07-25? See my comment above.

daviyang35 commented 5 years ago

@MarounMaroun Yes. Using the version from your link avoids this issue. Thanks.

keymandll commented 5 years ago

I'm experiencing the same issue I guess. (Engine version 18.09.2)

Normally, it happens after a week of running about 10 containers. The load is always extremely light (I have no numbers). What I noticed yesterday is that network traffic higher than usual (~200 transactions/second) results in the host losing its network connection to the containers. By one transaction I mean a set of connect, send, receive, and disconnect operations. Doing the exact same operations at 1 transaction/second did not trigger the issue.

I remember seeing a log entry somewhere, produced by Docker, about a SYN flood, which could be related. Unfortunately, I neither remember nor can figure out where I saw it.

I have noticed that when I experience the issue, networking between the containers still works fine.
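
The transaction pattern described above (connect, send, receive, disconnect, repeated at a fixed rate) can be sketched as a small load generator. This is a hedged illustration, not the tool used in the report: the function name `run_transactions`, the `ping` payload, and the assumption that the target service sends something back are all hypothetical.

```python
import socket
import time


def run_transactions(host: str, port: int, rate_per_sec: float,
                     count: int, timeout: float = 2.0) -> int:
    """Run `count` connect/send/receive/disconnect cycles at roughly
    `rate_per_sec` and return how many of them failed."""
    interval = 1.0 / rate_per_sec
    failures = 0
    for _ in range(count):
        started = time.monotonic()
        try:
            # A new connection per transaction, closed when the block exits.
            with socket.create_connection((host, port), timeout=timeout) as conn:
                conn.sendall(b"ping\n")  # hypothetical payload
                if not conn.recv(1024):
                    failures += 1  # peer closed without replying
        except OSError:
            # Refused, reset, or timed-out transactions all count as failures.
            failures += 1
        # Sleep off whatever remains of this transaction's time slot.
        remaining = interval - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)
    return failures
```

Running it once with `rate_per_sec=1` and once with `rate_per_sec=200` against a published container port would mirror the comparison in the comment; a rising failure count at the higher rate would be consistent with the behavior reported, though it would not by itself identify the cause.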

lucas-bremond commented 5 years ago

Same issue. Seems to occur randomly. Restarting Docker fixes it.

dmuth commented 5 years ago

I had something similar (#3674) recently, and I ended up writing a test to reliably catch the issue, which I published at https://github.com/dmuth/docker-health-check

In my case, downgrading to 18.06.0-ce-mac70 also worked.

-- Doug

Kaelten commented 5 years ago

still an issue with the most recent release, downgrading seems to alleviate.

anirudhwarrier commented 5 years ago

Issue exists even on 2.0.0.3. Had to downgrade to 18.06.0-ce-mac70.

wilomgfx commented 5 years ago

Same here, I keep getting network issues. Containers are running fine and I can connect to them via bash, but they won't load in my browser or connect to each other.

rubnov commented 5 years ago

I've been banging my head against the wall with this issue. I'm running a Django app in a docker container, which connects to a Postgres database on the host machine. I am getting the following error quite often (every 2 - 10 minutes):

django.db.utils.OperationalError: could not connect to server: Connection timed out

I've downgraded from Docker 2.1.0.3 (currently the latest) to 18.06.0-ce-mac70 as suggested above, and the error above disappeared, but only to be replaced by this error, which occurs even more often:

django.db.utils.OperationalError: could not connect to server: Connection refused

This seems to be related to load / number of requests. The issue occurs more often in high load situations.

Any suggestions on what I can do next?

kiragaz commented 4 years ago

Problem is still reproducible on 2.1.0.5

docker-robott commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale comment. Stale issues will be closed after an additional 30d of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows. /lifecycle stale

micchickenburger commented 4 years ago

/remove-lifecycle stale

dko-slapdash commented 4 years ago

@guillaumerose could you please help with this issue? Basically, docker for mac can't be reliably used for development in some cases, the containers just stop responding on network connections.

A lot of people have been complaining for more than a year.
People have even started writing TOOLS to work around this issue by restarting Docker (with AppleScript).
@dmuth has clear steps to reproduce it in https://github.com/docker/for-mac/issues/3674 and https://github.com/dmuth/docker-health-check
All engineers in our company suffer from something that looks very similar to the current issue: we have to restart a container with elasticsearch a few times a day.

E.g. right now I have the following symptoms for elasticsearch container listening on port 9200:

$ curl -vvv http://127.0.0.1:9200/zzz
<fails when running from Mac>
curl: (7) Failed to connect to 127.0.0.1 port 9200: Operation timed out

$ netstat -nav | grep 9200
<shows 128 CLOSE_WAIT connections which are stale forever>
tcp4     271      0  127.0.0.1.9200         127.0.0.1.59438        CLOSE_WAIT  408300 146988  10666      0 0x1123 0x00000024
tcp4     271      0  127.0.0.1.9200         127.0.0.1.59437        CLOSE_WAIT  408300 146988  10666      0 0x1123 0x00000024
tcp4     271      0  127.0.0.1.9200         127.0.0.1.59436        CLOSE_WAIT  408300 146988  10666      0 0x1123 0x00000024
tcp4     271      0  127.0.0.1.9200         127.0.0.1.59435        CLOSE_WAIT  408300 146988  10666      0 0x1123 0x00000024
tcp4       0      0  127.0.0.1.9200         *.*                    LISTEN      131072 131072  10666      0 0x0100 0x00000026

# ps aux | grep 10666
dko           10666   0.0  0.4  5064504 143208   ??  S    Tue02PM  11:27.27 /Applications/Docker.app/Contents/MacOS/com.docker.backend -watchdog

$ pgrep node
<shows nothing, i.e. the connecting processes died long time ago>

$ sudo tcpdump -i any -n -p tcp port 9200 & curl -vvv http://127.0.0.1:9200/zzz
<only SYN packets travel?>
02:09:59.303560 IP 127.0.0.1.63681 > 127.0.0.1.9200: Flags [S], seq 1329676029, win 65535, options [mss 16344,nop,wscale 6,nop,nop,TS val 664304244 ecr 0,sackOK,eol], length 0
...
02:10:02.815838 IP 127.0.0.1.63681 > 127.0.0.1.9200: Flags [S], seq 1329676029, win 65535, options [mss 16344,nop,wscale 6,nop,nop,TS val 664307746 ecr 0,sackOK,eol], length 0
02:10:06.022543 IP 127.0.0.1.63681 > 127.0.0.1.9200: Flags [S], seq 1329676029, win 65535, options [mss 16344,sackOK,eol], length 0
...
02:10:18.867903 IP 127.0.0.1.63681 > 127.0.0.1.9200: Flags [S], seq 1329676029, win 65535, options [mss 16344,sackOK,eol], length 0

$ docker-compose exec elasticsearch bash
# curl http://127.0.0.1:9200/
<succeeds from inside the container>
{
  "name" : "41d967fa27a4",
  "cluster_name" : "docker-cluster",
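
The CLOSE_WAIT pile-up visible in the netstat output above can be tallied automatically. The following is a minimal sketch that assumes the column layout shown in that output (protocol, Recv-Q, Send-Q, local address, foreign address, state); the function name `count_states` is an illustrative assumption.

```python
from collections import Counter


def count_states(netstat_output: str, local_port: int) -> Counter:
    """Count TCP states for sockets whose local address ends in .<local_port>.

    Assumes the macOS-style layout from the output above:
    proto, recv-q, send-q, local address, foreign address, state, ...
    """
    states = Counter()
    suffix = f".{local_port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        if (len(fields) >= 6
                and fields[0].startswith("tcp")
                and fields[3].endswith(suffix)):
            states[fields[5]] += 1
    return states
```

Feeding it the captured netstat text and checking `count_states(text, 9200)["CLOSE_WAIT"]` over time would show whether stale connections keep accumulating on the forwarded port.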

piniondna commented 4 years ago

Has this issue been officially addressed or acknowledged at all? I’ve personally had multiple colleagues experience this same issue, and it seems to appear more with relatively complicated/large stacks.

It’s also complicating my efforts to convert developers into Docker users when they experience issues like this that shake their confidence in the technology.

The fact that there hasn’t been any official acknowledgement of this problem is strange. I think any number of devs (myself included) would be more than happy to help diagnose the root cause if called upon, but that hasn’t happened in more than a year?

Is Docker still being developed as a product? Or has it secretly gone into maintenance mode?

slikts commented 4 years ago

Note that downgrading should probably still be a viable workaround.

thtas commented 4 years ago

Just anecdotal... But I suffered with this issue for a while, trying a new version every few months and then downgrading back to a version I knew worked. Eventually somebody suggested it might be a memory issue, so I upgraded again and gave Docker a heap more memory (currently 12 GB allocated) and the problem went away. It's been solid for a few months now.

piniondna commented 4 years ago

Adding resources helped for me up to a point, but wasn’t a panacea.

Even if the “fix” is just surfacing an error message to the user, this is better than a silent failure leading to hours of app troubleshooting, just to find out it’s a failure with docker.

coding-bunny commented 4 years ago

This issue is still a big pain. My ElasticSearch container randomly spits out the error about a GET call failed and the entire docker crashes and needs to be restarted. This makes development really difficult.

dko-slapdash commented 4 years ago

BTW I mitigated it a little by opening far fewer connections to ES and making those connections persistent.

It looks like there is some memory (or connections?) leak in docker port forwarding proxy, so if you open and close a lot of concurrent connections from host to a container and do it many times per second, eventually these connections leak (even when the process which opens connections is terminated - so it's really a bug in docker-desktop, not in the calling app). I posted netstat/tcpdump above.

Before, we were accidentally opening one new connection per ES query and then closing them at random moments of time (sometimes keeping them open for a long time). After we switched to persistent connections, the docker bug disappeared too.
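
The mitigation described above (one persistent connection instead of a new connection per query) might look like the following in Python's standard library. This is a sketch under stated assumptions, not the code used by the commenter: the function name `fetch_many` is hypothetical, and the connection is only reused when the server keeps it alive.

```python
import http.client
from typing import Iterable, List


def fetch_many(host: str, port: int, paths: Iterable[str],
               timeout: float = 5.0) -> List[int]:
    """Issue one GET per path over a single reused HTTP/1.1 connection.

    Returns the status code of each response.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    statuses = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            response.read()  # drain the body so the socket can be reused
            statuses.append(response.status)
    finally:
        conn.close()
    return statuses
```

For example, `fetch_many("127.0.0.1", 9200, ["/", "/"])` would send both requests through one host-to-container connection, rather than opening and closing a socket for each query.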

flow3d commented 4 years ago

This issue has been open for over a year without any viable workaround (installing a 2018 version doesn't work, at least for me).

Any other suggestions?

Talhazzers commented 4 years ago

Same problem.

docker-robott commented 4 years ago

Issues go stale after 90 days of inactivity. Mark the issue as fresh with /remove-lifecycle stale comment. Stale issues will be closed after an additional 30 days of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows. /lifecycle stale

flow3d commented 4 years ago

/remove-lifecycle stale

rndnoise commented 3 years ago

I have the same problem. At least I'm not alone, I guess.

danielkihlgren commented 3 years ago

I'm having this issue too. Are there any official statements regarding this issue? I'm using v3.1.0 on macOS Catalina 10.15.7. I can't ping anything on my network from within the container. I can reach other services within the container. I'm also able to connect to the container from the outside using the exposed ports.

simonjthomas commented 3 years ago

Confirming the same issue. It's only started relatively recently, I think after I added a new container to a compose file but I'm not certain as the timing doesn't line up perfectly. I've just thrown a load more resources at docker to see if it helps, but it had plenty already. Docker 3.1.0 on MacOS 10.15.7.

I can access the container's web server through an HTTP connection, but the containers can't communicate with each other and can't talk to the outside world (pinging an external site from within the container doesn't work). It works fine for about two days at a time before entering this state with no config changes. Restarting containers doesn't help; only restarting Docker or the machine Docker is running on does.

Edited to add that after 2 days with increased resources, it's lost networking again. So additional resources appear to have no impact.

christianhuening commented 3 years ago

same here, Docker 3.3.1 on Big Sur 11.2.3

npoczynek commented 3 years ago

Seeing the same issue with Docker 3.3.3 on Big Sur 11.3.1. I suspect @dko-slapdash is on the right track re: resource leak related to number or rate of connections; running an aggressive nmap scan from a container to the host results in almost immediate network failures that are only resolved by restarting Docker engine.

marcellp commented 3 years ago

Pinging some maintainers here: @StefanScherer @djs55 @stephen-turner in case this issue flew under their radars.

As a summary of this thread, users have been experiencing intermittent network failures, connection resets, etc., requiring a full restart of Docker or pruning all images and networks for the thing to come back on-line. Seems to be related to a resource/memory leak, and happens more often under heavy load spikes and adding more resources to Docker seems to work for a little bit. There seem to be relatively easy steps to reproduce this problem at will, see e.g. https://github.com/docker/for-mac/issues/3448#issuecomment-628507263.

There are many other issues out there that seem to reference the same problem, e.g. #5538, #3674, and #3360. All of these have since been closed or have gone stale because no responses were received from the maintainers. These issues go as far back as 2018, since the new Docker Desktop for Mac was introduced.

We have been experiencing similar problems with Docker for Mac for the past 6 months and our team is often losing hours of productivity because of this issue. Would really appreciate it if one of you could look into this.
