docker / for-mac

Bug reports for Docker Desktop for Mac
https://www.docker.com/products/docker#/mac

Network randomly working slow on M1 RC3 #5548

Open sam0delkin opened 3 years ago

sam0delkin commented 3 years ago

I found that on the RC3 build, networking randomly runs slowly. See these two examples. Example 1 (sequential calls of SHOW DATABASES against MySQL, which is running the ARM version):

docker@a155cab3efd6:/$ time mysql -h mysql -u root -proot -e "SHOW DATABASES" > /dev/null

real    0m0.005s
user    0m0.002s
sys 0m0.001s
docker@a155cab3efd6:/$ time mysql -h mysql -u root -proot -e "SHOW DATABASES" > /dev/null

real    0m5.027s
user    0m0.004s
sys 0m0.006s
docker@a155cab3efd6:/$ time mysql -h mysql -u root -proot -e "SHOW DATABASES" > /dev/null

real    0m5.023s
user    0m0.001s
sys 0m0.008s
docker@a155cab3efd6:/$ time mysql -h mysql -u root -proot -e "SHOW DATABASES" > /dev/null

real    0m0.021s
user    0m0.007s
sys 0m0.007s

So in some cases an additional 5-second delay is added to the connection time. I thought the problem was in MySQL itself, but I tried the same thing with another container (running Elasticsearch, also on ARM). Example 2 (sequential curl calls against the default Elasticsearch port):

docker@a155cab3efd6:/$ time curl elasticsearch:9200 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   338  100   338    0     0  56333      0 --:--:-- --:--:-- --:--:-- 56333

real    0m0.024s
user    0m0.013s
sys 0m0.008s
docker@a155cab3efd6:/$ time curl elasticsearch:9200 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   338  100   338    0     0     66      0  0:00:05  0:00:05 --:--:--    84

real    0m5.104s
user    0m0.008s
sys 0m0.008s
docker@a155cab3efd6:/$ time curl elasticsearch:9200 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   338  100   338    0     0  37555      0 --:--:-- --:--:-- --:--:-- 42250

real    0m0.023s
user    0m0.010s
sys 0m0.007s

So here one can see exactly the same situation as with the MySQL connection. I thought the problem was with DNS, but after I ran time nslookup elasticsearch 100 times in a row I didn't see any performance issues with the DNS resolver, so it looks like the problem is in the network itself.
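
A loop like the following reproduces that check (a rough sketch, assuming bash and that nslookup is installed in the container):

for i in $(seq 1 100); do
  { time nslookup elasticsearch > /dev/null; } 2>&1 | grep ^real
done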

Additionally, here are my MySQL and Elasticsearch images:

$ docker inspect mysql | grep Image
        "Image": "sha256:6655265fdbd725b8e57b149b2e4076e15f8aedef317eb3dd8aa87906df52d6fd",
            "Image": "mysql/mysql-server:8.0",

$ docker inspect 6655265fdbd725b8e57b149b2e4076e15f8aedef317eb3dd8aa87906df52d6fd | grep Architecture
        "Architecture": "arm64",

$ docker inspect elasticsearch | grep Image
        "Image": "sha256:e9105bf3583abab5debf4d13bf07ce96aecff98f7f3b81940f4a7988c07edb6f",
            "Image": "itedev/elasticsearch_1.7_arm",

$ docker inspect e9105bf3583abab5debf4d13bf07ce96aecff98f7f3b81940f4a7988c07edb6f | grep Architecture
        "Architecture": "arm64",
sam0delkin commented 3 years ago

Actually, after some research I found that this is a DNS resolution issue. When I changed my test snippets to connect to the services directly via IP, the problem was not reproduced. Finally, I installed the tcpdump utility to check the DNS traffic and found that the resolver did not reply in time, see:

# tcpdump -n -i lo port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
06:23:07.378225 IP 127.0.0.11.53 > 127.0.0.1.44276: 38280 1/0/0 A 172.18.0.4 (60)
06:23:07.379701 IP 127.0.0.11.53 > 127.0.0.1.44276: 38536 0/0/0 (31)
06:23:19.938697 IP 127.0.0.11.53 > 127.0.0.1.46712: 64518 1/0/0 A 172.18.0.4 (60)
06:23:19.938771 IP 127.0.0.11.53 > 127.0.0.1.46712: 64518 0/0/0 (31)
06:23:20.940249 IP 127.0.0.11.53 > 127.0.0.1.34752: 16399 0/0/0 (31)
06:23:20.940414 IP 127.0.0.11.53 > 127.0.0.1.34752: 16399 1/0/0 A 172.18.0.4 (60)
06:23:25.946796 IP 127.0.0.11.53 > 127.0.0.1.34752: 16399 1/0/0 A 172.18.0.4 (60)
06:23:25.948185 IP 127.0.0.11.53 > 127.0.0.1.34752: 16399 0/0/0 (31)

and the corresponding snippet output:

$ time curl elasticsearch:9200 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   336  100   336    0     0   164k      0 --:--:-- --:--:-- --:--:--  164k

real    0m0.008s
user    0m0.005s
sys 0m0.001s

$ time curl elasticsearch:9200 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   336  100   336    0     0     65      0  0:00:05  0:00:05 --:--:--    83

real    0m5.108s
user    0m0.006s
sys 0m0.009s
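
For comparison, the same request can be timed by hostname and by the resolved address; resolving once up front via getent goes through the same 127.0.0.11 resolver, so that single lookup can still stall, but the IP-based request skips DNS entirely (sketch):

ES_IP=$(getent hosts elasticsearch | awk '{ print $1 }')  # resolve once up front
time curl -s elasticsearch:9200 > /dev/null               # goes through 127.0.0.11
time curl -s "$ES_IP:9200" > /dev/null                    # bypasses DNS entirely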
Zajozor commented 3 years ago

I'm having the same problem with a different container/image, using an M1 Mac.

In my case I have two containers: a REST server (Django) and a database (Postgres). When I use the hostname assigned by docker-compose to access the database from inside the REST server's container, random 5-second hangs occur.

When I change the configuration to use the IP address instead of the hostname, the hanging does not occur.
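
In case it helps others apply the same workaround, one way to find the address docker-compose assigned to a service (sketch; "db" is just a placeholder service name, and the address can change whenever the container is recreated):

# "db" is a placeholder for the docker-compose service name
docker inspect -f '{{ range .NetworkSettings.Networks }}{{ .IPAddress }}{{ end }}' "$(docker-compose ps -q db)"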

@sam0delkin thanks for pointing out that it might be a DNS resolution problem - it helped me a lot that I could "hotfix" this by using the IP for now :)

If there is any other info needed that could help fix this, please let me know!

wb-joona commented 3 years ago

I'm experiencing this same issue. Connections from my PHP application devcontainer to mysql and redis, when using the names assigned by docker-compose, hit 5-second delays and are in general just much slower (requests usually taking 1-3 seconds). With a PHP project, I suspect that because PHP-FPM uses worker processes to handle each request (no DNS cache, maybe?), the project feels VERY sluggish.

The workaround suggested by @Zajozor which was to change our environment variables to connect to the docker mysql and redis services via IP instead of host name worked amazingly well. I got the IPs by doing curl -v mysql and curl -v redis inside our dev container.
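
For anyone without curl in their container, getent is another quick way to grab those addresses (it goes through the same resolver, so the lookup itself may still be slow the first time):

getent hosts mysql redis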

annemarietannengrund commented 2 years ago

> The workaround suggested by @Zajozor which was to change our environment variables to connect to the docker mysql and redis services via IP instead of host name worked amazingly well.

I have the same (network/DNS/latency) issue with PHP and a microservice setup. Changing URLs to IPs does improve the situation, but it doesn't really fit my workflow. At the moment I'm the only one on my team with an M1 Mac, which makes it harder to argue for changing the project structure for one case where Docker is failing its own purpose. The quicker, non-workflow-breaking solution for me was to get a cheap x86 MBA and wait for a Docker fix. This issue is building up frustration for me; a fix on the Docker side to achieve x86-level network performance would be appreciated.

plufz commented 2 years ago

I’m scratching my head over this. How has this bug not been fixed after such a long time, and not even been addressed by the staff in the comments? Sorry for being dramatic, but this makes Docker on M1 almost useless, at least until you know that DNS is broken and change everything to IPs. Running my acceptance tests on an Intel Mac takes 3 minutes; on my maxed-out M1 it takes 30 minutes.

annemarietannengrund commented 2 years ago

> I’m scratching my head over this. How has this bug not been fixed after such a long time, and not even been addressed by the staff in the comments? Sorry for being dramatic, but this makes Docker on M1 almost useless, at least until you know that DNS is broken and change everything to IPs. Running my acceptance tests on an Intel Mac takes 3 minutes; on my maxed-out M1 it takes 30 minutes.

The issue lies in the fact that this issue was created after the MacBook Airs with M1 were released. Then the Pro and Max machines came out, and I assume they are too cool for the cheapest Apple product and can't reproduce it on their Pro and Max devices.

plufz commented 2 years ago

> and can't reproduce it on their Pro and Max devices.

Hmm, I'm not sure I follow. :) I have a maxed-out Max (64 GB RAM, 10-core M1 Max) with this issue.

pascalvgemert commented 2 years ago

@plufz just to confirm, which docker for mac version are you running?

plufz commented 2 years ago

> @plufz just to confirm, which docker for mac version are you running?

Docker Desktop 4.2.0 (Apple Silicon build). Docker version 20.10.10, build b485636.

Update: And the Intel Mac I'm comparing it to is a MacBook Pro 13" 2018 (A1989), so it's not exactly a monster, performance-wise.

pascalvgemert commented 2 years ago

@plufz try the 4.3.0 update. That update solved a couple of my issues last week!

annemarietannengrund commented 2 years ago

Going from buster-slim to bullseye-slim removed the issue for me.

pinktig commented 2 years ago

> Going from buster-slim to bullseye-slim removed the issue for me.

Thanks for the update!

I've been struggling with this issue myself and had monkey-patched it with the temporary solution suggested by @SwenVanZanten in #5626. But when I tried to run "docker-compose run --rm [service]" in parallel with running containers, I got an "Address already in use" error.

I changed my base image from buster to bullseye as suggested, and now the problem seems to be gone.

SwenVanZanten commented 2 years ago

> I changed my base image from buster to bullseye as suggested, and now the problem seems to be gone.

Yup, looks like that solved the problem.

fzberlin23 commented 2 years ago

> Going from buster-slim to bullseye-slim removed the issue for me.

Thank you so much @annemarietannengrund! Works!

docker-robott commented 2 years ago

Issues go stale after 90 days of inactivity. Mark the issue as fresh with /remove-lifecycle stale comment. Stale issues will be closed after an additional 30 days of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows. /lifecycle stale

jtraulle commented 2 years ago

/remove-lifecycle stale

ilyakamens commented 2 years ago

We encountered the same problem, and the workaround as described here also works for us.

docker-robott commented 1 year ago

There hasn't been any activity on this issue for a long time. If the problem is still relevant, add a comment on this issue. If not, this issue will be closed in 30 days.

Mark the issue as fresh with a /remove-lifecycle stale comment. Stale issues will be closed after an additional 30 days of inactivity.

Prevent issues from auto-closing with a /lifecycle frozen comment.

/lifecycle stale

ilyakamens commented 1 year ago

This is still relevant.

sam0delkin commented 1 year ago

/remove-lifecycle stale

glennschmidt commented 1 year ago

Can replicate this on an M1 Max, macOS 13.0.2, Docker Desktop 4.15.0

5-second delay when connecting to another container by its DNS name. Using the IP eliminates the delay.

joumak commented 1 year ago

Still an issue with Docker Desktop v4.15.0, macOS 12.4, MacBook Air M1 2020 (and with all my colleagues' Macs with different Apple Silicon processors).

beaugunderson commented 1 year ago

still an issue on M1 using Docker 4.17.0

dont-panic-42 commented 1 year ago

Docker Desktop 4.18.0, 2022 M1 Max MacBook Pro, macOS 12.6

I was experiencing the same slow and randomly very slow responses. The problems appeared when I switched from a Debian 11 based image to one based on Debian 10, and disappeared again when I changed to a different Debian 11 image.
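
In case it saves someone else the tracing, a quick way to check which Debian release an image is actually based on (sketch; the image name is a placeholder):

# "some/base-image" is a placeholder; substitute the image you actually use
docker run --rm some/base-image sh -c 'grep -E "PRETTY_NAME|VERSION_CODENAME" /etc/os-release'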

It took me a day of fruitless and frustrating tracing until I came across this issue, thank you OP and others for documenting it.

F7b5 commented 1 year ago

Docker Desktop 4.21.1 (114176), 2022 M2 MacBook Pro, macOS 13.4

This looks like the same issue, though with a little variation:

I'm running a website using php-fpm, nginx, redis, and mysql. It ran fine on a 2021 M1 MBP. Then I migrated to a 2022 M2 MBP using the Apple migration utility to move the data, and it suddenly became slow on the new computer.

The fix for me was to replace php:8.1.18-fpm with php:8.1.18-fpm-bullseye in my Docker image. Now it's running fine. Somehow, this issue doesn't seem to affect php:8.1-fpm; my other websites running that image are fast.

jfillmore commented 3 months ago

This problem has been giving me tons of grief, with frequent speed bumps due to these DNS timeouts. Specifically, I'm trying to resolve service hostnames set up by docker compose as part of a custom overlay network. I'm also on a MacBook Pro (M3 Max / Nov 2023). The Docker version is 4.31.0 (153195) with Compose v2.27.1-desktop.1.

As per earlier comments, upgrading from buster to bullseye does seem to help. In particular, this upgrades libnss_dns-2.28.so to libnss_dns-2.31.so. I ran a bunch of strace sessions to try to figure out why. The biggest difference I can find is that the newer version uses setsockopt(7, SOL_IP, IP_RECVERR, [1], 4), which presumably enables better UDP error detection so we don't have to wait the default 5 seconds for the ppoll timeouts and can retry faster.
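
If you want to confirm which glibc (and hence which libnss_dns) a given base image ships, a quick check (sketch):

docker run --rm debian:buster-slim   sh -c 'ldd --version | head -n 1'   # prints glibc 2.28
docker run --rm debian:bullseye-slim sh -c 'ldd --version | head -n 1'   # prints glibc 2.31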

However, despite its reproducibility (~30-50% of requests) before the upgrade, after upgrading I have yet to observe a single instance of additional UDP messages being sent, so I can't say that this potentially faster error recovery is what's helping. For v2.28, though, it's clear below that after sendmmsg times out we retry with two individual messages via sendto and finally succeed.

This still doesn't address what seems to be the core issue: why is the internal Docker DNS resolver so often failing to respond quickly to internal DNS lookups?

libnss_dns-2.31.so

socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 7
setsockopt(7, SOL_IP, IP_RECVERR, [1], 4) = 0
connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, 16) = 0
ppoll([...], 1, {tv_sec=0, tv_nsec=0}, NULL, 0) = 1 ([...], left {tv_sec=0, tv_nsec=0})
sendmmsg(7, [...], 2, MSG_NOSIGNAL) = 2
ppoll([...], 1, {tv_sec=5, tv_nsec=0}, NULL, 0) = 1 ([...], left {tv_sec=4, tv_nsec=999956666})
ioctl(7, FIONREAD, [34]) = 0
recvfrom(7, ""..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, [28->16]) = 34
ppoll([...], 1, {tv_sec=4, tv_nsec=999000000}, NULL, 0 <unfinished ...>
<... ppoll resumed>) = 0 (Timeout)
<... ppoll resumed>) = 1 ([...], left {tv_sec=4, tv_nsec=998999041})
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, 8) = 0
ioctl(7, FIONREAD <unfinished ...>
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART},  <unfinished ...>
<... ioctl resumed>, [66]) = 0
<... rt_sigaction resumed>NULL, 8) = 0
recvfrom(7,  <unfinished ...>
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART},  <unfinished ...>
<... recvfrom resumed>""..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, [28->16]) = 66
<... rt_sigaction resumed>NULL, 8) = 0
close(7 <unfinished ...>

libnss_dns-2.28.so

connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, 16) = 0
ppoll([...], 1, {tv_sec=0, tv_nsec=0}, NULL, 0) = 1 ([...], left {tv_sec=0, tv_nsec=0})
sendmmsg(3, [...], 2, MSG_NOSIGNAL) = 2
ppoll([...], 1, {tv_sec=5, tv_nsec=0}, NULL, 0) = 1 ([...], left {tv_sec=4, tv_nsec=999999208})
ioctl(3, FIONREAD, [34]) = 0
recvfrom(3, ""..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, [28->16]) = 34
ppoll([...], 1, {tv_sec=4, tv_nsec=999000000}, NULL, 0) = 1 ([...], left {tv_sec=4, tv_nsec=998999125})
ioctl(3, FIONREAD, [66]) = 0
recvfrom(3, ""..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, [28->16]) = 66
ppoll([...], 1, {tv_sec=4, tv_nsec=999000000}, NULL, 0 <unfinished ...>
<... ppoll resumed> ) = 0 (Timeout)
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, NULL, 8) = 0
ppoll(NULL, 0, {tv_sec=0, tv_nsec=3000000}, NULL, 0) = 0 (Timeout)
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, NULL, 8) = 0

... above lines repeat until `tv_nsec` reaches the 5sec timeout

rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, NULL, 8) = 0
ppoll(NULL, 0, {tv_sec=0, tv_nsec=199000000}, NULL, 0 <unfinished ...>
<... ppoll resumed> ) = 0 (Timeout)
ppoll([...], 1, {tv_sec=0, tv_nsec=0}, NULL, 0) = 1 ([...], left {tv_sec=0, tv_nsec=0})
sendto(3, ""..., 34, MSG_NOSIGNAL, NULL, 0) = 34
ppoll([...], 1, {tv_sec=5, tv_nsec=0}, NULL, 0) = 1 ([...], left {tv_sec=4, tv_nsec=999998000})
ioctl(3, FIONREAD, [66]) = 0
recvfrom(3, ""..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, [28->16]) = 66
ppoll([...], 1, {tv_sec=4, tv_nsec=996000000}, NULL, 0) = 1 ([...], left {tv_sec=4, tv_nsec=995998333})
sendto(3, ""..., 34, MSG_NOSIGNAL, NULL, 0) = 34
ppoll([...], 1, {tv_sec=4, tv_nsec=995000000}, NULL, 0) = 1 ([...], left {tv_sec=4, tv_nsec=994998583})
recvfrom(3, ""..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, [28->16]) = 34
close(3)      = 0
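
Until that question is answered, one possible mitigation (my own suggestion, not a confirmed fix) is to shorten glibc's 5-second per-query resolver timeout so a lost reply is retried sooner. These options end up in the container's /etc/resolv.conf; docker run exposes them as --dns-opt and docker-compose as dns_opt:

# sketch of a mitigation, not a fix: a dropped reply then costs roughly 1 second instead of 5
docker run --rm --dns-opt timeout:1 --dns-opt attempts:3 \
  debian:bullseye-slim cat /etc/resolv.conf

This doesn't make the embedded resolver answer any faster; it only bounds how long each stalled lookup blocks.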