sam0delkin opened this issue 3 years ago
Actually, after some research I found that this is a DNS resolution issue. When I changed my test snippets to connect to the services directly via IP, the problem is not reproduced. Finally, I installed the tcpdump utility to check the DNS traffic and found that the resolver did not reply in time, see:
# tcpdump -n -i lo port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
06:23:07.378225 IP 127.0.0.11.53 > 127.0.0.1.44276: 38280 1/0/0 A 172.18.0.4 (60)
06:23:07.379701 IP 127.0.0.11.53 > 127.0.0.1.44276: 38536 0/0/0 (31)
06:23:19.938697 IP 127.0.0.11.53 > 127.0.0.1.46712: 64518 1/0/0 A 172.18.0.4 (60)
06:23:19.938771 IP 127.0.0.11.53 > 127.0.0.1.46712: 64518 0/0/0 (31)
06:23:20.940249 IP 127.0.0.11.53 > 127.0.0.1.34752: 16399 0/0/0 (31)
06:23:20.940414 IP 127.0.0.11.53 > 127.0.0.1.34752: 16399 1/0/0 A 172.18.0.4 (60)
06:23:25.946796 IP 127.0.0.11.53 > 127.0.0.1.34752: 16399 1/0/0 A 172.18.0.4 (60)
06:23:25.948185 IP 127.0.0.11.53 > 127.0.0.1.34752: 16399 0/0/0 (31)
and corresponding snippet output:
$ time curl elasticsearch:9200 > /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 336 100 336 0 0 164k 0 --:--:-- --:--:-- --:--:-- 164k
real 0m0.008s
user 0m0.005s
sys 0m0.001s
$ time curl elasticsearch:9200 > /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 336 100 336 0 0 65 0 0:00:05 0:00:05 --:--:-- 83
real 0m5.108s
user 0m0.006s
sys 0m0.009s
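A rough way to quantify how often the stall happens (just a sketch, not something run in this thread) is to time a batch of lookups by name from inside the container. It assumes a Debian-based image where getent and GNU date are available, and reuses the elasticsearch service name from above:

for i in $(seq 1 50); do
  start=$(date +%s%N)                      # nanoseconds before the lookup
  getent hosts elasticsearch > /dev/null   # name lookup through the libc resolver (127.0.0.11)
  end=$(date +%s%N)
  echo "lookup $i: $(( (end - start) / 1000000 )) ms"
done

With a healthy resolver every line comes back in low single-digit milliseconds; on the affected setups described here, occasional ~5000 ms entries show up.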
I'm having the same problem with a different container/image, using an M1 Mac.
In my case I have two containers: a REST server (Django) and a database (Postgres). When using the hostname assigned by docker-compose to access the database from inside the REST server's container, random 5-second hangs occur.
When I change the configuration to use the IP address instead of the hostname, the hanging does not occur.
@sam0delkin thanks for pointing out that it might be a DNS resolution problem - it helped me a lot that I could "hotfix" this by using the IP for now :)
If there is any other info needed that could help fix this, please let me know!
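For anyone scripting the same workaround, here is a minimal sketch (the db service name and the DATABASE_URL variable are placeholders, not from this thread) that resolves the compose service IP once at container start so the application talks to the address rather than the name:

DB_IP="$(getent hosts db | awk '{print $1}')"                    # first column is the IPv4 address
export DATABASE_URL="postgres://app:secret@${DB_IP}:5432/app"    # hand the literal IP to the app

Note that the IP can change whenever the container is recreated, so this is a stopgap rather than a fix.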
I'm experiencing this same issue. Connections from my PHP application devcontainer to mysql and redis, when using the names assigned by docker-compose, experience 5-second delays and are in general just much slower (requests usually taking 1-3 seconds). Since PHP-FPM uses worker processes to handle each request (so no DNS cache, maybe?), the project feels VERY sluggish.
The workaround suggested by @Zajozor, which was to change our environment variables to connect to the docker mysql and redis services via IP instead of hostname, worked amazingly well. I got the IPs by running curl -v mysql and curl -v redis inside our dev container.
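If you prefer to look the addresses up from the host instead of curl -v, something like this should also work (myproject_default is a placeholder for your compose project's network name):

docker network inspect myproject_default \
  --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}'   # prints "name address/prefix" per container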
I have the same (network/DNS/latency) issue with PHP and a microservice setup. Changing URLs to IPs does improve the situation, but it doesn't really fit into my workflow. At the moment I'm the only one on my team with an M1 Mac, which makes it hard to argue for changing the project structure for the one case where docker is failing its own purpose. The quicker, non-workflow-breaking solution for me was to get a cheap x86 MBA and wait for a Docker fix. This issue is building up frustration for me; a fix on the Docker side to achieve x86-level network performance would be appreciated.
I'm scratching my head over this. How is this bug not fixed after such a long time, and not even addressed by the staff in the comments? Sorry for being dramatic, but this makes Docker for M1 almost useless, at least until you know that DNS is broken and change everything to IPs. Running my acceptance tests on an Intel Mac takes 3 minutes; on my maxed-out M1 it takes 30 minutes.
The issue lies in the fact that it was created right after the MacBook Airs with M1 were released. Then the Pro and Max machines came out, and I assume they are too cool for the cheapest Apple product and can't reproduce it on their Pro and Max devices.
Hmm, I'm not sure I follow. :) I have a maxed-out M1 Max (64 GB RAM, 10 cores) with this issue.
@plufz just to confirm, which docker for mac version are you running?
Docker Desktop 4.2.0 (Apple Silicon build). Docker version 20.10.10, build b485636.
Update: and the Intel Mac I'm comparing it to is a MacBook Pro 13" 2018 (A1989), so it's not exactly a monster performance-wise.
@plufz try the 4.3.0 update. That update solved a couple of my issues last week!
Going from buster-slim to bullseye-slim removed the issue for me.
Thanks for the update!
I've been struggling with this issue myself and had monkey-patched it with the temp solution suggested by @SwenVanZanten in #5626. But then, when I tried to run "docker-compose run --rm [service]" in parallel with running containers, I got an "Address already in use" error.
I changed my base image from buster to bullseye as suggested, and now the problem seems to be gone.
Yup, looks like that solved the problem.
thank you so much @annemarietannengrund! works!
Issues go stale after 90 days of inactivity.
Mark the issue as fresh with a /remove-lifecycle stale comment.
Stale issues will be closed after an additional 30 days of inactivity.
Prevent issues from auto-closing with a /lifecycle frozen comment.
If this issue is safe to close now please do so.
Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows. /lifecycle stale
/remove-lifecycle stale
We encountered the same problem, and the workaround as described here also works for us.
There hasn't been any activity on this issue for a long time. If the problem is still relevant, add a comment on this issue. If not, this issue will be closed in 30 days.
Mark the issue as fresh with a /remove-lifecycle stale comment.
Stale issues will be closed after an additional %v days of inactivity.
Prevent issues from auto-closing with a /lifecycle frozen comment.
/lifecycle stale
This is still relevant.
/remove-lifecycle stale
Can replicate this on an M1 Max, macOS 13.0.2, Docker Desktop 4.15.0
5-second delay when connecting to another container by its DNS name. Using the IP eliminates the delay.
Still an issue with Docker Desktop v4.15.0, macOS 12.4, MacBook Air M1 2020 (and on all my colleagues' Macs with different Apple Silicon processors).
still an issue on M1 using Docker 4.17.0
Docker Desktop 4.18.0, 2022 M1 Max MacBook Pro, macOS 12.6
I was experiencing the same slow and randomly very slow responses. The problems appeared when I switched from a Debian 11 based image to one based on Debian 10, and disappeared again when I changed to a different Debian 11 based image:
- php:7.3-apache (built on bullseye/Debian 11 slim): fast, fine, no problem
- php:8.1-apache-buster: slow, randomly very slow
- php:8.1-apache-bullseye: fast, problem disappeared
It took me a day of fruitless and frustrating tracing until I came across this issue; thank you OP and others for documenting it.
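For anyone else triaging tags, a quick check of what a given tag is actually built on (bullseye = Debian 11 / glibc 2.31, buster = Debian 10 / glibc 2.28), using the image names from the comment above:

docker run --rm php:8.1-apache-buster   cat /etc/debian_version    # prints a 10.x release for buster
docker run --rm php:8.1-apache-bullseye cat /etc/debian_version    # prints an 11.x release for bullseye
docker run --rm php:8.1-apache-bullseye ldd --version | head -n 1  # shows the bundled glibc version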
Docker Desktop 4.21.1 (114176), 2022 M2 MacBook Pro, macOS 13.4
Looks like the same issue, though with a small variation:
I'm running a website using php-fpm, nginx, redis and mysql. It ran fine on a 2021 M1 MBP. Then I migrated to a 2022 M2 MBP using the Apple migration utility, and it suddenly became slow on the new computer.
The fix for me was to replace php:8.1.18-fpm with php:8.1.18-fpm-bullseye in my Docker image. Now it's running fine. Somehow this issue doesn't seem to affect php:8.1-fpm; my other websites running that image are fast.
This problem has been giving me tons of grief, with frequent speed bumps due to these DNS timeouts. Specifically, I'm trying to resolve service hostnames set up by docker compose that are part of a custom overlay network. I'm also on a MacBook Pro (M3 Max / Nov 2023). Docker version is 4.31.0 (153195) with Compose v2.27.1-desktop.1.
As per earlier comments, upgrading from buster to bullseye does seem to help. In particular, this upgrades libnss_dns-2.28.so to libnss_dns-2.31.so. I ran a bunch of strace sessions to try to figure out why. The biggest difference I can find is that the newer version uses setsockopt(7, SOL_IP, IP_RECVERR, [1], 4), which presumably enables better UDP error detection so we don't have to wait out the default 5-second ppoll timeouts and can retry faster.
However, despite its reproducibility (~30-50% of requests) before, after upgrading I have yet to observe a single instance of additional UDP messages being sent, which would show this potential faster error recovery helping. For v2.28, though, it's clear in the second trace below that after sendmmsg times out we retry with two individual messages via sendto and finally succeed.
This still doesn't address what seems to be the core issue: why is the internal Docker DNS resolver so often failing to respond quickly to internal DNS lookups?
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 7
setsockopt(7, SOL_IP, IP_RECVERR, [1], 4) = 0
connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, 16) = 0
ppoll([...], 1, {tv_sec=0, tv_nsec=0}, NULL, 0) = 1 ([...], left {tv_sec=0, tv_nsec=0})
sendmmsg(7, [...], 2, MSG_NOSIGNAL) = 2
ppoll([...], 1, {tv_sec=5, tv_nsec=0}, NULL, 0) = 1 ([...], left {tv_sec=4, tv_nsec=999956666})
ioctl(7, FIONREAD, [34]) = 0
recvfrom(7, ""..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, [28->16]) = 34
ppoll([...], 1, {tv_sec=4, tv_nsec=999000000}, NULL, 0 <unfinished ...>
<... ppoll resumed>) = 0 (Timeout)
<... ppoll resumed>) = 1 ([...], left {tv_sec=4, tv_nsec=998999041})
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, 8) = 0
ioctl(7, FIONREAD <unfinished ...>
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, <unfinished ...>
<... ioctl resumed>, [66]) = 0
<... rt_sigaction resumed>NULL, 8) = 0
recvfrom(7, <unfinished ...>
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, <unfinished ...>
<... recvfrom resumed>""..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, [28->16]) = 66
<... rt_sigaction resumed>NULL, 8) = 0
close(7 <unfinished ...>
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, 16) = 0
ppoll([...], 1, {tv_sec=0, tv_nsec=0}, NULL, 0) = 1 ([...], left {tv_sec=0, tv_nsec=0})
sendmmsg(3, [...], 2, MSG_NOSIGNAL) = 2
ppoll([...], 1, {tv_sec=5, tv_nsec=0}, NULL, 0) = 1 ([...], left {tv_sec=4, tv_nsec=999999208})
ioctl(3, FIONREAD, [34]) = 0
recvfrom(3, ""..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, [28->16]) = 34
ppoll([...], 1, {tv_sec=4, tv_nsec=999000000}, NULL, 0) = 1 ([...], left {tv_sec=4, tv_nsec=998999125})
ioctl(3, FIONREAD, [66]) = 0
recvfrom(3, ""..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, [28->16]) = 66
ppoll([...], 1, {tv_sec=4, tv_nsec=999000000}, NULL, 0 <unfinished ...>
<... ppoll resumed> ) = 0 (Timeout)
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, NULL, 8) = 0
ppoll(NULL, 0, {tv_sec=0, tv_nsec=3000000}, NULL, 0) = 0 (Timeout)
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, NULL, 8) = 0
... above lines repeat until `tv_nsec` reaches the 5sec timeout
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[PIPE], sa_flags=SA_RESTART}, NULL, 8) = 0
ppoll(NULL, 0, {tv_sec=0, tv_nsec=199000000}, NULL, 0 <unfinished ...>
<... ppoll resumed> ) = 0 (Timeout)
ppoll([...], 1, {tv_sec=0, tv_nsec=0}, NULL, 0) = 1 ([...], left {tv_sec=0, tv_nsec=0})
sendto(3, ""..., 34, MSG_NOSIGNAL, NULL, 0) = 34
ppoll([...], 1, {tv_sec=5, tv_nsec=0}, NULL, 0) = 1 ([...], left {tv_sec=4, tv_nsec=999998000})
ioctl(3, FIONREAD, [66]) = 0
recvfrom(3, ""..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, [28->16]) = 66
ppoll([...], 1, {tv_sec=4, tv_nsec=996000000}, NULL, 0) = 1 ([...], left {tv_sec=4, tv_nsec=995998333})
sendto(3, ""..., 34, MSG_NOSIGNAL, NULL, 0) = 34
ppoll([...], 1, {tv_sec=4, tv_nsec=995000000}, NULL, 0) = 1 ([...], left {tv_sec=4, tv_nsec=994998583})
recvfrom(3, ""..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, [28->16]) = 34
close(3) = 0
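Not a fix from this thread, but while the resolver behaviour is being chased down, one general glibc-side mitigation sketch is to shorten the retry timeout (the default is the 5 seconds visible in the ppoll calls above) and disable the parallel A/AAAA queries that sendmmsg sends, so a dropped reply costs roughly a second instead of five. Inside the container:

echo "options timeout:1 attempts:3 single-request" >> /etc/resolv.conf   # glibc resolver options, see resolv.conf(5)

The same options can usually be set declaratively through docker compose's dns_opt (or docker run --dns-opt) if editing resolv.conf by hand feels too hacky.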
I found that on the RC3 build networking is randomly slow; see these two examples. 1 (this is a sequential call of show databases in MySQL, which is running the ARM version):
So, in some cases an additional 5-second delay is added to the connection time. I thought the problem was in MySQL itself, but I tried the same thing with another container (running Elasticsearch, on ARM as well). 2 (this is a sequential call of just curl-ing the default Elasticsearch port):
So, here one can see absolutely the same situation as with the MySQL connection. I thought the problem was with DNS, but after I ran the command
time nslookup elasticsearch
100 times in a row, I didn't see any performance issues with the DNS resolver, so it looks like the problem is in the network itself. Additionally, here are my images for MySQL and Elasticsearch: