abiosoft / colima

Container runtimes on macOS (and Linux) with minimal setup
MIT License

Docker cannot pull images #137

Closed johannmayer closed 12 months ago

johannmayer commented 2 years ago

Hi,

I just installed colima on a MacBook Pro with Big Sur 11.6.2.

colima version 0.3.2
git commit: 272db4732b90390232ed9bdba955877f46a50552

runtime: docker
arch: x86_64
client: v20.10.12
server: v20.10.11

When I try to pull an image with Docker, I get an i/o timeout error. It seems that the colima system doesn't have an internet connection.

docker pull maven
Using default tag: latest
Error response from daemon: Get "https://registry-1.docker.io/v2/": dial tcp: lookup registry-1.docker.io on 192.168.5.3:53: read udp 192.168.5.15:56157->192.168.5.3:53: i/o timeout
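The error text names the resolver that the failed lookup was sent to, which helps tell apart the variants reported in this thread. The following is a minimal sketch, not part of colima or docker: `dns_server_from_error` is a hypothetical helper that assumes the Go-style `lookup <host> on <server>:53:` wording seen in these messages.

```shell
#!/bin/sh
# Hypothetical helper (not part of colima or docker): print the DNS server
# that a failed lookup error says it queried.
dns_server_from_error() {
  # "lookup <host> on <server>:53:" -> <server>
  sed -n 's/.*lookup [^ ]* on \([^ ]*\):53:.*/\1/p'
}

# Sample messages taken from this thread.
printf '%s\n' 'dial tcp: lookup registry-1.docker.io on 192.168.5.3:53: read udp 192.168.5.15:56157->192.168.5.3:53: i/o timeout' | dns_server_from_error
printf '%s\n' 'dial tcp: lookup registry-1.docker.io on [::1]:53: read udp [::1]:45220->[::1]:53: read: connection refused' | dns_server_from_error
```

For the two sample messages this prints `192.168.5.3` and `[::1]`. The first is the single nameserver that later comments identify as the resolver configured in the VM. The second is the IPv6 loopback address, so that request was presumably sent to a local resolver rather than to 192.168.5.3.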

Are there any post-install steps to get a connection?

abiosoft commented 2 years ago

are you behind a VPN connection?

johannmayer commented 2 years ago

Yes, I am behind a corporate VPN connection.

spkane commented 2 years ago

I am not on a VPN or using docker with colima, but I see a similar issue:

I get a DNS related error on my first build with nerdctl via containerd after I have started the alpine VM. Simply re-running the command fixes things until I restart the VM.

$ nerdctl build --namespace k8s.io --platform linux/amd64 -t test/test:local -f ./Dockerfile .
[+] Building 0.2s (4/4) FINISHED
...
error: failed to solve: alpine:latest: failed to do request: Head "https://registry-1.docker.io/v2/library/alpine/manifests/latest": dial tcp: lookup registry-1.docker.io on [::1]:53: read udp [::1]:45220->[::1]:53: read: connection refused
FATA[0000] unrecognized image format
FATA[0000] exit status 1

Second Try:

$ nerdctl build --namespace k8s.io --platform linux/amd64 -t test/test:local -f ./Dockerfile .
[+] Building 0.2s (4/4) FINISHED
...
[+] Building 9.7s (7/17)
 => [internal] load build definition from Dockerfile                                  0.1s
 => => transferring dockerfile: 580B                                                  0.1s
 => [internal] load .dockerignore                                                     0.1s
 => => transferring context: 306B                                                     0.1s
 => [internal] load metadata for docker.io/library/alpine:latest                      0.4s
 => [internal] load metadata for docker.io/library/golang:1.17
...

cschmatzler commented 2 years ago

I am running into the same error, without any VPN connection.

❯ colima version
colima version 0.3.2
git commit: 272db4732b90390232ed9bdba955877f46a50552

runtime: docker
arch: aarch64
client: v20.10.10
server: v20.10.11

starvsion commented 2 years ago

I resolved it by doing colima start --port-interface 127.0.0.1

Correction: colima start --port-interface 127.0.0.1 -s

but it fails after pulling in more data

niroowns commented 2 years ago

For those of us behind a VPN, how do I configure docker to use a proxy?

spkane commented 2 years ago

This is a good overview of DNS issues in Alpine and might be at the core of some of these DNS issues:

https://support.cloudbees.com/hc/en-us/articles/360040999471-UnknownHostException-caused-by-DNS-Resolution-issue-with-Alpine-Images

Their main fix was to migrate to RedHat's Universal Base Images (UBI) - https://developers.redhat.com/products/rhel/ubi

There is a workaround as well, that I will try when I have a bit of time to test it.

pensatocriminale commented 2 years ago

I am seeing this issue now too, after it had been working for me initially, e.g. -

% docker pull lscr.io/linuxserver-labs/daedalos
Using default tag: latest
Error response from daemon: Get "https://ghcr.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

and testing on multiple networks.

yoedusvany commented 2 years ago

Same here:

docker pull hello-world
Using default tag: latest
error during connect: Post "http://%2FUsers%2Fxxxxxx%2F.colima%2Fdocker.sock/v1.41/images/create?fromImage=hello-word&tag=latest": EOF

AlexLombry commented 2 years ago

Hello, I have this error too:

Error response from daemon: Get "https://registry-1.docker.io/v2/": dial tcp: lookup registry-1.docker.io on 192.168.5.3:53: read udp 192.168.5.15:33676->192.168.5.3:53: i/o timeout

Sometimes it's a timeout, sometimes another error.

I tried installing it on macOS without any VPN whatsoever, and I don't understand the issue. I've also tested multiple configurations such as Rancher Desktop, minikube + hyperkit, podman, etc., and I have this issue only with Colima.

Has anyone found a solution for this?

For instance, if I run docker run hello-world, it works for almost 30 seconds after colima starts. Then it crashes and I finally get this error. After that, the error happens every time.

wolf31o2 commented 2 years ago

It's Alpine. The musl DNS resolver is pretty terrible. It behaves differently from glibc in many ways.

abiosoft commented 2 years ago

> It's Alpine. The musl DNS resolver is pretty terrible. It behaves differently from glibc in many ways.

I am just realising this

spkane commented 2 years ago

There are details about this here: https://wiki.musl-libc.org/functional-differences-from-glibc.html#Name-Resolver/DNS

pedantic79 commented 2 years ago

I've been experiencing DNS failures randomly too, especially when there are many queries in quick succession. Would having a caching DNS server sit between the qemu DNS and the containers help? I may try to set one up manually to see if it helps the situation.

jandubois commented 2 years ago

I'm not convinced the differences between glibc and musl are the root cause here; unless colima does something different, there should be only a single nameserver in /etc/resolv.conf, and it should point to the lima internal host resolver.

I found one bug with this very recently: we disable IPv6 lookups in Lima by default because they often end up not working. The issue was though that instead of responding with an empty response, we handed the request to the resolver on the host, which might then add some random error for the IPv6 query to our response.

In my specific test case, I got the right DNS information when I looked with nslookup or dig, but curl could not connect. So I guess the musl resolver could share some blame, but the main blame belongs on our own DNS implementation (at least for this particular case).

This should be fixed in the forthcoming lima 0.8.3 release. So I would appreciate if you could all re-test with that version (once released), and report back if this improved/fixed the situation!

abiosoft commented 2 years ago

> I'm not convinced the differences between glibc and musl are the root cause here; unless colima does something different, there should be only a single nameserver in /etc/resolv.conf, and it should point to the lima internal host resolver.

This is the case in Colima as well, and the single nameserver is 192.168.5.3.

> This should be fixed in the forthcoming lima 0.8.3 release. So I would appreciate if you could all re-test with that version (once released), and report back if this improved/fixed the situation!

Looking forward to it. Thanks.

navels commented 2 years ago

New colima user here, running into this right off the bat. lima version is 0.8.3, colima 0.3.3. This workaround fixed it for me: https://github.com/abiosoft/colima/issues/140#issuecomment-1028395976

pedantic79 commented 2 years ago

> > I'm not convinced the differences between glibc and musl are the root cause here; unless colima does something different, there should be only a single nameserver in /etc/resolv.conf, and it should point to the lima internal host resolver.

> This is the case in Colima as well, and the single nameserver is 192.168.5.3.

> > This should be fixed in the forthcoming lima 0.8.3 release. So I would appreciate if you could all re-test with that version (once released), and report back if this improved/fixed the situation!

> Looking forward to it. Thanks.

@abiosoft Do we need to wait for a colima release for this? Running colima 0.3.3, and lima 0.8.3.

I experience this error:

Unable to connect to the server: dial tcp: lookup private.hostname.from.internal.company.com on 192.168.5.3:53: read udp 172.17.0.2:34738->192.168.5.3:53: i/o timeout

When I go into the VM:

dnn@overwatch ~ » colima ssh
colima:/Users/dnn$ nslookup private.hostname.from.internal.company.com
;; connection timed out; no servers could be reached

This happens because I'm running a script that is doing the same lookup over and over again very quickly. If I stop for a few minutes and try again, the DNS lookup is okay.
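The "same lookup over and over again very quickly" pattern can be reproduced in a controlled way. This is a rough sketch rather than an established test: `count_lookup_failures` is a hypothetical helper that runs any command a given number of times and counts non-zero exits. It assumes the lookup tool (for example nslookup inside the VM) exits non-zero when the lookup fails.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times back to back and report how
# many runs exited non-zero. Assumes a failed lookup means a non-zero exit.
count_lookup_failures() {
  n="$1"
  shift
  fails=0
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" >/dev/null 2>&1 || fails=$((fails + 1))
    i=$((i + 1))
  done
  printf '%s/%s lookups failed\n' "$fails" "$n"
}

# Self-contained demonstration with a command that always fails.
count_lookup_failures 3 false

# Possible use after `colima ssh` (hostname is only an example):
#   count_lookup_failures 200 nslookup registry-1.docker.io
```

Comparing the failure count for a rapid burst with the count after a pause of a few minutes would show whether the resolver recovers once the query rate drops, as described above.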

abiosoft commented 2 years ago

@pedantic79 a lima upgrade should be all that is required.

For troubleshooting purposes, can you kindly try this https://github.com/abiosoft/colima/issues/140#issuecomment-1028395976 and see if the behaviour is different? Note that it requires recreating the VM to see the effect, i.e. colima delete (if it exists) prior to starting.

rahul286 commented 2 years ago

I also faced the same issue, but it was resolved by specifying a DNS resolver:

colima start --dns 1.1.1.1

pedantic79 commented 2 years ago

@abiosoft Yes that seems to fix things. I ended up using 192.168.5.2, the host, since work runs a dns proxy on my laptop. This way I can resolve private addresses not on the public DNS.
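Putting the pieces from the last few comments together, the workaround could look like the sketch below. It assumes that the `--dns` flag shown earlier needs a freshly created VM to take effect, as was noted for the linked workaround; that has not been confirmed for `--dns` itself. `recreate_with_dns` and `DRY_RUN` are hypothetical names, and by default the commands are only printed because `colima delete` removes the VM together with its images and containers.

```shell
#!/bin/sh
# Sketch: recreate the colima VM with an explicit DNS server.
# DRY_RUN defaults to 1, so the commands are printed, not executed.
recreate_with_dns() {
  dns="$1"
  for cmd in "colima delete" "colima start --dns $dns"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      printf '%s\n' "$cmd"
    else
      $cmd   # unquoted on purpose so the string splits into command + args
    fi
  done
}

# Public resolver, as in the comment above; 192.168.5.2 (the host) is the
# alternative mentioned for resolving private names.
recreate_with_dns 1.1.1.1
```

Running it with `DRY_RUN=0` executes the two commands instead of printing them.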

abiosoft commented 2 years ago

Can anyone try the latest development version and see if anything changes?

brew install --HEAD colima

navels commented 2 years ago

Nope. A reasonable test for me is to download a large-ish (~1.5 GB) image:

docker image rm localstack/localstack
docker pull localstack/localstack:latest

which will get part of the way through and then stall:

Using default tag: latest
latest: Pulling from localstack/localstack
69bf0018a85c: Pull complete
d99d2ad45cad: Pull complete
2f5e7e852b75: Pull complete
9bdba4da0515: Pull complete
6d148a48367a: Pull complete
4f136f6bab8f: Pull complete
abd3b9714a4d: Pull complete
50eebec84093: Pull complete
a7f30185d16d: Pull complete
a0e7ef63792a: Pull complete
6e070eb76685: Pull complete
6fb969c1cc11: Pull complete
6b72ad47a399: Pull complete
5a968b0e80e9: Pull complete
4f4fb700ef54: Pull complete
f7deb66a5a33: Pull complete
318d55565698: Pull complete
565ac449cbaa: Pull complete
973b9108c62f: Pull complete
abe7f386e549: Pull complete
6af74865c5fb: Pull complete
b4ff06af1df8: Pull complete
b93bdfca7413: Pull complete
6e0f2f6fe87b: Pull complete
348542de0a59: Pull complete
338328b1acd7: Pull complete
343ae7575c43: Retrying in 1 second
ecaf8f60df9e: Retrying in 1 second
c01474015845: Retrying in 1 second
31c659c48f0f: Waiting
b146a65269aa: Waiting
b19b566fb94a: Waiting

and subsequent attempts:

Error response from daemon: Get "https://registry-1.docker.io/v2/": dial tcp: lookup registry-1.docker.io on 192.168.5.3:53: read udp 192.168.5.15:56456->192.168.5.3:53: i/o timeout

making me wonder if I am getting throttled or running out of sockets or something.

Using docker desktop this pull is a breeze.
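When comparing runs like the one above, a compact summary of the layer states is easier to diff than the full output. The helper below is a hypothetical sketch; it assumes the per-layer status lines keep the `<id>: Pull complete`, `<id>: Retrying in ...` and `<id>: Waiting` wording shown in this comment.

```shell
#!/bin/sh
# Hypothetical helper: count layer states in `docker pull` output on stdin.
summarize_pull() {
  awk '
    /: Pull complete/ { c++ }
    /: Retrying in /  { r++ }
    /: Waiting/       { w++ }
    END { printf "complete=%d retrying=%d waiting=%d\n", c, r, w }
  '
}

# Demonstration on a few lines copied from the output above.
summarize_pull <<'EOF'
338328b1acd7: Pull complete
343ae7575c43: Retrying in 1 second
ecaf8f60df9e: Retrying in 1 second
31c659c48f0f: Waiting
EOF
```

The demonstration prints `complete=1 retrying=2 waiting=1`. A saved capture of a stalled pull can be fed to it the same way, for example `summarize_pull < pull.log`, where `pull.log` is an assumed file name.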

abiosoft commented 2 years ago

@navels I'd be interested in knowing if there are any specifics to your network connection, as I am struggling to reproduce this. I do get Retrying in x secs once in a while, but the retries are successful and it never gets bad enough for the image pulling to terminate.

Can you kindly share the output of colima version ?

Thanks.

DannyAtDejero commented 2 years ago

@abiosoft I'm seeing the same timeout and lookup failure as @navels, only in my case it was triggered by pushing a number of images in quick succession instead of pulling a single large one. I've confirmed that docker pull localstack/localstack:latest often fails with endless retry messages for me as well.

% colima version
colima version HEAD-5e2e413
git commit: 5e2e41310e595553dcdc29ba45827d4030af37bb

Other details that might be helpful:

Ping output from within the VM used to be very strange with a constantly increasing round trip and DUP packets, but that appears to be fixed in this latest version. 👍

navels commented 2 years ago

> colima version
colima version HEAD-5e2e413
git commit: 5e2e41310e595553dcdc29ba45827d4030af37bb

runtime: docker
arch: aarch64
client: v20.10.13
server: v20.10.11

I have this problem at home and at work, on and off VPN. This is on an M1 Mac Pro. Network speeds are about the same at both locations: ~300 Mbps.

Aha . . . I just tried a few different configurations and it seems to happen with more CPUs. With 1-2 CPUs I didn't have any issues. With 3 I do. My normal configuration is 8 CPUs.

Double-checked my docker desktop config: 8 CPUs.

jasoncodes commented 2 years ago

I’ve run into these DNS issues too, and I’ve found that changing my DNS to use the gateway of the VDE network works well for me. If you want to see if this workaround will work for you too, try running the following before your test:

colima ssh -- sudo sh -c 'echo nameserver 192.168.106.1 > /etc/resolv.conf'

This temporary patch can be reverted by restarting colima or running the above again with 192.168.5.3. I have the following in ~/.lima/_config/override.yaml to make this change persistent:

useHostResolver: false
dns:
  - 192.168.106.1
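For anyone whose VDE gateway address differs, the same override can be generated for other addresses. This is only a sketch: `lima_dns_override` is a hypothetical helper that emits the two keys shown above and nothing else, and it prints to standard output rather than touching any file.

```shell
#!/bin/sh
# Hypothetical helper: print a Lima override that disables the host
# resolver and lists the given DNS servers, mirroring the snippet above.
lima_dns_override() {
  printf 'useHostResolver: false\n'
  printf 'dns:\n'
  for ip in "$@"; do
    printf '  - %s\n' "$ip"
  done
}

lima_dns_override 192.168.106.1
```

Redirecting the output to ~/.lima/_config/override.yaml would replace whatever that file already contains, so it is worth checking for existing settings first.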

navels commented 2 years ago

Yep, yep, there are workarounds, just trying to help @abiosoft troubleshoot.

spkane commented 2 years ago

I am also still seeing issues with the use case that I reported in https://github.com/abiosoft/colima/issues/137#issuecomment-1018721366

The first time I run something like:

nerdctl build --namespace k8s.io --platform linux/amd64 -t test/test:local -f ./Dockerfile .

it fails with:

After another one or two tries (so likely after some short amount of time from the first attempt) it works and then continues to work.

abiosoft commented 2 years ago

@spkane can you try the latest development version with brew install --HEAD colima and see if that improves anything?

abiosoft commented 2 years ago

@navels you likely weren't running colima with vde networking enabled, as the fix for M1 devices just got pushed. Can you try installing again with brew install --HEAD colima and get rid of /opt/colima with sudo rm -rf /opt/colima?

Does that change anything?

navels commented 2 years ago

Unfortunately no change, fails with 3 CPUs.

colima version HEAD-3fc20b2

abiosoft commented 2 years ago

@navels are you able to see the IP address in the output of colima ls?

navels commented 2 years ago

Yep: 192.168.106.2

ramunasd commented 2 years ago

@abiosoft The latest HEAD has a much more stable network on an Apple M1 CPU with 4 cores enabled, although the DNS issue is still present.

colima version HEAD-37a6de0
git commit: 37a6de0ef4fe631c7b34e69697c5234a9cdd5541

runtime: docker
arch: aarch64
client: v20.10.14
server: v20.10.11

cognifloyd commented 2 years ago

Does anyone have Cisco AnyConnect installed?

I have an Intel Mac that I just upgraded from Catalina to Monterey. Since the upgrade, I've been experiencing various network timeouts, but the DNS issues in colima were the most pronounced, as they blocked my use of docker pull. Outside of colima, git was often hanging as well, so I didn't think it was a uniquely colima issue and kept looking after I found this thread.

I have Cisco AnyConnect installed which I occasionally use to connect to a VPN. After the Monterey update, "Cisco AnyConnect Socket Filter" showed up and asked for permission to run a new SystemExtension. I allowed it at that point, but I think that was the culprit behind all my network issues. Here are some other issues people experienced with it: https://apple.stackexchange.com/questions/420773/the-process-com-cisco-anyconnect-macos-acsockext-hogs-mac-cpu-but-cannot-be-kill

This service is suspicious (to me) because its "features" are (based on the docs):

So, I just deleted Cisco AnyConnect Socket Filter (deleted it from the Applications) which removed the SystemExtension. And, I stopped its annoying "notification" service from pestering me about it on reboot.

$ launchctl blame cisco
# This prints a list of the services. You want the gui/...cisco.anyconnect.notification... one.
$ launchctl disable gui/<number>/application.com.cisco.anyconnect.notification.<number>.<number>
$ launchctl stop gui/<number>/application.com.cisco.anyconnect.notification.<number>.<number>
$ launchctl kill 9 gui/<number>/application.com.cisco.anyconnect.notification.<number>.<number>

After doing all of that (and another reboot), DNS works in colima again!

navels commented 1 year ago

I stopped using colima a while ago but just tried this again and am not getting the errors, so it was either fixed in colima or in the Mac networking stack (Sonoma on an M1 Pro).