DNS names that can't be resolved in Colima, possibly only with gvproxy network driver

rfay commented 1 year ago

Description

I'm starting this issue so we can track down the specific DNS names that fail to resolve in colima/lima, along with the sources of information. I get this question all the time and tell people to use --dns 1.1.1.1, which almost always fixes it. But I think we should start tracking which names fail so that maybe we can solve this someday.

Issue	hostname
https://github.com/drud/ddev/issues/4372	mavtek-840225427682.d.codeartifact.us-east-1.amazonaws.com
https://github.com/drud/ddev/issues/4413	www.youtube.com (seems to be youtube-ui.l.google.com)
https://github.com/abiosoft/colima/issues/466#issuecomment-1327977342	test12345.s3.ap-northeast-1.amazonaws.com
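
To help grow this list, a small loop can record which names fail to resolve. This is only a sketch: `check_names` is a hypothetical helper, and it assumes the resolver command you pass in (for example `nslookup`) exits non-zero when a lookup fails.

```shell
#!/bin/sh
# check_names RESOLVER HOST...
# Runs "RESOLVER HOST" for each host and prints "OK" or "FAIL" next to it.
# Assumption: RESOLVER exits non-zero when the name cannot be resolved.
check_names() {
  resolver=$1
  shift
  for host in "$@"; do
    if "$resolver" "$host" >/dev/null 2>&1; then
      echo "OK   $host"
    else
      echo "FAIL $host"
    fi
  done
}

# Example, run inside a container on the affected network:
#   check_names nslookup www.youtube.com test12345.s3.ap-northeast-1.amazonaws.com
```

Running it once with the default resolver and once after starting Colima with --dns 1.1.1.1 should show which names are affected.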

Version

Colima Version: Various
Lima Version:
Qemu Version:

Operating System

[X] macOS Intel
[X] macOS M1
[ ] Linux

Workarounds

Many people have reported in the comments that changing to the slirp network driver resolved the issue.

paihu commented 1 year ago

I have the same issue.

Reproduction Steps

docker network create test
docker run --rm -it --network test alpine
apk add --no-cache curl && curl test12345.s3.ap-northeast-1.amazonaws.com
docker network rm test

Steps That Do Not Reproduce

docker network create test
docker run --rm -it --network test alpine
apk add --no-cache curl && curl test1234.s3.ap-northeast-1.amazonaws.com
docker network rm test

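This reproduces when the FQDN is 41 characters or longer (the failing name above is 41 characters, the working one is 40) and the container is on a non-default Docker network.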
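
The two hostnames in the steps above differ by a single character, so a quick length check makes the threshold explicit. A minimal sketch; `fqdn_len` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the character length of a hostname using POSIX parameter expansion.
fqdn_len() {
  echo "${#1}"
}

for name in \
  test12345.s3.ap-northeast-1.amazonaws.com \
  test1234.s3.ap-northeast-1.amazonaws.com
do
  echo "$(fqdn_len "$name") $name"
done
# prints:
# 41 test12345.s3.ap-northeast-1.amazonaws.com
# 40 test1234.s3.ap-northeast-1.amazonaws.com
```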

Workaround

In my case, the following worked.

Open ~/.colima/default/colima.yaml

Edit network.driver:

network:
  driver: slirp
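
For scripted setups, the same edit can be made without opening an editor. This is a rough sketch, assuming the config keeps the layout shown above (a top-level `network:` key with an indented `driver:` line); `set_network_driver` is a hypothetical helper, not part of Colima.

```shell
#!/bin/sh
# set_network_driver FILE DRIVER
# Rewrites the "driver:" line that sits under the top-level "network:" key.
# "driver:" lines under any other top-level key are left untouched.
set_network_driver() {
  file=$1
  driver=$2
  awk -v drv="$driver" '
    /^[^ \t#]/ { in_net = ($0 ~ /^network:/) }   # entering a top-level key
    in_net && /^[ \t]+driver:/ { sub(/driver:.*/, "driver: " drv) }
    { print }
  ' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example:
#   set_network_driver "$HOME/.colima/default/colima.yaml" slirp
```

The helper only edits the file; Colima still has to be restarted for the change to apply.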

rfay commented 1 year ago

Added youtube.com to the list in OP

renatho commented 1 year ago

Added youtube.com to the list in OP

Notice that youtube.com worked for me, but www.youtube.com didn't work. 😉

rfay commented 1 year ago

Edited, thanks @renatho

abiosoft commented 1 year ago

If indeed using slirp as the network driver fixes it, this should be resolved by the next release v0.5.0.

Schrank commented 1 year ago

I can add sbp-plugin-binaries.s3.eu-west-1.amazonaws.com

abiosoft commented 1 year ago

I would like to know if this is still the case for v0.5.0.

paihu commented 1 year ago

Thanks.

Fixed in my environment.

adrienthebo commented 1 year ago

I've observed sporadic failures with golang.org; I'm running on a 2021 Mac M1 Silicon using the vz virtualization driver. This manifests when using the devcontainer cli to build workspace images.

$ yq '.network.driver' "$(colima template --print)"
gvproxy

$ colima version
colima version 0.5.2
git commit: 6b5b6fe0540e708f0c9d6e8919fab292c671fc72

runtime: docker
arch: aarch64
client: v23.0.1
server: v20.10.20

taylorchu commented 1 year ago

This is still not fixed in 0.5.4.

abiosoft commented 1 year ago

I got bitten by this today as well and I can confirm it only happens with gvproxy network.

It appears some DNS queries fail for whatever reason.

I am still investigating.

gpsa commented 1 year ago

Same here:

When I:

nslookup test.s3-website-us-east-1.amazonaws.com

Server:     192.168.107.1
Address:    192.168.107.1:53

Non-authoritative answer:

**server can't find test.s3-website-us-east-1.amazonaws.com: NXDOMAIN**

But if I use Google's 8.8.8.8:

nslookup  test.s3-website-us-east-1.amazonaws.com 8.8.8.8
Server:     8.8.8.8
Address:    8.8.8.8:53

Non-authoritative answer:
test.s3-website-us-east-1.amazonaws.com canonical name = s3-website.us-east-1.amazonaws.com

Non-authoritative answer:
test.s3-website-us-east-1.amazonaws.com canonical name = s3-website.us-east-1.amazonaws.com
Name:   s3-website.us-east-1.amazonaws.com
Address: 52.217.87.195
Name:   s3-website.us-east-1.amazonaws.com
Address: 52.216.27.3
Name:   s3-website.us-east-1.amazonaws.com
Address: 52.216.98.42
Name:   s3-website.us-east-1.amazonaws.com
Address: 52.216.243.67
Name:   s3-website.us-east-1.amazonaws.com
Address: 52.217.140.13
Name:   s3-website.us-east-1.amazonaws.com
Address: 52.216.57.53
Name:   s3-website.us-east-1.amazonaws.com
Address: 52.217.10.139
Name:   s3-website.us-east-1.amazonaws.com
Address: 52.217.137.173
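
When comparing many names across resolvers, it can help to reduce each nslookup run to a single word. The following is a sketch that assumes output shaped like the two runs above; `lookup_status` is a hypothetical helper.

```shell
#!/bin/sh
# lookup_status: read nslookup output on stdin and classify it.
# NXDOMAIN  - the output reports NXDOMAIN
# RESOLVED  - at least one "Address" line beyond the server's own address
# UNKNOWN   - anything else
lookup_status() {
  awk '
    /NXDOMAIN/ { nx = 1 }
    /^Address/ { n++ }
    END {
      if (nx) print "NXDOMAIN"
      else if (n >= 2) print "RESOLVED"
      else print "UNKNOWN"
    }'
}

# Example:
#   nslookup test.s3-website-us-east-1.amazonaws.com | lookup_status
#   nslookup test.s3-website-us-east-1.amazonaws.com 8.8.8.8 | lookup_status
```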

If I change the network driver to slirp, the problem becomes that host.docker.internal is resolved via /etc/hosts, but I need it to be resolved via DNS lookup:

nslookup host.docker.internal
Server:     127.0.0.11
Address:    127.0.0.11:53

** server can't find host.docker.internal: NXDOMAIN

** server can't find host.docker.internal: NXDOMAIN

abiosoft commented 1 year ago

If I change the network driver to slirp, the problem becomes that host.docker.internal is resolved via /etc/hosts, but I need it to be resolved via DNS lookup

@gpsa can you kindly open another issue for this? This is likely a bug.

gpsa commented 1 year ago

If I change the network driver to slirp, the problem becomes that host.docker.internal is resolved via /etc/hosts, but I need it to be resolved via DNS lookup

@gpsa can you kindly open another issue for this? This is likely a bug.

I could, but just to clarify, is the slirp driver expected to resolve host.docker.internal via DNS Lookup?

abiosoft commented 1 year ago

I could, but just to clarify, is the slirp driver expected to resolve host.docker.internal via DNS Lookup?

@gpsa I suspect your issue was changing the network driver of an existing VM.

This is what I get for slirp, it uses DNS lookup as well.

nslookup host.docker.internal
Server:     192.168.5.3
Address:    192.168.5.3:53

Non-authoritative answer:
Name:   host.docker.internal
Address: 192.168.5.2

Non-authoritative answer:

gpsa commented 1 year ago

I could, but just to clarify, is the slirp driver expected to resolve host.docker.internal via DNS Lookup?

@gpsa I suspect your issue was changing the network driver of an existing VM.

This is what I get for slirp, it uses DNS lookup as well.
nslookup host.docker.internal
Server:       192.168.5.3
Address:  192.168.5.3:53

Non-authoritative answer:
Name: host.docker.internal
Address: 192.168.5.2

Non-authoritative answer:

@abiosoft Is there a way to recreate it without destroying everything? I could check whether recreating it works.

abiosoft commented 1 year ago

@abiosoft Is there a way to recreate it without destroying everything? I could check whether recreating it works.

@gpsa yeah. It's a regression actually, used to work before. You can edit the /etc/resolv.conf file in the VM and set the nameserver IP to 192.168.5.3.

In fact, it is the only entry in the file so you can simply replace it

colima ssh -- sudo sh -c 'echo "nameserver 192.168.5.3" > /etc/resolv.conf'
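
To confirm the change took effect, the nameserver entries can be listed afterwards. A minimal sketch; `nameservers` is a hypothetical helper that reads resolv.conf content from the given files or from stdin.

```shell
#!/bin/sh
# nameservers [FILE...]
# Prints the address of every "nameserver" entry, one per line.
nameservers() {
  awk '$1 == "nameserver" { print $2 }' "$@"
}

# Example:
#   colima ssh -- cat /etc/resolv.conf | nameservers
```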

gpsa commented 1 year ago

@abiosoft Is there a way to recreate it without destroying everything? I could check whether recreating it works.

@gpsa yeah. It's a regression actually, used to work before. You can edit the /etc/resolv.conf file in the VM and set the nameserver IP to 192.168.5.3.

In fact, it is the only entry in the file so you can simply replace it
colima ssh -- sudo sh -c 'echo "nameserver 192.168.5.3" > /etc/resolv.conf'

@abiosoft thank you so much, that worked like a breeze. Now both internal Docker DNS and external domains work just fine on SLIRP.

henrik242 commented 1 year ago

Could the DNS issues somehow be related to Alpine?

From https://martinheinz.dev/blog/92:

Usually, you would not notice this difference, because most of the time a single UDP packet (512 bytes) is enough to resolve hostnames... until it isn't enough and your application (running on Kubernetes) that previously worked completely fine for months suddenly starts throwing "Unknown Host" exceptions for one particular (very critical) hostname. The worst part is that this can manifest randomly, anytime when some external network change causes the resolution of some particular domain to require more than the 512 bytes available in single UDP packet.
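
One way to check whether a particular name runs into the 512-byte limit described in the quote is to look for the truncation (tc) flag in a plain UDP answer. This is a sketch, assuming `dig` is installed and prints its usual `;; flags:` header line; `is_truncated` is a hypothetical helper.

```shell
#!/bin/sh
# is_truncated: read dig output on stdin and succeed if the response
# header carries the "tc" (truncated) flag.
is_truncated() {
  grep -Eq '^;; flags:[^;]* tc( |;)'
}

# Example: +noedns keeps the classic 512-byte UDP limit, and +ignore stops
# dig from retrying over TCP, so a truncated answer stays visible.
#   dig +noedns +ignore www.youtube.com | is_truncated && echo "truncated"
```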

abiosoft commented 1 year ago

Could the DNS issues somehow be related to Alpine?

@henrik242 I have actually read something similar before but I do not think this situation is related to Alpine, considering that slirp works fine.

As for why Alpine is the choice for Colima, you can check this comment https://github.com/abiosoft/colima/issues/291#issuecomment-1131229618.

gpsa commented 1 year ago

Could the DNS issues somehow be related to Alpine?

@henrik242 I have actually read something similar before but I do not think this situation is related to Alpine, considering that slirp works fine.

As for why Alpine is the choice for Colima, you can check this comment #291 (comment).

SLIRP mode is now "crashing" the same way Lima alone did. Basically, the mount points stop working, and on the host:

docker ps
Cannot connect to the Docker daemon at unix:///Users/user/.colima/default/docker.sock. Is the docker daemon running?

rfay commented 1 year ago

@gpsa you're making a bit of a mess of this issue. Could you please open one that's on-topic for your issues?

gpsa commented 1 year ago

@gpsa you're making a bit of a mess of this issue. Could you please open one that's on-topic for your issues?

Sorry about that. I've created a separate issue for the SLIRP problem.

AndreasA commented 1 year ago

When starting colima with the VZ vmtype and virtiofs and providing --dns 192.168.5.3, AWS hostname resolution seems to fail as well. Without it, resolution seems to work but results in the pulling speed issues described in https://github.com/abiosoft/colima/issues/648, no matter whether slirp or gvproxy is used. I think the network driver setting is probably ignored for the VZ vm type, though.

taylorchu commented 1 year ago

@abiosoft https://wiki.musl-libc.org/functional-differences-from-glibc.html

There are multiple reports of odd DNS incompatibilities between musl and glibc. I think it is safer to use a base image like Debian for this.

ryancurrah commented 1 year ago

I would like to use Debian as well to see if it resolves this issue for us. Is that possible?

gchait commented 1 year ago

After some messing around, this seems to be the fix:

colima delete
colima start --edit

Change gvproxy to slirp. With such a limitation/bug, I wonder why it's not the default.

mandrasch commented 1 year ago

If anyone wants to switch, the following should also be possible:

colima start --edit
# change value with "i" insert mode, switch to slirp
# save via ":wq"

Or edit ~/.colima/default/colima.yaml and re-start colima via colima stop and colima start.

No need for colima delete (as far as I know).

gchait commented 1 year ago

If anyone wants to switch, the following should also be possible:
colima start --edit
# change value with "i" insert mode, switch to slirp
# save via ":wq"
Or edit ~/.colima/default/colima.yaml and re-start colima via colima stop and colima start.

No need for colima delete (as far as I know).

For me, after simply restarting nothing seemed to be working. To be more specific, a docker build failed right at the beginning, because it could not even resolve registry-1.docker.io. It was an i/o timeout right there, suggesting all/most networking was broken in the VM. I got the idea for the delete from here.

mandrasch commented 1 year ago

Hi! I started with colima version 0.5.5 two months ago and changing the config + restart worked fine for me today (without deleting).

@rfay just mentioned in DDEV discord the following:

If you have had your colima instance through many updates, it's a worthwhile thing to delete it and recreate it. (After saving away databases of course via ddev snapshot -a)

So depends on how many updates happened in the meantime I guess?

AndreasA commented 1 year ago

Change gvproxy to slirp. With such a limitation/bug, I wonder why it's not the default.

Just wondering, but colima start --network-driver slirp should work as well, shouldn't it? It would be easier to use in a command for setup (no need to search/replace in the config file).

Though the last time I tried it, it made no difference with virtual machine type vz. I admit I did not delete the instance, so maybe that helps. I am not sure the network driver is even relevant for vz, but it is worth a try.

mandrasch commented 1 year ago

Just wondering, but colima start --network-driver slirp should work as well, shouldn't it? It would be easier to use in a command for setup (no need to search/replace in the config file).

Does this replace and save things in the current configuration before starting? Would be cool! (I'll try later, thanks for the hint.)

skirsdeda commented 1 year ago

I hit this while running a container which does a lot of AWS service requests. DNS resolution would fail after some time when using vz vm, then subsequent run would fail almost immediately and only colima restart helped to get more time without DNS failures. And with qemu and slirp network driver it was actually even worse. So I resorted to Docker Desktop which runs without problems. Sad.

admxxi commented 1 year ago

Same here. With vmType: vz and either the gvproxy or slirp network driver, I still get loads of errors when resolving DNS names; I would say about 50% of the requests fail.

AndreasA commented 12 months ago

Hi, just wondering: which lima version are you using? https://github.com/abiosoft/colima/issues/648 seems to be fixed (at least it looks that way so far) with the latest lima 0.18.x update, and it was related to DNS as well, so the update might also fix these issues.

jdmarshall commented 10 months ago

I'm still getting connection refused on 127.0.0.11

I don't have a local dns server on dev machines and I can't figure out what the solution is here. How do we avoid this?

The latest version of Colima doesn't even have a driver field in the yaml file and I'm still having this problem.

xuwhite commented 2 months ago

I'm still getting connection refused on 127.0.0.11

I don't have a local dns server on dev machines and I can't figure out what the solution is here. How do we avoid this?

The latest version of Colima doesn't even have a driver field in the yaml file and I'm still having this problem.

Same here, sadly. The only workaround that works for me is to add a DNS address with colima start --dns 8.8.8.8 or in the config file ~/.colima/default/colima.yaml. If the DNS changes, I have to restart colima with colima restart to make DNS work again; see #711.
