bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Unable to resolve local domains when upgrading Kubernetes from v1.27 to v1.28 #4217

Closed: rmdvb closed this issue 1 month ago

rmdvb commented 1 month ago

Image I'm using: Bottlerocket v1.21.1

What I expected to happen: To be able to resolve <company>.local domains. These are different from the AWS default domain, <region>.compute.internal.

settings:

[root@admin]# cat etc/resolv.conf
# This is /run/systemd/resolve/stub-resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
options edns0 trust-ad
search <region>.compute.internal
[root@admin]# sudo chroot /.bottlerocket/rootfs resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

Link 2 (eth0)
    Current Scopes: DNS
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 10.xx.xx.2
       DNS Servers: 10.xx.xx.2
        DNS Domain: <region>.compute.internal

Before the upgrade, when the nodes still used wicked, we were able to resolve <company>.local domains.

What actually happened: I'm not able to resolve <company>.local domains. I can, however, resolve other domains, including the local AWS domain <region>.compute.internal. This causes problems when pulling images from our private image repository.

How to reproduce the problem: This happened when upgrading Kubernetes to v1.28. As I understand it, this might be due to the change from wicked to systemd-resolved.

koooosh commented 1 month ago

Hello, thanks for reporting this issue. Can you please share your user data settings (or a redacted version if necessary)?

Specifically, I'm curious what your dns and network configurations look like.

One thing to note is that for variants using systemd-networkd (*-k8s-1.28-* and *-ecs-2-* and newer), resolv.conf exists at the path /run/systemd/resolve/resolv.conf, which can be accessed on the host using sheltie.

rmdvb commented 1 month ago

Hi @koooosh, Thanks for your reply! Currently we are using these settings:

[settings.pki.<company>-root-ca]
data="""<company-root-CA>"""
trusted=true
[settings.dns]
name-servers = ["10.xx.xx.2"]
search-list = ["<region>.compute.internal" , "<company>.local"]

At first this seems to work. However, when the node pulls images, it fails with the error: dial tcp: lookup <repository>.<company>.local: Temporary failure in name resolution. We see this on both new and older nodes. Occasionally the name does resolve and the container image is pulled, but this usually takes ~15 minutes per container.

rmdvb commented 1 month ago

We seem to have figured out the solution to our problem. Sharing it here as it might be useful for other users. The <repository>.<company>.local name is a CNAME to a name in another domain, <machine>.aws.local. systemd-resolved couldn't complete this second lookup because it didn't know which DNS server to query for that domain.

The fix for us was to update the search-list to: search-list = ["<region>.compute.internal" , "local"]
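
For anyone applying the same change, here is a sketch of what the full DNS block in user data would look like afterwards. It assumes the name server from the settings shared earlier stays the same and only the search list changes; the placeholders are the same redacted values as above.

```toml
# Sketch only: same [settings.dns] block as before, with the second
# search-list entry widened from "<company>.local" to "local".
[settings.dns]
name-servers = ["10.xx.xx.2"]
search-list = ["<region>.compute.internal", "local"]
```

Our understanding, which may not be the whole story, is that with "local" in the search list, any name ending in .local (including the CNAME target <machine>.aws.local) now matches a configured domain and is sent to the listed name server. Previously only names under <company>.local matched.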

yeazelm commented 1 month ago

Thanks for following up @rmdvb!