cirruslabs / orchard

Orchestrator for running Tart Virtual Machines on a cluster of Apple Silicon devices

Unable to resolve DNS via host's resolver #92

Closed ruimarinho closed 1 year ago

ruimarinho commented 1 year ago

Hi,

I hesitated to create this issue here, but given the unique setup of the base image, this problem might be specific to Orchard.

I'm trying to get VMs launched by orchard to be able to resolve DNS names via the host's own DNS resolver instead of the ones defined on the base image (8.8.8.8, etc). The reason for this requirement is that the host's DNS resolver is a special type of DNS server that is capable of resolving privately-resolved DNS entries (e.g. foo.bar.internal) inside the VPN. Anything that it can't resolve it sends to an upstream internet-facing server.

So the host has something like this under its /etc/resolv.conf:

nameserver 127.0.2.2
nameserver 127.0.2.3

When a VM boots, I'm updating its DNS servers to the host IP (192.168.64.1 via networksetup -setdnsservers Ethernet "Empty") using the startup launch script, but the VM is still unable to route traffic to the host's resolver.

I've launched an HTTP server on the host on a high port and I've confirmed there is connectivity between the host and the VM, but any attempt at doing nslookup google.com 192.168.64.1 times out.
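The timeout described above can be reproduced with a single, bounded query. The following is a minimal sketch, assuming `dig` is available inside the VM and that 192.168.64.1 is the host address as stated above. The function names `probe_dns` and `describe_dig_status` are hypothetical, and the exit status mapping follows the dig(1) man page:

```shell
# Translate dig's exit status into a short message.
# Per dig(1): 0 means a reply was received, 9 means no reply from the server.
describe_dig_status() {
  case "$1" in
    0) echo "reply received" ;;
    9) echo "no reply from server" ;;
    *) echo "dig failed with status $1" ;;
  esac
}

# Send one query with a 2-second timeout and no retries, then report the outcome.
# The default query name is only an example.
probe_dns() {
  server=$1
  name=${2:-google.com}
  dig "@$server" "$name" +time=2 +tries=1 +short >/dev/null 2>&1
  status=$?
  describe_dig_status "$status"
}
```

Running `probe_dns 192.168.64.1` inside the VM would be expected to print "no reply from server" while the problem is present. Note that a status of 0 only means that some reply arrived, not that the name resolved.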

Another option I've tried is running coredns on the host, bound only to the bridge100 network interface, and then forwarding all traffic to the localhost server. This approach works, but there is some fighting between coredns and the VPN DNS resolver, which ultimately results in a crash.

I suspect it's because the VPN DNS resolver is not listening on all interfaces (checked via sudo lsof -i -P | grep LISTEN), but ideally we could masquerade the VM NAT traffic to be able to reach the host's local DNS server.

Any help would be greatly appreciated.

edigaryev commented 1 year ago

At first glance, the issue doesn't seem to be related to Orchard or Tart, but rather to the macOS inner workings of Virtualization.Framework and mDNSResponder(8).

I'm afraid that more details are needed about the VPN solution used and its configuration to at least reproduce your issue.

Looking into tcpdump/Wireshark for the network interfaces in question may give some clues.
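A capture on the VM bridge can be narrowed to DNS and ICMP traffic. The sketch below assumes the bridge is bridge100 (the interface mentioned earlier in this thread), which may differ on other hosts. The helper name `count_dns_unreachable` is hypothetical. One pattern worth looking for is an ICMP port-unreachable reply to the VM's DNS query:

```shell
# Count ICMP "udp port 53 unreachable" messages in tcpdump's text output (read from stdin).
# This matches the numeric form that tcpdump prints when run with -n.
count_dns_unreachable() {
  grep -c 'udp port 53 unreachable'
}

# Example capture (run on the host; the interface name is an assumption):
#   sudo tcpdump -l -n -i bridge100 -c 20 'udp port 53 or icmp' | count_dns_unreachable
```

A non-zero count would suggest that queries reach the host but nothing is listening on port 53 at that address.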

Also, check out https://github.com/cirruslabs/tart/issues/473#issuecomment-1516580781: when the VPN solution uses macOS APIs, Tart VMs can access these VPN networks automatically; otherwise, additional changes with pfctl(8) are required.

A wild guess is that using the 127.0.0.0/8 range is what's causing the problems; perhaps switching to Private-Use networks might solve the issue.

ruimarinho commented 1 year ago

I think you should be able to replicate this quickly using Cloudflare WARP (https://1.1.1.1/).

edigaryev commented 1 year ago

I've checked out Cloudflare WARP and I'm afraid that it's the cause of the DNS resolution problems you're seeing.

Here's a simple experiment to show that:

  1. Reboot macOS, just in case, to reset the networking state that might've been previously altered by Cloudflare WARP. Make sure that Cloudflare WARP won't get enabled automatically after reboot.
  2. After booting, you'll observe the following ports opened by mDNSResponder(8) (this is the daemon that enables DNS resolution for Virtualization.Framework VMs):
    # sudo lsof -p $(pgrep -u _mdnsresponder mDNSResponder)  | grep UDP
    mDNSRespo 459 _mdnsresponder    6u     IPv4 0x91c412df74026bd9      0t0                 UDP *:mdns
    mDNSRespo 459 _mdnsresponder    7u     IPv6 0x91c412df74026fd9      0t0                 UDP *:mdns
  3. Start any Tart VM (with no specific tart run arguments), and then check the ports opened by mDNSResponder(8) again:

    # sudo lsof -p $(pgrep -u _mdnsresponder mDNSResponder)  | grep UDP
    mDNSRespo 459 _mdnsresponder    6u     IPv4 0x91c412df74026bd9      0t0                 UDP *:mdns
    mDNSRespo 459 _mdnsresponder    7u     IPv6 0x91c412df74026fd9      0t0                 UDP *:mdns
    mDNSRespo 459 _mdnsresponder   42u     IPv4 0x91c412df7454b3d9      0t0                 UDP *:domain
    mDNSRespo 459 _mdnsresponder   43u     IPv6 0x91c412df7402f7d9      0t0                 UDP *:domain

    You can see that it got reconfigured, and this effectively enables DNS resolution for VMs.

  4. Now, for the interesting part: enable Cloudflare WARP. It won't work, saying that "port 53 is bound":
    [Screenshot 2023-06-23 at 16 28 24]
  5. Stop the VM and enable Cloudflare WARP again; this time it will work:
    [Screenshot 2023-06-23 at 16 29 10]
  6. Now start the VM again; mDNSResponder(8)'s UDP sockets will look like this:

    # lsof -p $(pgrep -u _mdnsresponder mDNSResponder)  | grep UDP
    mDNSRespo 490 _mdnsresponder    6u     IPv4 0x1ed4587dbcccf7af      0t0                 UDP *:mdns
    mDNSRespo 490 _mdnsresponder    7u     IPv6 0x1ed4587dbcccfbaf      0t0                 UDP *:mdns
    mDNSRespo 490 _mdnsresponder   45u     IPv4 0x1ed4587dba354baf      0t0                 UDP *:*
    mDNSRespo 490 _mdnsresponder   46u     IPv6 0x1ed4587dba354faf      0t0                 UDP *:*

    This is obviously broken, because the VM won't be able to reach mDNSResponder(8) on port 53 with such a configuration.

    If you try to resolve something inside of a VM now, you will observe the following packets in tcpdump:

    16:19:56.570536 IP 192.168.64.1 > 192.168.64.13: ICMP 192.168.64.1 udp port 53 unreachable, length 36
  7. Even more interesting: after disabling Cloudflare WARP, if you try to start any Tart VM, you'll get the following Virtualization.Framework error:

    virtual machine's network attachment <VZNetworkDevice: 0x6000037962b0> has been disconnected with error: Error Domain=VZErrorDomain Code=1 "Internal Network Error." UserInfo={NSLocalizedFailure=Internal Virtualization error., NSLocalizedFailureReason=Internal Network Error.}
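The socket check used in the steps above can be scripted. This is a minimal sketch, assuming lsof prints the DNS port either as `*:domain` (as in the listings above) or numerically as `*:53`. The function name `has_dns_socket` is hypothetical:

```shell
# Read lsof output on stdin and succeed only if a UDP socket bound to the
# DNS port on all addresses ("*:domain" or "*:53") is present.
has_dns_socket() {
  grep -Eq 'UDP[[:space:]]+\*:(domain|53)[[:space:]]*$'
}

# Example (run on the host while a VM is running):
#   sudo lsof -p "$(pgrep -u _mdnsresponder mDNSResponder)" | has_dns_socket \
#     && echo "mDNSResponder is listening on port 53" \
#     || echo "no port 53 socket found"
```

With the listing from step 3 the check succeeds. With the listing taken after Cloudflare WARP was enabled, where the sockets show `*:*`, it fails.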

I'm not sure how to easily fix this, because both components, macOS (with its mDNSResponder(8)) and Cloudflare WARP, are not open source, but you can use this post's content to report the issue to Cloudflare if you want.

edigaryev commented 1 year ago

Closing because there's nothing we can do to fix the problem on our side.