docker / for-mac

Bug reports for Docker Desktop for Mac
https://www.docker.com/products/docker#/mac

17.03.0-ce-mac2 changed DNS proxy for VPN split tunnel behavior. #1397

Closed matthewbarr closed 7 years ago

matthewbarr commented 7 years ago

Expected behavior

In 17.03.0-ce-mac2, a container doing a DNS lookup should receive the same answer for a given name as a DNS request from the host Mac when running a split tunnel VPN. This was the behavior in 1.13.1, prior to the upgrade.

The expectation is that both the host Mac and Docker containers use the same DNS server, even when on the VPN.

Actual behavior

In Docker for Mac 1.13.1, container DNS lookups went to the DNS servers supplied by the VPN. In 17.03.0-ce-mac2, containers on a split tunnel VPN (Pulse Secure) get answers from external DNS servers instead of the internal ones.

Docker containers can ping and perform lookups against the internal DNS server, but are not using it as default DNS resolver.

This was all done with the VPN up the entire time, including before HyperKit / Moby was launched. I also attempted a restart of Moby to see if that would help.

Information

Docker for Mac: version: 17.03.0-ce-mac2 (1d7d97bbb)
macOS: version 10.11.6 (build: 15G1217)
logs: /tmp/33D5EBFD-8496-474C-9D81-17B8E56BA052/20170306-144946.tar.gz
[OK] vmnetd
[OK] dns
[OK] driver.amd64-linux
[OK] virtualization VT-X
[OK] app
[OK] moby
[OK] system
[OK] moby-syslog
[OK] db
[OK] env
[OK] virtualization kern.hv_support
[OK] slirp
[OK] osxfs
[OK] moby-console
[OK] logs
[OK] docker-cli
[OK] menubar
[OK] disk

33D5EBFD-8496-474C-9D81-17B8E56BA052

Steps to reproduce the behavior

For use on the example.com internal split tunnel VPN:

On Mac:

  1. dig foo.example.com returns 172.27.1.10 (Server: 172.27.1.2)

On Docker container:

  1. dig foo.example.com returns 121.1.1.5 (Server: 192.168.65.1)

Yes, I recognize that 192.168.65.1 is likely somewhere in the Docker for Mac ecosystem, but I can't seem to find where that IP is bound. My guess is it's a proxy for the Mac, but it seems to be sending the request to the wrong DNS server.

Is there any way to get more information on what the DNS proxy is using internally?

djs55 commented 7 years ago

Thanks for the report. You're right that 192.168.65.1 is an internal proxy IP. To see the configuration try

$ cd ~/Library/Containers/com.docker.docker/Data/database/
$ git reset --hard
HEAD is now at a78143e last-start-time changed at 1488835201
$ cat com.docker.driver.amd64-linux/slirp/dns
# { Addresses: 192.168.1.2; Order: 200000; Zones:  }
nameserver 192.168.1.2
order 200000
timeout 2000

One possible difference is that the interpretation of SupplementalMatchDomains in the Apple System Configuration database was tweaked-- perhaps this had some bad side-effects.

matthewbarr commented 7 years ago

Hmm.. It had a few copies of different versions of the DNS, and I suspect it was choosing a different set. I cleared it up, and committed it. All seems OK now.

What kind of things trigger a recreation or update of the file? We already use some edits to insert client certificates into the Moby VM, so this isn't a huge issue for us to fix. I just want to know what we're going to run into when we roll this out to other users, and be prepared for the support load.

Is there any documentation on the slirp tools? It's rather hard to search for, given its history as a precursor to PPP :)

djs55 commented 7 years ago

@matthewbarr could you describe the changes you made to the file to make it work for you? I'd like to see how the logic that generates it can be improved. Currently we scan through

We combine these into the slirp/dns file.

Unfortunately the file is regenerated when the Mac's network settings change in System Preferences and when you join a new wifi network. To make life easier I think I'll make it possible to permanently override the settings for advanced users: something like a slirp/override/dns which takes precedence over the automatically detected setting.

There's no documentation yet for this stuff -- the original hope was to make it "just work" for most people. However it sounds like there will always be some environments where a manual override -- and therefore documentation for it -- is necessary.

Although the slirp concept is quite old, the tools we're using are quite new: the code is in the https://github.com/docker/vpnkit project and heavily uses networking libraries from the MirageOS project.

Thanks again for your report!

matthewbarr commented 7 years ago

What's interesting is that I'm not seeing the slirp/dns file change, even when switching Wifi, or when I disconnect from the VPN.

I do see the output of echo show State:/Network/Global/DNS | scutil change when I turn off the VPN, and when I switch Wifi. /etc/resolv.conf updates, but I don't have /etc/resolver.

I don't so much want to override the settings, but it doesn't sound like I'm seeing the planned behavior. It should just work, which would be great, but the last change time for slirp/dns (before I manually edited it) was mid-February. I take my laptop home and use it on my home network quite often, and have restarted the laptop a few times. I've also updated Docker for Mac there and restarted Moby many times.

So it sounds like we might need to troubleshoot the update mechanism.

FYI: here's the old slirp/dns file:

# { Addresses: 172.27.112.15, 172.27.2.20; Order: 200000; Zones:  }
nameserver 172.27.112.15
order 200000
timeout 2000
nameserver 172.27.2.20
order 200000
timeout 2000
# { Addresses: 23.216.52.9, 23.216.53.9; Order: 200000; Zones:  }
nameserver 23.216.52.9
order 200000
timeout 2000
nameserver 23.216.53.9
order 200000
timeout 2000
# { Addresses: 172.27.112.15, 172.27.2.20; Order: 200000; Zones:  }
nameserver 172.27.112.15
order 200000
timeout 2000
nameserver 172.27.2.20
order 200000
timeout 2000

search foo.example.com example.com
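
With this much duplication it can help to collapse the file to its distinct servers before reasoning about which one gets picked. The following is only a minimal sketch and not part of Docker for Mac: list_nameservers is a hypothetical helper name, and it assumes nothing beyond the nameserver lines in the format shown above.

```shell
# Hypothetical helper (not part of Docker for Mac or vpnkit).
# Reads a slirp/dns-style file on stdin and prints each distinct
# nameserver in order of first appearance, followed by the number
# of times it was listed.
list_nameservers() {
  awk '$1 == "nameserver" {
         if (!($2 in count)) order[++n] = $2   # remember first-seen order
         count[$2]++
       }
       END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }'
}

# Usage (path as given earlier in this thread):
#   list_nameservers < com.docker.driver.amd64-linux/slirp/dns
```

For the file above, this would list 172.27.112.15 and 172.27.2.20 with a count of 2 each, and the two 23.* servers with a count of 1 each.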

matthewbarr commented 7 years ago

Just looked again, and I am seeing the git DB updating the DNS. But it is still generating the same set of 3 address sets as above, and for some reason, is prioritizing the 23.* name servers.

I was (wrongly) assuming it would edit the file on disk, but datakit just commits the change directly.

It does pick up VPN changes, and everything.

I can see what it's doing now: it's grabbing all the DNS entries, and merging them into one file.

I'm seeing them in:

echo show State:/Network/Global/DNS | scutil
echo show State:/Network/Service/10EB183F-E391-4EED-9D07-6D39E4F9B69F/DNS | scutil
echo show State:/Network/Service/net.juniper.pulse.nc.main/DNS | scutil

What I'd expect the behavior to be is to use only the contents of key: State:/Network/Global/DNS.

For some reason, it's always using the middle stanza from the file, which is the 23.* addresses.

xeger commented 7 years ago

I'm falling into a similar state as @matthewbarr. Interface-specific scutil settings are bad, whereas my global settings are what I would prefer to use.

echo list | scutil | grep DNS

  subKey [9] = Setup:/Network/Service/17D3AE8C-3B27-4FFC-B5D4-AC05B4810BC4/DNS
  subKey [15] = Setup:/Network/Service/20C319D5-60E0-4190-85E4-56ECDE449852/DNS
  subKey [39] = State:/Network/Global/DNS
  subKey [67] = State:/Network/MulticastDNS
  subKey [69] = State:/Network/PrivateDNS
  subKey [71] = State:/Network/Service/17D3AE8C-3B27-4FFC-B5D4-AC05B4810BC4/DNS
  subKey [75] = State:/Network/Service/20C319D5-60E0-4190-85E4-56ECDE449852/DNS
  subKey [79] = State:/Network/mDNSResponder/DebugState

echo show State:/Network/Global/DNS | scutil

This would be perfectly acceptable:

<dictionary> {
  DomainName : RightScale.com
  SearchDomains : <array> {
    0 : test.rightscale.com
    1 : rightscale.com
  }
  ServerAddresses : <array> {
    0 : 8.8.8.8
    1 : 8.8.4.4
  }
}

echo show State:/Network/Service/17D3AE8C-3B27-4FFC-B5D4-AC05B4810BC4/DNS | scutil

This is causing an issue:

<dictionary> {
  DomainName : RightScale.com
  ServerAddresses : <array> {
    0 : 10.10.1.4
    1 : 8.8.8.8
    2 : 4.2.2.1
    3 : 4.2.2.2
  }
}
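
To compare the server lists of several keys without reading each dictionary by eye, the ServerAddresses array can be pulled out of the scutil output. This is a minimal sketch that assumes the dictionary layout shown above; server_addresses is a hypothetical helper name.

```shell
# Hypothetical helper: reads the output of an scutil "show" command on
# stdin and prints one address per line from the ServerAddresses array.
# Assumes the "<index> : <address>" layout shown in this thread.
server_addresses() {
  awk '/ServerAddresses/ { inside = 1; next }   # array starts here
       inside && /}/     { inside = 0 }         # closing brace ends it
       inside            { print $3 }           # "0 : 8.8.8.8" -> 8.8.8.8
      '
}

# Usage on macOS:
#   echo "show State:/Network/Global/DNS" | scutil | server_addresses
```

Running it once for the Global key and once for each Service key makes it easier to see which keys disagree.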

xeger commented 7 years ago

Workaround (YMMV)

Assuming you want to retain only the Global DNS settings and not interface-specific DNS -- i.e. assuming that you have no VPN connection -- then you can run this script every time your Mac returns from sleep:

#! /bin/sh -e

baddies=`echo list | scutil | grep -E 'State:.+/DNS' | grep -v Global | cut -d' ' -f6`

for b in $baddies
do
  echo "Removing DNS state for $b"
  echo "remove $b" | sudo scutil
done

You can customize it to target only the settings you "want" to keep. (Thanks @matthewbarr for pointing out the sleep issue.)
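
Before removing anything, it may be worth listing which keys would be affected. The sketch below is a dry-run variant, assuming that the output of echo list | scutil looks like the listing earlier in this thread. service_dns_keys is a hypothetical helper name, and unlike the script above it selects only per-service State keys instead of excluding Global.

```shell
# Hypothetical dry-run helper: reads "echo list | scutil" output on stdin
# and prints only the per-service State DNS keys. It removes nothing.
service_dns_keys() {
  awk '$NF ~ /^State:\/Network\/Service\/.*\/DNS$/ { print $NF }'
}

# Usage on macOS:
#   echo list | scutil | service_dns_keys
```

Because it matches on the last field of each line, it does not depend on the number of leading spaces the way cut -d' ' -f6 does.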

matthewbarr commented 7 years ago

I'm afraid those are going to be reset as soon as you connect to a new network, or come back from sleep.

I really do just want to use the Global DNS key, but some clarity on how slirp is picking its stanza would help.

I don't have an older install handy, but is there a change from the prior behavior?

xeger commented 7 years ago

Nor do I (have an older install), but I first noticed this problem after upgrading to Docker for Mac 17.03.0-ce-mac2 sometime over the weekend, and the behavior of my image -- testing at startup whether a certain hostname resolves -- hasn't changed in months.

So, I'm guessing that slirp has recently changed something about the way it consumes/prioritizes scutil output.

(Aside: for an Internet old-timer, "slirp" has some amusing connotations: https://en.wikipedia.org/wiki/Slirp)

justincormack commented 7 years ago

@xeger maybe the Wikipedia article should be amended "to the date there is no competing implementation available." - now there is...

matthewbarr commented 7 years ago

And I remember using SLIRP, before getting a PPP connection...

I think in most cases using only (or prioritizing highly) the global is probably the correct answer. It's not like a routing table, where you can determine which path is best.

xeger commented 7 years ago

VPNKit is there to serve road warriors, and there will be cases where a VPN-related interface pushes routes and DNS settings that are useful to DfM's proper functioning in that environment -- but in cases where "I know better" (and have overridden OS X to use the global settings only), ideally DfM would follow that cue, or provide me a way to enforce the same override in its own settings.

matthewbarr commented 7 years ago

@djs55 @MagnusS - what can I get for you to help with this?

While I can fix it for any given network session, the DNS settings are updated every time anything on the networking front changes.

I don't know how the proxy picks which stanza to use, but there is a lot of duplication, and I'd imagine the best default would be the global DNS key.

djs55 commented 7 years ago

@matthewbarr I'm working on the DNS code at the moment. If I were to make slightly experimental builds over the next few days, would you be able to try them in your environment for me?

matthewbarr commented 7 years ago

I was worried because I saw @MagnusS close things that were still tagged more info needed.

Happy to help. I can absolutely test experimental builds.

If this was on the Linux kernel side, I'd see if our kernel devs could help, but we don't have any Mac devs locally.

I'm also curious where the proxy slirp code lives, in the code base.

MagnusS commented 7 years ago

@matthewbarr we close issues that are marked 0-more-info-needed after 2 weeks without a response. I'll leave this issue open :-)

djs55 commented 7 years ago

@matthewbarr I've made 2 experimental builds of the networking component. If you're still able to try them, then first make a backup of the current version:

$ cp /Applications/Docker.app/Contents/MacOS/com.docker.slirp /Applications/Docker.app/Contents/MacOS/com.docker.slirp.backup

(that assumes you've installed the app in /Applications/Docker.app)

The first experiment fixes some bugs in the server selection algorithm and is downloaded from:

$ wget https://673-58395340-gh.circle-artifacts.com/0/Users/distiller/vpnkit/vpnkit.tgz
$ shasum vpnkit.tgz
055b2aaa83e64430b6ad2482b33ebe67f4b9fe83  vpnkit.tgz

It should be unpacked and installed like this:

$ tar -xvzf vpnkit.tgz
x Contents/
x Contents/MacOS/
x Contents/Resources/
x Contents/Resources/lib/
x Contents/MacOS/vpnkit
$ cp Contents/MacOS/vpnkit /Applications/Docker.app/Contents/MacOS/com.docker.slirp

Then the app should be restarted.

The second experiment uses a completely different system API to resolve names. It can be downloaded from here:

$ wget https://675-58395340-gh.circle-artifacts.com/0/Users/distiller/vpnkit/vpnkit.tgz
$ shasum vpnkit.tgz.1
c82698a8f5947aec2a67fdb81a9ac504c4ac23d0  vpnkit.tgz.1

(note this created vpnkit.tgz.1 because the filename vpnkit.tgz was already taken)

Unpack and install in the same way:

$  tar -xvzf vpnkit.tgz.1
x Contents/
x Contents/MacOS/
x Contents/Resources/
x Contents/Resources/lib/
x Contents/MacOS/vpnkit
$ cp Contents/MacOS/vpnkit /Applications/Docker.app/Contents/MacOS/com.docker.slirp

Then the app should be restarted. Please let me know if either of these builds makes any difference in your environment!

For reference the networking code on the Mac is: https://github.com/docker/vpnkit -- it uses a number of the networking libraries from the MirageOS project.

Thanks for all your help so far!

matthewbarr commented 7 years ago

First build in place, and doesn't seem to change anything. It's picking the DNS server from my wifi interface, vs the Global, or the net.juniper.pulse.nc.main interface (VPN client). VPN & Global are the same.

This is home, vs prior testing at work. (And yes, our work Wifi requires a VPN to actually be treated as inside the corp network. ) Same exact behavior, just different DNS servers and wifi networks, etc.

Next, will test 2nd build.

matthewbarr commented 7 years ago

2nd build for the win!

2nd build seems to have changed the result, to what I'd expect. I'm getting results from the DNS server that's in the Global & VPN key.

If there is a way to poll the proxy for better insight into its behavior, beyond just doing a query, I'm all ears to give you better data. Thankfully, my current method seems to actually be accurate, but I'm all for better testing methodologies.

matthewbarr commented 7 years ago

Looking at the pull request, I'm not sure if the triggers you mention will result in the desired behavior. I don't have an empty slirp/dns, we just seem to be prioritizing the wrong resolver set.

djs55 commented 7 years ago

@matthewbarr the 2nd build had a different resolution mechanism as an experiment -- it actually ignored the slirp/dns file altogether and funnelled all queries through the host's getaddrinfo API. The advantage of this is that it's very simple and the results should be the same as those seen by regular apps. If this approach works more generally then we could move away from tracking the individual upstream servers and interpreting the keys in the SC database i.e. no need for a slirp/dns file at all.

One downside of getaddrinfo is that it only allows querying of A and AAAA records. I'm currently experimenting with the macOS-specific DNSService* APIs which seem capable of resolving arbitrary resource record types.
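
One practical consequence for host-side testing: dig on macOS sends its queries to the servers in /etc/resolv.conf rather than going through the system resolver, so it may not reflect what regular apps see. The sketch below is hedged accordingly. It assumes that dscacheutil -q host -a name <hostname> goes through the system resolver and prints ip_address: and ipv6_address: lines, and extract_addresses is a hypothetical helper name.

```shell
# Hypothetical helper: reads dscacheutil host-query output on stdin and
# prints the resolved addresses, one per line.
# Assumption: result lines look like "ip_address: 172.27.1.10" or
# "ipv6_address: ::1".
extract_addresses() {
  awk '$1 == "ip_address:" || $1 == "ipv6_address:" { print $2 }'
}

# Usage on macOS, for comparison with a lookup made inside a container:
#   dscacheutil -q host -a name foo.example.com | extract_addresses
```

Like getaddrinfo, this only covers address lookups, so it says nothing about other resource record types.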

My plan is to integrate this code but have it off by default at first in beta/edge, to be cautious. My plan was to interpret the slirp/dns key as follows:

Assuming this goes well, I was hoping to make the new behaviour the default. If it doesn't go well then I'll try to fix the server list.

I hope this makes sense!

matthewbarr commented 7 years ago

Yes, I'd read the pull request, and I was just noting that our slirp/dns files are not empty of servers.

What I haven't seen is how it's determining which of the server stanzas in slirp/dns to pick to use as the resolver. I think it should default to the data from the Global key, vs an interface specific key, unless there's some supplemental domain stuff going on.

gavinbunney commented 7 years ago

Can confirm that the second build (with the System API change) fixes the VPN issues I've been experiencing as well 👍

markterm commented 7 years ago

@djs55 has this been integrated into Edge yet? If so how can we turn it on? Thanks :)

djs55 commented 7 years ago

@mark9white the new DNS resolver code (which uses the Mac's resolver APIs) is now default in 17.06. Could you give it a go and let me know what you think? There's one known problem -- .local domains don't work -- which is due to be fixed in 17.06.1 later this week.

Since the code is switched on and believed to work, I'll close this issue. If it doesn't work for you, could you let me know and I'll reopen. Thanks!

markterm commented 7 years ago

ah, the domain is a .local - thanks!

docker-robott commented 4 years ago

Closed issues are locked after 30 days of inactivity. This helps our team focus on active issues.

If you have found a problem that seems similar to this, please open a new issue.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows. /lifecycle locked