docker / for-mac

Bug reports for Docker Desktop for Mac
https://www.docker.com/products/docker#/mac

4.19.0 breaks vpnkit and some networking functionality #6825

Closed omgftw closed 1 year ago

omgftw commented 1 year ago

Expected behavior

vpnkit should start. It is working properly on 4.18.0

Actual behavior

vpnkit fails to start and some host networking no longer works (e.g., reaching the AWS EC2 metadata endpoint from within a container). I have reproduced this across multiple users and machines (work and personal), on x86, M1, and M2 hardware.

Information

Output of /Applications/Docker.app/Contents/MacOS/com.docker.diagnose check

...
[FAIL] DD0014: are the backend processes running? 1 error occurred:
        * com.docker.vpnkit is not running
...
[FAIL] DD0009: is the vpnkit API responding? dial unix vpnkit.diag.sock: connect: connection refused
...
1 : The test: are the backend processes running?
    Failed with: 1 error occurred:
        * com.docker.vpnkit is not running
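When comparing diagnose runs across many machines, it can help to pull out only the failing checks. The following is a minimal sketch that assumes the line format shown in the output above (`[STATUS] DDnnnn: text`); the function name `failing_checks` is hypothetical and not part of any Docker tooling.

```python
import re

# Matches lines such as "[FAIL] DD0014: are the backend processes running? ..."
# The four statuses are the ones that appear in the outputs in this thread.
CHECK_LINE = re.compile(r"^\[(PASS|FAIL|WARN|SKIP)\] (DD\d+): (.*)$")


def failing_checks(output):
    """Return (status, check_id, text) tuples for FAIL and WARN lines."""
    results = []
    for line in output.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match and match.group(1) in ("FAIL", "WARN"):
            results.append(match.groups())
    return results
```

Feeding it the captured text of `com.docker.diagnose check` should return one tuple per failed or warned check; continuation lines such as `* com.docker.vpnkit is not running` are ignored.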

Steps to reproduce the behavior

No specific steps. On every machine so far it occurs 100% of the time with 4.19.0, across Docker and computer restarts. If I can provide any additional information, please let me know.

christophermclellan commented 1 year ago

Hi @omgftw, sorry about this. In v4.19 we've replaced vpnkit with gvisor for faster networking performance. You can still use vpnkit by adding "networkType": "vpnkit" to your settings.json file (located at ~/Library/Group Containers/group.com.docker/settings.json). Let us know if there are any further issues.
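For anyone who needs to apply this setting on many machines, here is a minimal sketch of the edit in Python. It assumes `networkType` is a top-level key in settings.json, as the comment above suggests; the function name `set_network_type` and the `.bak` backup file are illustrative choices, not something Docker Desktop provides.

```python
import json
from pathlib import Path

# Path quoted in the comment above; adjust if your installation differs.
SETTINGS_PATH = Path.home() / "Library/Group Containers/group.com.docker/settings.json"


def set_network_type(settings_path, network_type="vpnkit"):
    """Set the top-level "networkType" key and return its previous value (or None)."""
    path = Path(settings_path)
    original_text = path.read_text(encoding="utf-8")
    settings = json.loads(original_text)
    previous = settings.get("networkType")
    settings["networkType"] = network_type
    # Keep a copy of the untouched file next to the original before overwriting.
    path.with_name(path.name + ".bak").write_text(original_text, encoding="utf-8")
    path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return previous
```

Calling `set_network_type(SETTINGS_PATH)` rewrites the file in place. Docker Desktop presumably needs to be quit and started again before the change takes effect.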

cc @djs55

djs55 commented 1 year ago

@omgftw Regarding

some host network functionality no longer functions (ex: reaching the AWS EC2 metadata endpoint from within a container).

Could you provide a repro example to demonstrate the problem?

omgftw commented 1 year ago

@djs55 Sure thing.

Steps to reproduce the behavior

The EC2 metadata issue can be easily mocked and reproduced as follows.

sudo ifconfig lo0 alias 169.254.169.254 # create an interface alias
sudo python3 -m http.server --bind 169.254.169.254 80 # start an http server listening on the metadata IP
docker run --rm alpine/curl curl http://169.254.169.254 # or whatever method you prefer

Cleanup: remove the alias (optional, since the alias should not persist across reboots)

sudo ifconfig lo0 -alias 169.254.169.254

4.18.0: works
4.19.0: curl: (7) Failed to connect to 169.254.169.254 port 80 after 4 ms: Connection refused
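The same check can be scripted from inside a container image that has Python available. This is a minimal sketch using only the standard library; the function name `probe` and its return convention are assumptions for illustration.

```python
import urllib.error
import urllib.request


def probe(url, timeout=2.0):
    """Return the HTTP status code for url, or None if no HTTP response arrives."""
    # An empty ProxyHandler bypasses any proxy configured in the environment,
    # so the probe exercises the direct network path.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success status.
        return err.code
    except OSError:
        # Covers connection refused, timeouts, and DNS failures
        # (urllib.error.URLError is a subclass of OSError).
        return None
```

With the mock server from the steps above running, `probe("http://169.254.169.254")` would be expected to return a status code on 4.18.0 and `None` on an affected 4.19.0 install.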

This is actually a pretty major breaking change for us. We, essentially, use a mock IMDS endpoint to authenticate and fetch short-lived IAM credentials to allow our apps to authenticate in dev, so we can use the same auth/permissions mechanisms from dev to prod.

Hoping this is a gvisor misconfiguration or bug that can be resolved, as otherwise we'll have to update hundreds of engineers' configurations to switch back to vpnkit. In that event, is there any guarantee of long-term vpnkit support, or is this a first pass at deprecation?

djs55 commented 1 year ago

@omgftw thanks for the clear repro steps. It looks like it is a simple gvisor misconfiguration. I'm making a build with a candidate fix and will share it here when it's ready. Sorry for the disruption this is causing!

djs55 commented 1 year ago

@omgftw notarised builds with a candidate fix:

When I install this build on my machine and run the python HTTP server I see:

% docker run --rm alpine/curl curl http://169.254.169.254
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0<!DOCTYPE HTML>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Directory listing for /</title>
</head>
<body>
<h1>Directory listing for /</h1>
<hr>
...

omgftw commented 1 year ago

@djs55 That's great to hear and no problem at all. We just advised people internally to hold off on 4.19.0.

Wow, that was fast. I have validated on my end as well and can no longer reproduce the problem. Things are working as expected. The only thing I'd note (which you're probably already aware of) is that the com.docker.diagnose tool should likely be updated to reflect this change.

All set as far as this issue is concerned. Would you like me to close it? (feel free to close it yourself if you would prefer)

Really appreciate your help!

djs55 commented 1 year ago

@omgftw thanks for checking the fix. Regarding the com.docker.diagnose issue: there should be a fix for that in 4.20. I propose to leave this issue open and then we'll close it officially when the release with the fix ships. In the meantime let me know if you spot any other networking issues!

sandstrom commented 1 year ago

We're also having connectivity issues on 4.19. Works on 4.18.

Our scenario

We're running a caddy instance as a proxy to other containers.

Here is a slimmed down version of our caddy file, in case it helps.

Traffic reaches Caddy, but the reverse_proxying doesn't work.

We get an error from caddy:

dial tcp 192.168.65.2:5510: connect: connection refused

Full caddy error message:

{"level":"error","ts":1683219371.3035526,"logger":"http.log.error.log0","msg":"dial tcp 192.168.65.2:5510: connect: connection refused","request":{"remote_addr":"10.15.0.1:34670","proto":"HTTP/2.0","method":"GET","host":"hello.dev","uri":"/","headers":{"Cache-Control":["no-cache"],"Sec-Ch-Ua-Platform":["\"macOS\""],"Accept":["text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"],"User-Agent":["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"],"Cookie":["_lfa=LF1.1.d4799e2bdbff4068.1678907466916; _ga=GA1.2.1091812264.1680003830"],"Sec-Ch-Ua":["\"Google Chrome\";v=\"113\", \"Chromium\";v=\"113\", \"Not-A.Brand\";v=\"24\""],"Upgrade-Insecure-Requests":["1"],"Accept-Encoding":["gzip, deflate, br"],"Accept-Language":["en-GB,en-US;q=0.9,en;q=0.8,sv;q=0.7"],"Pragma":["no-cache"],"Sec-Ch-Ua-Mobile":["?0"],"Sec-Fetch-Site":["none"],"Sec-Fetch-Mode":["navigate"],"Sec-Fetch-User":["?1"],"Sec-Fetch-Dest":["document"]},"tls":{"resumed":false,"version":772,"cipher_suite":4865,"proto":"h2","proto_mutual":true,"server_name":"hello.dev"}},"duration":0.000671688,"status":502,"err_id":"168rae3tz","err_trace":"reverseproxy.statusError (reverseproxy.go:886)"}

Caddy file

A simplified version of our caddy file:

{
  auto_https disable_redirects
  ocsp_stapling off
}

(shared) {
  log {
    output stdout
    format console
  }

  tls internal {
    on_demand
  }
}

hello.dev {
  import shared

  reverse_proxy /foo* {env.DOCKER_HOST_IP}:5521
  reverse_proxy /bar* {env.DOCKER_HOST_IP}:5531
  reverse_proxy * {env.DOCKER_HOST_IP}:5510
}

joegoggins commented 1 year ago

I tried installing https://desktop-stage.docker.com/mac/main/arm64/106744/Docker.dmg on my M1 MacBook running Ventura 13.3.1.

Running /Applications/Docker.app/Contents/MacOS/com.docker.diagnose check shows com.docker.vpnkit is not running.

When I open ~/Library/Group Containers/group.com.docker/settings.json I see

  "networkType": "gvisor",
  "useVpnkit": true,

Changing networkType to vpnkit, then quitting and restarting Docker Desktop, makes the diagnostics work again.

Then I run docker stack deploy and things work initially; I'm able to open https://localhost:3001 locally. I'm running as a single-node swarm here.

When I run docker stack rm though the system reverts to a failed state. After this,

docker ps returns Error response from daemon: Bad response from Docker engine

Then when I re-run diagnostics, it errors like this (below) and fatally exits docker, so next run of docker ps returns Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Starting diagnostics

[PASS] DD0027: is there available disk space on the host?
[PASS] DD0028: is there available VM disk space?
[PASS] DD0018: does the host support virtualization?
[PASS] DD0001: is the application running?
[PASS] DD0017: can a VM be started?
[PASS] DD0016: is the LinuxKit VM running?
[PASS] DD0011: are the LinuxKit services running?
[FAIL] DD0004: is the Docker engine running? Get "http://ipc/docker": EOF
[2023-05-12T15:58:21.464148000Z][com.docker.diagnose][I] ipc.NewClient: bf8f3295-com.docker.diagnose -> lifecycle-server.sock VMDockerdAPI
[2023-05-12T15:58:21.464306000Z][com.docker.diagnose][I] (75238f29) bf8f3295-com.docker.diagnose C->S VMDockerdAPI GET /docker
[2023-05-12T15:58:21.466498000Z][com.docker.diagnose][W] (75238f29) bf8f3295-com.docker.diagnose C<-S NoResponse GET /docker (2.18775ms): Get "http://ipc/docker": EOF
[2023-05-12T15:58:21.466694000Z][com.docker.diagnose][I] (75238f29-1) bf8f3295-com.docker.diagnose C->S VMDockerdAPI GET /ping
[2023-05-12T15:58:21.469138000Z][com.docker.diagnose][W] (75238f29-1) bf8f3295-com.docker.diagnose C<-S NoResponse GET /ping (2.442333ms): Get "http://ipc/ping": EOF
[2023-05-12T15:58:22.470633000Z][com.docker.diagnose][I] (75238f29-2) bf8f3295-com.docker.diagnose C->S VMDockerdAPI GET /ping
[2023-05-12T15:58:22.473932000Z][com.docker.diagnose][W] (75238f29-2) bf8f3295-com.docker.diagnose C<-S NoResponse GET /ping (3.396167ms): Get "http://ipc/ping": EOF
[2023-05-12T15:58:23.474808000Z][com.docker.diagnose][I] (75238f29-3) bf8f3295-com.docker.diagnose C->S VMDockerdAPI GET /ping
[2023-05-12T15:58:23.479509000Z][com.docker.diagnose][W] (75238f29-3) bf8f3295-com.docker.diagnose C<-S NoResponse GET /ping (4.702083ms): Get "http://ipc/ping": EOF
[2023-05-12T15:58:24.481745000Z][com.docker.diagnose][I] (75238f29-4) bf8f3295-com.docker.diagnose C->S VMDockerdAPI GET /ping
[2023-05-12T15:58:24.487565000Z][com.docker.diagnose][W] (75238f29-4) bf8f3295-com.docker.diagnose C<-S NoResponse GET /ping (5.810834ms): Get "http://ipc/ping": EOF
[2023-05-12T15:58:25.492241000Z][com.docker.diagnose][I] (75238f29-5) bf8f3295-com.docker.diagnose C->S VMDockerdAPI GET /ping
[2023-05-12T15:58:25.495886000Z][com.docker.diagnose][W] (75238f29-5) bf8f3295-com.docker.diagnose C<-S NoResponse GET /ping (3.644792ms): Get "http://ipc/ping": EOF
[2023-05-12T15:58:26.497771000Z][com.docker.diagnose][I] (75238f29-6) bf8f3295-com.docker.diagnose C->S VMDockerdAPI GET /ping
[2023-05-12T15:58:26.503528000Z][com.docker.diagnose][W] (75238f29-6) bf8f3295-com.docker.diagnose C<-S NoResponse GET /ping (5.739834ms): Get "http://ipc/ping": EOF
[2023-05-12T15:58:27.505617000Z][com.docker.diagnose][I] (75238f29-7) bf8f3295-com.docker.diagnose C->S VMDockerdAPI GET /ping
[2023-05-12T15:58:27.509615000Z][com.docker.diagnose][W] (75238f29-7) bf8f3295-com.docker.diagnose C<-S NoResponse GET /ping (4.003041ms): Get "http://ipc/ping": EOF
[2023-05-12T15:58:28.511542000Z][com.docker.diagnose][I] (75238f29-8) bf8f3295-com.docker.diagnose C->S VMDockerdAPI GET /ping
[2023-05-12T15:58:28.523601000Z][com.docker.diagnose][W] (75238f29-8) bf8f3295-com.docker.diagnose C<-S NoResponse GET /ping (12.069583ms): Get "http://ipc/ping": EOF

[PASS] DD0015: are the binary symlinks installed?
[FAIL] DD0031: does the Docker API work? error during connect: Get "http://docker.raw.sock/v1.24/containers/json": EOF
[PASS] DD0013: is the $PATH ok?
Error response from daemon: Bad response from Docker engine
[FAIL] DD0003: is the Docker CLI working? exit status 1
[FAIL] DD0038: is the connection to Docker working? HTTP GET https://login.docker.com: Get "https://login.docker.com": proxyconnect tcp: dial unix httpproxy.sock: connect: no such file or directory
[PASS] DD0014: are the backend processes running?
[PASS] DD0007: is the backend responding?
[PASS] DD0008: is the native API responding?
[PASS] DD0009: is the vpnkit API responding?
[PASS] DD0010: is the Docker API proxy responding?
[SKIP] DD0030: is the image access management authorized?
[PASS] DD0033: does the host have Internet access?
[PASS] DD0018: does the host support virtualization?
[PASS] DD0001: is the application running?
[PASS] DD0017: can a VM be started?
[PASS] DD0016: is the LinuxKit VM running?
[PASS] DD0011: are the LinuxKit services running?
[WARN] DD0004: is the Docker engine running? Get "http://ipc/docker": EOF
[PASS] DD0015: are the binary symlinks installed?
[WARN] DD0031: does the Docker API work? error during connect: Get "http://docker.raw.sock/v1.24/containers/json": EOF
[WARN] DD0032: do Docker networks overlap with host IPs? error during connect: Get "http://docker.raw.sock/v1.24/networks": EOF

Oddly, when I start Docker again, deploy and destroy my stack it does not crash and diagnostics pass now. For now, it looks like I can do some actual development without Docker Desktop crashing. Will report back if I see other issues come up.

...and the Error response from daemon: Bad response from Docker engine error is back. Since reverting to 4.18 didn't help either, I'm going back to known good state on 4.12 and will stay tuned to this issue for a fix in a future release.

eararipe commented 1 year ago

Hi, I'm having a problem that might be similar. I have an app that binds to port 8081 and will not start on a hackintosh with Docker Desktop 4.19 and Ventura 13.3.1.

It works on another hackintosh and on my MacBook. They are all Intel machines.

The affected computer has an RTL8125 NIC. It works on 4.18, and neither changing the network type to vpnkit nor using @djs55's release candidate helped. Can I help in any way?

edit: I noticed that on the problem computer the port is shown as 0.0.0.0:8081, while on the others it's localhost:8081

djs55 commented 1 year ago

Thanks for the reports. I notice env.DOCKER_HOST_IP in one of the examples above. I'm not sure where this environment variable is defined, but the best way to access the host is via the DNS name host.docker.internal. In 4.19 the IP address changed:

> docker run -it alpine
/ # ping host.docker.internal
PING host.docker.internal (192.168.65.254): 56 data bytes
64 bytes from 192.168.65.254: seq=0 ttl=62 time=3.080 ms
^C
--- host.docker.internal ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 3.080/3.080/3.080 ms

Can you use the DNS name instead of an IP? This ensures the value will be correct even if the internal IP range is changed in the settings UI.
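Where a tool insists on a literal address (as with the 192.168.65.2 seen in the Caddy error above), one option is to resolve the DNS name at startup rather than hard-coding the IP. This is a minimal sketch; the function name `resolve_upstream` and the injectable `resolver` parameter are hypothetical and exist only to keep the example testable.

```python
import socket


def resolve_upstream(port, hostname="host.docker.internal", resolver=socket.gethostbyname):
    """Return "ip:port" for the host, resolving the name at call time."""
    try:
        ip = resolver(hostname)
    except OSError as err:
        # socket.gaierror (name resolution failure) is a subclass of OSError.
        raise RuntimeError(f"could not resolve {hostname!r}: {err}") from err
    return f"{ip}:{port}"
```

Run inside a container, `resolve_upstream(5510)` would return whatever address the current Docker Desktop version assigns to the host, so an IP change like the one in 4.19 would not need a config update. Passing the DNS name straight through, where the tool accepts it, is simpler still.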

sandstrom commented 1 year ago

In our case DOCKER_HOST_IP was hard-coded to 192.168.65.2. We'll investigate if the changed IP might have been the reason.

davem-foreflight commented 1 year ago

Just ran into this with 4.19 on Friday... Burned a day because docker was my last suspected reason for a metadata lookup to fail. :^).

lorenrh commented 1 year ago

Closing this issue because a fix has been released in Docker Desktop 4.20. See the release notes for more details.