Image pulls initiated by K8S are subject to a 2 minute timeout

adamalex commented 2 years ago

[x] I have tried with the latest version of Docker Desktop
[x] I have tried disabling enabled experimental features
[x] I have uploaded Diagnostics
Diagnostics ID: 141CCEF5-12D2-44D4-AA0D-2A159BFBD826/20220502204139

Expected behavior

Image pulls initiated by K8S should succeed even if they take longer than 2 minutes.

Actual behavior

Image pulls initiated by K8S result in ImagePullBackOff if the download does not complete within 2 minutes. The image pull is retried, but the pod stays in this status indefinitely if every retry also takes longer than 2 minutes.

Information

Is it reproducible? Yes
Is the problem new? Possibly
Did the problem appear with an update? Only noticed in the past couple Docker Desktop versions
macOS Version: 10.15.7
Intel chip or Apple chip: Intel
Docker Desktop Version: 4.7.1 (77678)

Output of `/Applications/Docker.app/Contents/MacOS/com.docker.diagnose check`

Starting diagnostics

[PASS] DD0027: is there available disk space on the host?
[PASS] DD0028: is there available VM disk space?
[PASS] DD0031: does the Docker API work?
[PASS] DD0004: is the Docker engine running?
[PASS] DD0011: are the LinuxKit services running?
[PASS] DD0016: is the LinuxKit VM running?
[PASS] DD0001: is the application running?
[PASS] DD0018: does the host support virtualization?
[PASS] DD0017: can a VM be started?
[PASS] DD0015: are the binary symlinks installed?
[PASS] DD0003: is the Docker CLI working?
[PASS] DD0013: is the $PATH ok?
[PASS] DD0007: is the backend responding?
[PASS] DD0014: are the backend processes running?
[PASS] DD0008: is the native API responding?
[PASS] DD0009: is the vpnkit API responding?
[PASS] DD0010: is the Docker API proxy responding?
[FAIL] DD0012: is the VM networking working? network checks failed: failed to ping host: exit status 1
[2022-05-02T20:56:49.692594000Z][com.docker.diagnose][I] ipc.NewClient: a06a3fb2-diagnose-network -> <HOME>/Library/Containers/com.docker.docker/Data/diagnosticd.sock diagnosticsd
[common/pkg/diagkit/gather/diagnose.runIsVMNetworkingOK()
[   common/pkg/diagkit/gather/diagnose/network.go:34 +0xdd
[common/pkg/diagkit/gather/diagnose.(*test).GetResult(0x4d30320)
[   common/pkg/diagkit/gather/diagnose/test.go:46 +0x43
[common/pkg/diagkit/gather/diagnose.Run.func1(0x4d30320)
[   common/pkg/diagkit/gather/diagnose/run.go:17 +0x5a
[common/pkg/diagkit/gather/diagnose.walkOnce.func1(0x2?, 0x4d30320)
[   common/pkg/diagkit/gather/diagnose/run.go:140 +0x77
[common/pkg/diagkit/gather/diagnose.walkDepthFirst(0x1, 0x4d30320, 0xc000787730)
[   common/pkg/diagkit/gather/diagnose/run.go:146 +0x36
[common/pkg/diagkit/gather/diagnose.walkDepthFirst(0x0, 0x4?, 0xc000787730)
[   common/pkg/diagkit/gather/diagnose/run.go:149 +0x73
[common/pkg/diagkit/gather/diagnose.walkOnce(0x46eca00?, 0xc00035f890)
[   common/pkg/diagkit/gather/diagnose/run.go:135 +0xcc
[common/pkg/diagkit/gather/diagnose.Run(0x4d301a0, 0x46e6020?, {0xc00035fb18, 0x1, 0x1})
[   common/pkg/diagkit/gather/diagnose/run.go:16 +0x1cb
[main.checkCmd({0xc000032050?, 0x6?, 0x4?}, {0x0, 0x0})
[   common/cmd/com.docker.diagnose/main.go:131 +0x105
[main.main()
[   common/cmd/com.docker.diagnose/main.go:97 +0x2a8
[2022-05-02T20:56:49.692706000Z][com.docker.diagnose][I] (d8fd78f5) a06a3fb2-diagnose-network C->S diagnosticsd POST /check-network-connectivity: {"ips":["xxx.xxx.xxx.xxx","yyy.yyy.yyy.yyy"]}
[2022-05-02T20:56:50.208189000Z][com.docker.diagnose][E] (d8fd78f5) a06a3fb2-diagnose-network C<-S 95df0e1e-diagnosticsd POST /check-network-connectivity (515.501828ms): failed to ping host: exit status 1
[common/pkg/diagkit/gather/diagnose.runIsVMNetworkingOK()
[   common/pkg/diagkit/gather/diagnose/network.go:35 +0x15b
[common/pkg/diagkit/gather/diagnose.(*test).GetResult(0x4d30320)
[   common/pkg/diagkit/gather/diagnose/test.go:46 +0x43
[common/pkg/diagkit/gather/diagnose.Run.func1(0x4d30320)
[   common/pkg/diagkit/gather/diagnose/run.go:17 +0x5a
[common/pkg/diagkit/gather/diagnose.walkOnce.func1(0x2?, 0x4d30320)
[   common/pkg/diagkit/gather/diagnose/run.go:140 +0x77
[common/pkg/diagkit/gather/diagnose.walkDepthFirst(0x1, 0x4d30320, 0xc000787730)
[   common/pkg/diagkit/gather/diagnose/run.go:146 +0x36
[common/pkg/diagkit/gather/diagnose.walkDepthFirst(0x0, 0x4?, 0xc000787730)
[   common/pkg/diagkit/gather/diagnose/run.go:149 +0x73
[common/pkg/diagkit/gather/diagnose.walkOnce(0x46eca00?, 0xc00035f890)
[   common/pkg/diagkit/gather/diagnose/run.go:135 +0xcc
[common/pkg/diagkit/gather/diagnose.Run(0x4d301a0, 0x46e6020?, {0xc00035fb18, 0x1, 0x1})
[   common/pkg/diagkit/gather/diagnose/run.go:16 +0x1cb
[main.checkCmd({0xc000032050?, 0x6?, 0x4?}, {0x0, 0x0})
[   common/cmd/com.docker.diagnose/main.go:131 +0x105
[main.main()
[   common/cmd/com.docker.diagnose/main.go:97 +0x2a8

[PASS] DD0032: do Docker networks overlap with host IPs?
[SKIP] DD0030: is the image access management authorized?
[PASS] DD0019: is the com.docker.vmnetd process responding?
[PASS] DD0033: does the host have Internet access?

Please investigate the following 1 issue:

1 : The test: is the VM networking working?
    Failed with: network checks failed: failed to ping host: exit status 1

VM seems to have a network connectivity issue. Please check your host firewall and anti-virus settings in case they are blocking the VM.

Steps to reproduce the behavior

This test pod uses a large image:

apiVersion: v1
kind: Pod
metadata:
  name: splunktest
spec:
  containers:
  - name: splunktest
    image: splunk/splunk
    env:
    - name: SPLUNK_START_ARGS
      value: --accept-license
    - name: SPLUNK_PASSWORD
      value: password

1) Save the above yaml to a file such as test.yaml
2) Run kubectl apply -f test.yaml
3) If the connection is slow enough, the pod will enter the ImagePullBackOff state after 2 minutes
4) Notice that docker pull splunk/splunk will succeed, even if it takes longer than 2 minutes

adamalex commented 2 years ago

I am not sure the network diagnostics error shown above is an issue (note that I replaced the two IP addresses with x.x.x.x and y.y.y.y). Other images pull via K8S without issue as long as they pull within 2 minutes. All of my other use cases are working fine.

adamalex commented 2 years ago

This may be configurable with the runtimeRequestTimeout setting documented at https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/

Is it possible to customize the kubelet configuration used by Docker Desktop?

mgabeler-lee-6rs commented 2 years ago

Having a similar issue here. Other info I've found suggests adjusting the image-pull-progress-deadline parameter, but that doesn't seem to be configurable for DD either, since the kubelet.yaml file DD uses appears to be created on the fly.

alubbe commented 2 years ago

Rancher Desktop is also affected by this: https://github.com/rancher-sandbox/rancher-desktop/issues/2303

I just wanted to add to this issue that when you're developing locally from your home office, your internet may not be fast enough to download multi-gigabyte Docker images in under 2 minutes. Even for smaller images, you're at the mercy of a shaky internet connection (family might be using Netflix or torrents). The current 2 minute time limit results in a very bad user experience.

From my own research, I agree with @adamalex that this can be fixed by changing the kubelet configuration, specifically by increasing the runtimeRequestTimeout setting documented at https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/

I see three potential solutions:

1) Find a new global default (like increasing it to 10 minutes or similar, but this might have unintended consequences)
2) Make this number configurable by the user via the UI
3) Give users the ability to modify the kubelet config file on disk and change k8s to use this file when it exists

Shady6 commented 2 years ago

Hello, has anybody found a workaround for this? I know you can use this https://github.com/justincormack/nsenter1 to get inside of the docker-desktop VM but how could we access master node and the kubelet command from there?

mj3c commented 2 years ago

Experiencing issues due to this 2 minute limit as well. This forces users to do a docker pull of all the big images before creating the K8s resources that need them. It seems pretty important to have an option to configure/increase this timeout. Any ideas/workarounds yet?
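
The pre-pull workaround described above can be scripted. This is a minimal sketch, not an official tool: it assumes each image appears in the manifest as a plain `image: <ref>` line, and that Docker Desktop's Kubernetes reuses images already pulled with the Docker CLI, as reported in this thread. The function names `list_images` and `prepull_images` are hypothetical.

```shell
# List every "image:" value found in a Kubernetes manifest, de-duplicated.
# Naive text match: assumes one image per line, written as "image: <ref>".
list_images() {
  sed -n 's/^[[:space:]-]*image:[[:space:]]*//p' "$1" | tr -d "\"'" | sort -u
}

# Pull each listed image with the Docker CLI, which is not subject to
# the 2 minute limit seen on pulls initiated by K8S.
prepull_images() {
  list_images "$1" | while read -r image; do
    docker pull "$image" || echo "failed to pull $image" >&2
  done
}

# Usage (hypothetical): prepull_images test.yaml && kubectl apply -f test.yaml
```

The text match does not understand YAML, so templated manifests or image lines with trailing comments would need a real parser.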

Meemaw commented 2 years ago

Trying to use tilt with Docker Desktop but these timeouts really complicate things. @nicks do you have any recommendations?

nicks commented 2 years ago

does it fix the problem if you pull the image before you deploy your pods? ya, configuring the kubelet on the fly is non-trivial since upstream kubernetes removed DynamicKubeletConfiguration :\

Meemaw commented 2 years ago

@nicks we are working around this by running docker pull in local_resource blocks and setting these as resource_deps for our k8s_resource, so the image is pulled on the host and then reused by docker-desktop k8s.

This is very ugly though, and complicates the Tiltfile significantly.

luiz1361 commented 2 years ago

Any news on this issue? At the moment having to pull large images from outside deployments via 'docker pull' command as a workaround due to large images timing out.

nicks commented 2 years ago

no news yet.

There might be a short-term workaround to shell into the VM (https://www.bretfisher.com/docker-for-mac-commands-for-getting-into-local-docker-vm/) and restart the kubelet with --runtime-request-timeout (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/) ...you could probably write a container that does it...but i'd need to sit down and fiddle with it for a while to figure out the right incantation.

the "right" solution is to provide some way to pass config parameters to kubeadm in docker desktop (as in https://github.com/docker/roadmap/issues/139), but this is a more complex product question.

nicks commented 1 year ago

OK, the super hacky way to do this (which i should probably package up into a container):

1) Start docker desktop and enable kubernetes

2) In a terminal, run:

docker run -it --rm --privileged --pid=host justincormack/nsenter1
nsenter -t $(pgrep kubelet) -m
echo "runtimeRequestTimeout: 10m" >> /etc/kubeadm/kubelet.yaml

3) Reset the kubernetes cluster from the UI (which will pick up the kubelet config changes)

WARNING: use at your own risk, if you mess up the kubeadm config in the VM you'll probably have to factory reset to fix it.
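
One caveat with the `echo ... >>` line in step 2: running it twice, or running it when the key is already present, leaves a duplicate key in the file. Below is a hedged sketch of an idempotent variant. The function name `set_runtime_request_timeout` is hypothetical, the file path is the one given in the steps above, and the sketch assumes the key sits at the top level of the file with no leading indentation.

```shell
# Set runtimeRequestTimeout in a kubelet config file without duplicating the key.
#   $1: path to the config file (/etc/kubeadm/kubelet.yaml in the steps above)
#   $2: timeout value, e.g. 10m
set_runtime_request_timeout() {
  file=$1
  value=$2
  # Keep a copy so a bad edit can be reverted by hand.
  cp "$file" "$file.bak" || return 1
  if grep -q '^runtimeRequestTimeout:' "$file"; then
    # Key already present: rewrite that line, reading from the backup copy.
    sed "s/^runtimeRequestTimeout:.*/runtimeRequestTimeout: $value/" "$file.bak" > "$file"
  else
    # Key absent: append it, as in the original steps.
    echo "runtimeRequestTimeout: $value" >> "$file"
  fi
}

# Usage (hypothetical): set_runtime_request_timeout /etc/kubeadm/kubelet.yaml 10m
```

The same warning applies: this only edits the file, and a cluster reset from the UI is still needed for the change to be picked up.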

Venryx commented 1 year ago

In a terminal, run:

docker run -it --rm --privileged --pid=host justincormack/nsenter1
nsenter -t $(pgrep kubelet) -m
echo "runtimeRequestTimeout: 10m" >> /etc/kubeadm/kubelet.yaml

Is there a way to do this using docker exec rather than nsenter?

I started attempting your approach using nsenter, but then realized I would also need to install an nsenter binary on my host computer, which I'm not sure how to do on Windows. (also, nsenter's repo is archived so I'm wanting to find a solution that will work further into the future)

Also, I was able to find two files in Docker Desktop's filesystem (accessible under \\wsl$ through Windows Explorer) that should have fixed the issue. (similar to your solution above, and building off my successful experimenting with another Docker Desktop config problem described here)

Namely, these two files (in Docker Desktop 4.11.0):

1) \\wsl$\docker-desktop-data\data\kubelet\config.yaml

runtimeRequestTimeout: 30m   <-- line modified according to https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/

2) \\wsl$\docker-desktop-data\data\kubelet\kubeadm-flags.env

KUBELET_KUBEADM_ARGS="--container-runtime=remote --container-runtime-endpoint=unix:///var/run/cri-dockerd.sock --pod-infra-container-image=k8s.gcr.io/pause:3.7 --runtime-request-timeout=30m"
   <-- line modified according to https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

However, even after restarting Docker Desktop and WSL, the config changes above don't seem to have made a difference! (EDIT: Apparently changing this flag in Rancher Desktop hit a similar problem for someone. So maybe the flag, even if successfully provided, just doesn't work in some circumstances/contexts?)

Anyway, if anyone successfully makes the config change on Windows, please share the exact steps that were required. (my file change above should have fixed it, but somehow didn't -- I guess it's possible the timeouts I've been hitting are due to some other issue, but that seems unlikely)

Venryx commented 1 year ago

Update: It appears the runtimeRequestTimeout config option no longer works with Kubernetes 1.24.0+ in Docker Desktop, because the container runtime changed from dockershim to cri-dockerd, which apparently does not respect that setting and thus falls back to the default timeout of 2m.

See this comment on how someone fixed the issue by changing the container runtime (when using minikube).

See this issue for a diagnosis of the problem in the source repository, and this pull-request that fixes it. (it appears that this commit was then included in the 0.2.6 release of cri-dockerd -- not sure if this became part of the Kubernetes 1.25.0 release or not)

Anyway, since the root issue appears to be fixed (and it just didn't make it into the Docker Desktop v4.11.0 / Kubernetes v1.24.2 that I'm using), I'm content to use workarounds for now. (eg. manual docker pull X commands prior to the pod deploy/launch)

Update2: I'm now on Kubernetes v1.25.2 (through Docker Desktop 4.15.0), and I still get the "context deadline exceeded" issue in some cases. So, my inference is that the fix referenced above did not make it into Kubernetes v1.25.2. If someone eventually confirms the fix to have been included in a future version, please let everyone know.
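
For anyone trying to confirm this, one option is to compare the cri-dockerd version bundled with a given Docker Desktop release against the 0.2.6 release mentioned above. The helper below is only a sketch: `version_ge` is a hypothetical name, it handles plain numeric MAJOR.MINOR.PATCH strings only, and how to obtain the bundled cri-dockerd version is left open here.

```shell
# Return success (0) if dotted version $1 is greater than or equal to $2.
# Both arguments must be numeric MAJOR.MINOR.PATCH strings; returns 2 otherwise.
version_ge() {
  old_ifs=$IFS
  IFS=.
  # Word splitting on "." is intentional: yields six positional fields.
  set -- $1 $2
  IFS=$old_ifs
  [ "$#" -eq 6 ] || return 2
  # Compare major, then minor; the first differing field decides.
  if [ "$1" -ne "$4" ]; then [ "$1" -gt "$4" ]; return; fi
  if [ "$2" -ne "$5" ]; then [ "$2" -gt "$5" ]; return; fi
  [ "$3" -ge "$6" ]
}

# Usage (hypothetical): version_ge "$bundled_version" 0.2.6 && echo "fix should be included"
```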

robwithhair commented 1 year ago

Does anyone know if there is a way to get this fix on a local install of docker desktop? My understanding from the above is that I just need to wait for the next version to be released?

docker-robott commented 1 year ago

There hasn't been any activity on this issue for a long time. If the problem is still relevant, mark the issue as fresh with a /remove-lifecycle stale comment. If not, this issue will be closed in 30 days.

Prevent issues from auto-closing with a /lifecycle frozen comment.

/lifecycle stale

mj3c commented 1 year ago

This is still an issue that does not have an easy workaround. We are currently using an init container that performs docker pull <image> before starting the main container in order to get around this 2 minute timeout...

The solutions proposed by @alubbe would be much better. Can any of these be implemented?

/remove-lifecycle stale
