anfechtung opened this issue 7 months ago
This is because https://github.com/Mirantis/cri-dockerd/pull/311 switched cri-dockerd from using the deprecated 'Binds' API to the new 'Mounts' API, which does not create missing directories by default: https://github.com/Mirantis/cri-dockerd/commit/bf1a9b950ffac52a722fc530444d6ac89b472170
To preserve backward-compatible behavior, we need to set CreateMountpoint to true (it defaults to false, the zero value) in GenerateMountBindings.
cc @nwneisen @AkihiroSuda
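For reference, a minimal sketch of what that looks like where GenerateMountBindings builds the Engine API mount entries (field names are from github.com/docker/docker/api/types/mount; hostPath, containerPath, and propagation are illustrative placeholders, not the exact cri-dockerd variables):

m := mount.Mount{
	Type:   mount.TypeBind,
	Source: hostPath, // bind source on the host, which may not exist yet
	Target: containerPath,
	BindOptions: &mount.BindOptions{
		Propagation: propagation,
		// Ask the daemon to create a missing source directory, matching the
		// old Binds API behavior; the zero value (false) leaves it missing.
		CreateMountpoint: true,
	},
}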
Isn't CreateMountpoint here working? https://github.com/Mirantis/cri-dockerd/blob/v0.3.12/libdocker/helpers.go#L224
Shoot, I missed that we're setting that in the diff. @anfechtung could you please let us know what Engine version you are using?
I am assuming by Engine you mean the docker runtime:
root@vm-compute1:~# docker --version
Docker version 24.0.2, build cb74dfc
root@vm-compute1:~#
docker --version only reports the version of the CLI; to get the daemon version, please provide the output of docker version (and also docker info), which will interrogate both the client and the server.
Client: Docker Engine - Community
Version: 24.0.2
API version: 1.43
Go version: go1.20.4
Git commit: cb74dfc
Built: Thu May 25 21:52:13 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 24.0.2
API version: 1.43 (minimum version 1.12)
Go version: go1.20.4
Git commit: 659604f
Built: Thu May 25 21:52:13 2023
OS/Arch: linux/amd64
Experimental: true
containerd:
Version: 1.6.21
GitCommit: 3dce8eb055cbb6872793272b4f20ed16117344f8
runc:
Version: 1.1.7
GitCommit: v1.1.7-0-g860f061
docker-init:
Version: 0.19.0
GitCommit: de40ad0
root@vm-compute1:~#
root@vm-compute1:~# docker info
Client: Docker Engine - Community
Version: 24.0.2
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.10.5
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.18.1
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 156
Running: 101
Paused: 0
Stopped: 55
Images: 79
Server Version: 24.0.2
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Default Runtime: runc
Init Binary: docker-init
containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
runc version: v1.1.7-0-g860f061
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
Kernel Version: 5.4.0-170-generic
Operating System: Ubuntu 18.04.6 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 12.74GiB
Name: vm-compute1
ID: 29543cf3-2f2a-45f6-a42a-cd31c9385775
Docker Root Dir: /var/lib/docker
Debug Mode: false
We might be downgrading the API version to <= v1.41 somewhere? https://github.com/moby/moby/blob/v24.0.2/api/server/router/container/container_routes.go#L526
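As a quick sanity check (a hedged sketch using the Go Engine client from github.com/docker/docker/client, not cri-dockerd code): CreateMountpoint only exists in API v1.42 and newer, so it is worth printing the version the client actually negotiates with the daemon.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	// Ping the daemon and settle on a common API version; if this prints
	// something below 1.42, CreateMountpoint will not be honored.
	cli.NegotiateAPIVersion(context.Background())
	fmt.Println("negotiated API version:", cli.ClientVersion())
}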
Do you have any potential workarounds? Or a planned fix? I am trying to determine if it makes sense to go down the rabbit hole of pre-creating all of the needed directories.
Someone has to figure out exactly what's going over the wire and whether the issue is on the client or server side. I don't think there are any workarounds outside of pre-creating the directories on the host.
https://github.com/Mirantis/cri-dockerd/pull/346 ought to solve this; would you mind testing a build off of master?
That being said, I think we should keep this issue open until we have a regression test.
Is there a deb package built from master, or would I need to build from master? Currently we are using the deb package to install.
You would need to build from master; there are instructions, and it is as trivial as a go build and moving the binary into the bin directory. Obviously that's not ideal and you'd want a release for production, but hopefully it validates the fix for you (and you'd get packages from the next patch release).
I compiled from master, and dropped the new binary on my cluster. I am still getting the same error. I tried setting the log level for the cri-docker service to debug, but it didn't produce anything useful.
After reading through the Docker documentation and the Go Docker libraries (Mount and Volume), I think this is simply the expected behavior when using Docker mounts.
It looks like some more digging will have to be done to determine where the fault lies; however, this is not the intended behavior. Kubernetes requires implicit directory creation because it was built around the Engine Binds API, which created missing directories by default. We specifically added a new option to the Mounts API in v23 to enable implicit directory creation, so if it doesn't work there is a bug either in the daemon or in cri-dockerd.
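To help narrow it down, a minimal daemon-only check could look like the sketch below (assumptions: the Go Engine client, a busybox image that is already pulled, and the placeholder path /tmp/does-not-exist). If this call fails with the same "bind source path does not exist" error, the bug is on the daemon side rather than in cri-dockerd.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/api/types/mount"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Create (but do not start) a container whose bind source is missing and
	// has CreateMountpoint set, bypassing Kubernetes and cri-dockerd entirely.
	resp, err := cli.ContainerCreate(ctx,
		&container.Config{Image: "busybox", Cmd: []string{"true"}},
		&container.HostConfig{
			Mounts: []mount.Mount{{
				Type:        mount.TypeBind,
				Source:      "/tmp/does-not-exist", // placeholder: a host path that does not exist
				Target:      "/mnt",
				BindOptions: &mount.BindOptions{CreateMountpoint: true},
			}},
		},
		nil, nil, "")
	if err != nil {
		log.Fatal(err) // failure here points at the daemon, not cri-dockerd
	}
	fmt.Println("created container:", resp.ID)
}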
I have the same problem with a Promtail Pod. It tries to bind the path /run/promtail but it can't. Normally, the directory should be created when the container starts. Nodes using cri-dockerd 0.3.11 are working normally.
Pod Event
Events:
Warning Failed 8m17s (x12 over 10m) kubelet Error: Error response from daemon: invalid mount config for type "bind": bind source path does not exist: /run/promtail
cri-dockerd version
$ cri-dockerd --version
cri-dockerd 0.3.12 (c2e3805)
Docker Information
$ docker version
Client: Docker Engine - Community
Version: 25.0.2
API version: 1.44
Go version: go1.21.6
Git commit: 29cf629
Built: Thu Feb 1 00:22:57 2024
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 25.0.2
API version: 1.44 (minimum version 1.24)
Go version: go1.21.6
Git commit: fce6e0c
Built: Thu Feb 1 00:22:57 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.28
GitCommit: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
runc:
Version: 1.1.12
GitCommit: v1.1.12-0-g51d5e94
docker-init:
Version: 0.19.0
GitCommit: de40ad0
$ docker info
Client: Docker Engine - Community
Version: 25.0.2
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.12.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.24.5
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 27
Running: 25
Paused: 0
Stopped: 2
Images: 24
Server Version: 25.0.2
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
runc version: v1.1.12-0-g51d5e94
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.15.0-102-generic
Operating System: Ubuntu 22.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 7.61GiB
Name: c3-pn-k8s-cp-01
ID: d9c0761d-e30b-482c-98b5-24129d5e370a
Docker Root Dir: /var/lib/docker
Debug Mode: false
Username: cthongrak
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
@corhere and @nwneisen are cooking up a new 0.3 release which should revert the problematic change; we still need to solve this for 0.4 in order to move forward.
Is there any minimal reproducer that does not depend on Calico?
Can't repro the issue with the following YAML:
---
apiVersion: v1
kind: Pod
metadata:
  name: bind
spec:
  volumes:
  - name: mnt
    hostPath:
      path: /tmp/non-existent
  containers:
  - name: busybox
    image: busybox
    args: ["sleep", "infinity"]
    volumeMounts:
    - name: mnt
      mountPath: /mnt
(cri-dockerd v0.3.12, Docker v26.0.1, Kubernetes v1.30.0)
Not sure what may have changed, but this same error does not occur in v0.3.13.
@anfechtung v0.3.13 has the problematic change #311 reverted.
Still can't repro the issue with calico. I wonder if the issue might have already been fixed in a recent version of Docker?
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.3/manifests/tigera-operator.yaml
Used minikube v1.33 (Kubernetes v1.30.0, Docker v26.0.1, cri-dockerd v0.3.12, according to strings /usr/bin/cri-dockerd).
Followed the "Operator" steps in https://docs.tigera.io/calico/3.27/getting-started/kubernetes/minikube
Bad Docker versions: <= v24.0.9, <= v25.0.3
Good Docker versions: >= v25.0.4, >= v26.0.0
Seems fixed in https://github.com/moby/moby/compare/v25.0.3...v25.0.4
I was able to reproduce the error and the fix. I followed the calico quickstart steps using minikube. This was all done using c2e3805, v0.3.12.
Using minikube v1.31.1, calico fails due to the missing mount
nneisen:~/code/cri-dockerd (master): minikube version
minikube version: v1.31.1
commit: fd3f3801765d093a485d255043149f92ec0a695f
nneisen:~/code/cri-dockerd (master): kubectl get pods -A
tigera-operator tigera-operator-786dc9d695-p86vw 0/1 CreateContainerError 0 24s
nneisen:~/code/cri-dockerd (master): kubectl describe pod tigera-operator-786dc9d695-p86vw -n tigera-operator
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 66s default-scheduler Successfully assigned tigera-operator/tigera-operator-786dc9d695-p86vw to minikube
Normal Pulling 66s kubelet Pulling image "quay.io/tigera/operator:v1.32.7"
Normal Pulled 60s kubelet Successfully pulled image "quay.io/tigera/operator:v1.32.7" in 5.814069007s (5.814076587s including waiting)
Warning Failed 10s (x6 over 60s) kubelet Error: Error response from daemon: invalid mount config for type "bind": bind source path does not exist: /var/lib/calico
Normal Pulled 10s (x5 over 60s) kubelet Container image "quay.io/tigera/operator:v1.32.7" already present on machine
After upgrading my minikube version to v1.33.0, calico deploys successfully
nneisen:~/code/cri-dockerd (master): minikube version
minikube version: v1.33.0
commit: 86fc9d54fca63f295d8737c8eacdbb7987e89c67
nneisen:~/code/cri-dockerd (master): kubectl get pods -A
tigera-operator tigera-operator-6678f5cb9d-h7c9f 1/1 Running 0 10s
nneisen:~/code/cri-dockerd (master): kubectl describe pod tigera-operator-6678f5cb9d-h7c9f -n tigera-operator
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m4s default-scheduler Successfully assigned tigera-operator/tigera-operator-6678f5cb9d-h7c9f to minikube
Normal Pulling 3m3s kubelet Pulling image "quay.io/tigera/operator:v1.32.7"
Normal Pulled 2m59s kubelet Successfully pulled image "quay.io/tigera/operator:v1.32.7" in 4.388s (4.388s including waiting). Image size: 69724923 bytes.
Normal Created 2m59s kubelet Created container tigera-operator
Normal Started 2m59s kubelet Started container tigera-operator
We should document that
cc: @corhere @neersighted @AkihiroSuda
Expected Behavior
Prior to v0.3.12 we were able to successfully install the calico CNI provider, using the tigera operator, on a bare-metal kubeadm-managed Kubernetes cluster.
Actual Behavior
After updating our process to use cri-dockerd v0.3.12, we see bind errors during the calico deployment.
Initially the tigera-operator fails to deploy.
After manually creating the folder /var/lib/calico on the controller node, the tigera-operator pod deploys, but the calico CNI pods fail with the same bind errors.
Steps to Reproduce the Problem
Specifications