confidential-containers / cloud-api-adaptor

Ability to create Kata pods using cloud provider APIs aka the peer-pods approach
Apache License 2.0

Azure: when deleting a pod, container stuck on worker node #1223

Open iaguis opened 1 year ago

iaguis commented 1 year ago

Sometimes, when deleting a pod, the pod gets stuck.
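
For illustration, this is roughly what it looks like from the cluster side (a sketch, not output captured from the affected cluster; the pod name matches the nginx-pv pod in the logs below):

```
$ kubectl delete pod nginx-pv --wait=false
pod "nginx-pv" deleted
$ kubectl get pod nginx-pv
NAME       READY   STATUS        RESTARTS   AGE
nginx-pv   3/3     Terminating   0          25m
```

The pod stays in Terminating indefinitely because its containers are never torn down on the worker node.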

CAA daemonset logs:

```
2023/07/19 09:14:07 [adaptor/proxy] RemoveContainer: containerID:0147b2970c411952d85fd259c68ef6ebd41d63409a9ec4748ef1011f71de86ce
2023/07/19 09:14:14 [adaptor/proxy] RemoveContainer fails: rpc error: code = Internal desc = destroy cgroups
Caused by:
    0: failed to stop unit cri-containerd-0147b2970c411952d85fd259c68ef6ebd41d63409a9ec4748ef1011f71de86ce.scope
    1: org.freedesktop.systemd1.NoSuchUnit: Unit cri-containerd-0147b2970c411952d85fd259c68ef6ebd41d63409a9ec4748ef1011f71de86ce.scope not loaded.
2023/07/19 09:18:14 [adaptor/proxy] RemoveContainer: containerID:7e24553eca916714ab0ebeeed0e56933d7cfc7e23a753e82b9b6eb68f4e261f9
2023/07/19 09:18:14 [adaptor/proxy] RemoveContainer fails: rpc error: code = Internal desc = destroy cgroups
Caused by:
    0: failed to stop unit cri-containerd-7e24553eca916714ab0ebeeed0e56933d7cfc7e23a753e82b9b6eb68f4e261f9.scope
    1: org.freedesktop.systemd1.NoSuchUnit: Unit cri-containerd-7e24553eca916714ab0ebeeed0e56933d7cfc7e23a753e82b9b6eb68f4e261f9.scope not loaded.
2023/07/19 09:18:14 [adaptor/proxy] RemoveContainer: containerID:20b8c21cdba09ec43e2efaef1460097b6aaf15944f4d484110a03f698aae1373
2023/07/19 09:18:14 [adaptor/proxy] RemoveContainer fails: rpc error: code = Internal desc = destroy cgroups
Caused by:
    0: failed to stop unit cri-containerd-20b8c21cdba09ec43e2efaef1460097b6aaf15944f4d484110a03f698aae1373.scope
    1: org.freedesktop.systemd1.NoSuchUnit: Unit cri-containerd-20b8c21cdba09ec43e2efaef1460097b6aaf15944f4d484110a03f698aae1373.scope not loaded.
```
For reference, these are logs from a working delete:

```
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer: containerID:8e6893f2119545fc26f511eee4e850eb2f9ca50a205e30fb40987440062a9475
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer fails: rpc error: code = Internal desc = destroy cgroups
Caused by:
    0: failed to stop unit cri-containerd-8e6893f2119545fc26f511eee4e850eb2f9ca50a205e30fb40987440062a9475.scope
    1: org.freedesktop.systemd1.NoSuchUnit: Unit cri-containerd-8e6893f2119545fc26f511eee4e850eb2f9ca50a205e30fb40987440062a9475.scope not loaded.
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer: containerID:8489770b0b57e217e958498381261f7dfae2c8225808879619c6a3a20f3e6c90
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer fails: rpc error: code = Internal desc = destroy cgroups
Caused by:
    0: failed to stop unit cri-containerd-8489770b0b57e217e958498381261f7dfae2c8225808879619c6a3a20f3e6c90.scope
    1: org.freedesktop.systemd1.NoSuchUnit: Unit cri-containerd-8489770b0b57e217e958498381261f7dfae2c8225808879619c6a3a20f3e6c90.scope not loaded.
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer: containerID:243765a0b8b2c34d226fb5d2cf39eb619db290aa7ed63ef3a677541f3ea6577a
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer fails: rpc error: code = Internal desc = destroy cgroups
Caused by:
    0: failed to stop unit cri-containerd-243765a0b8b2c34d226fb5d2cf39eb619db290aa7ed63ef3a677541f3ea6577a.scope
    1: org.freedesktop.systemd1.NoSuchUnit: Unit cri-containerd-243765a0b8b2c34d226fb5d2cf39eb619db290aa7ed63ef3a677541f3ea6577a.scope not loaded.
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer: containerID:3fee5da9022310dd154f723018868e1fc3c2296789656a4f0e44a3dc2283c313
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer fails: rpc error: code = Internal desc = destroy cgroups
Caused by:
    0: failed to stop unit cri-containerd-3fee5da9022310dd154f723018868e1fc3c2296789656a4f0e44a3dc2283c313.scope
    1: org.freedesktop.systemd1.NoSuchUnit: Unit cri-containerd-3fee5da9022310dd154f723018868e1fc3c2296789656a4f0e44a3dc2283c313.scope not loaded.
2023/07/19 09:58:59 [adaptor/proxy] DestroySandbox
2023/07/19 09:58:59 [adaptor/proxy] DestroySandbox fails: rpc error: code = Internal desc = No such file or directory (os error 2)
2023/07/19 09:58:59 [adaptor/proxy] shutting down socket forwarder
2023/07/19 09:59:40 [adaptor/cloud/azure] deleted VM successfully: podvm-nginx-pv-3fee5da9
2023/07/19 10:00:10 [adaptor/cloud/azure] deleted disk successfully: podvm-nginx-pv-3fee5da9-disk
2023/07/19 10:00:14 [adaptor/cloud/azure] deleted network interface successfully: podvm-nginx-pv-3fee5da9-net
2023/07/19 10:00:14 [adaptor/cloud] failed to release PeerPod pod to PeerPod mapping not found
2023/07/19 10:00:14 [tunneler/vxlan] Delete tc redirect filters on eth0 and eth0 in the network namespace /var/run/netns/cni-1ea9b076-8a92-ef33-29ff-18487df74691
2023/07/19 10:00:15 [tunneler/vxlan] Delete vxlan interface vxlan1 in the network namespace /var/run/netns/cni-1ea9b076-8a92-ef33-29ff-18487df74691
```
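
For completeness, the CAA logs above come from the cloud-api-adaptor daemonset; something like this tails them (daemonset and namespace names assumed from a default peer-pods install):

```
$ kubectl logs -n confidential-containers-system ds/cloud-api-adaptor-daemonset -f
```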

Deleting containers manually with crictl doesn't work:

```
root@aks-nodepool1-40650097-vmss000000:/# crictl ps | grep nginx-pv
0147b2970c411       b6c621311b44a       19 minutes ago      Running             nginx                          0                   c60cc73149eb3       nginx-pv
20b8c21cdba09       3a20b792a6ebd       19 minutes ago      Running             csi-podvm-wrapper              0                   c60cc73149eb3       nginx-pv
7e24553eca916       4335937adcc26       19 minutes ago      Running             azure-file-podvm-node-driver   0                   c60cc73149eb3       nginx-pv
root@aks-nodepool1-40650097-vmss000000:/# crictl stop 7e24553eca916
E0719 09:24:53.631961 1252633 remote_runtime.go:505] "StopContainer from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="7e24553eca916"
FATA[0002] stopping the container "7e24553eca916": rpc error: code = DeadlineExceeded desc = context deadline exceeded
root@aks-nodepool1-40650097-vmss000000:/# crictl stop 7e24553eca916
E0719 09:24:58.598433 1252697 remote_runtime.go:505] "StopContainer from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="7e24553eca916"
FATA[0002] stopping the container "7e24553eca916": rpc error: code = DeadlineExceeded desc = context deadline exceeded
root@aks-nodepool1-40650097-vmss000000:/# crictl stop -t 2 7e24553eca916
E0719 09:25:50.945311 1253610 remote_runtime.go:505] "StopContainer from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="7e24553eca916"
FATA[0004] stopping the container "7e24553eca916": rpc error: code = DeadlineExceeded desc = context deadline exceeded
```
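
One way to tie these containers to their Kata shim is to take the POD ID column from `crictl ps` (c60cc73149eb3 above) and look it up in the process list (a sketch using the IDs from this node):

```
root@aks-nodepool1-40650097-vmss000000:/# crictl pods | grep nginx-pv
root@aks-nodepool1-40650097-vmss000000:/# ps aux | grep containerd-shim-kata-v2 | grep c60cc73149eb3
```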

A containerd-shim-kata-v2 process is stuck:

```
root     1240164  0.0  0.4 1415732 40260 ?       Sl   09:13   0:00 /opt/confidential-containers/bin/containerd-shim-kata-v2 -namespace k8s.io -address /run/containerd/containerd.sock -publish-binary /opt/confidential-containers/bin/containerd -id c60cc73149eb3b90b78487ebfb4cd117a7ad7e2779021ed7881e9198acaa0748
```

Killing that process allows the containers to be stopped:

```
root@aks-nodepool1-40650097-vmss000000:/# kill 1240164
root@aks-nodepool1-40650097-vmss000000:/# crictl stop 7e24553eca916
7e24553eca916
root@aks-nodepool1-40650097-vmss000000:/# crictl stop -t 2 0147b2970c411
0147b2970c411
root@aks-nodepool1-40650097-vmss000000:/# crictl stop -t 2 a7fec7ea686f6
a7fec7ea686f6
root@aks-nodepool1-40650097-vmss000000:/# crictl ps | grep nginx-pv
root@aks-nodepool1-40650097-vmss000000:/#
```
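
After the containers are stopped, the pod sandbox itself may still be left behind; stopping and removing it with crictl should clean that up (a sketch; the POD ID is taken from the crictl ps output above):

```
root@aks-nodepool1-40650097-vmss000000:/# crictl stopp c60cc73149eb3
root@aks-nodepool1-40650097-vmss000000:/# crictl rmp c60cc73149eb3
```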

I suspect it's some kind of timing issue. We used to have a cloud-init bug in the Azure podvm image that made it slower to start, but since switching to a podvm image with that bug fixed, I haven't managed to reproduce this issue yet.

mkulke commented 1 year ago

I'm observing similar behaviour on occasion. I have to force-delete the pod and remove the infra resources myself.
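
A rough sketch of that manual cleanup, assuming the peer-pod resource names from the logs above and a placeholder resource group:

```
# Force-delete the stuck pod without waiting for the runtime
$ kubectl delete pod nginx-pv --grace-period=0 --force

# Remove the leftover peer-pod resources in Azure (<resource-group> is a placeholder)
$ az vm delete -g <resource-group> -n podvm-nginx-pv-3fee5da9 --yes
$ az disk delete -g <resource-group> -n podvm-nginx-pv-3fee5da9-disk --yes
$ az network nic delete -g <resource-group> -n podvm-nginx-pv-3fee5da9-net
```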

I have the impression that a pod often gets stuck in Terminating if one of its containers fails to start, but I'm not able to reproduce it reliably.