Open iaguis opened 1 year ago
I'm observing similar behaviour on occasion. I have to force delete the pod and remove the infra resources myself.
I have the impression that a pod often gets stuck in Terminating when one of its containers fails to start, but I'm not able to reproduce it reliably.
Sometimes, when deleting a pod, the pod is stuck.
CAA daemonset logs
```
2023/07/19 09:14:07 [adaptor/proxy] RemoveContainer: containerID:0147b2970c411952d85fd259c68ef6ebd41d63409a9ec4748ef1011f71de86ce
2023/07/19 09:14:14 [adaptor/proxy] RemoveContainer fails: rpc error: code = Internal desc = destroy cgroups Caused by: 0: failed to stop unit cri-containerd-0147b2970c411952d85fd259c68ef6ebd41d63409a9ec4748ef1011f71de86ce.scope 1: org.freedesktop.systemd1.NoSuchUnit: Unit cri-containerd-0147b2970c411952d85fd259c68ef6ebd41d63409a9ec4748ef1011f71de86ce.scope not loaded.
2023/07/19 09:18:14 [adaptor/proxy] RemoveContainer: containerID:7e24553eca916714ab0ebeeed0e56933d7cfc7e23a753e82b9b6eb68f4e261f9
2023/07/19 09:18:14 [adaptor/proxy] RemoveContainer fails: rpc error: code = Internal desc = destroy cgroups Caused by: 0: failed to stop unit cri-containerd-7e24553eca916714ab0ebeeed0e56933d7cfc7e23a753e82b9b6eb68f4e261f9.scope 1: org.freedesktop.systemd1.NoSuchUnit: Unit cri-containerd-7e24553eca916714ab0ebeeed0e56933d7cfc7e23a753e82b9b6eb68f4e261f9.scope not loaded.
2023/07/19 09:18:14 [adaptor/proxy] RemoveContainer: containerID:20b8c21cdba09ec43e2efaef1460097b6aaf15944f4d484110a03f698aae1373
2023/07/19 09:18:14 [adaptor/proxy] RemoveContainer fails: rpc error: code = Internal desc = destroy cgroups Caused by: 0: failed to stop unit cri-containerd-20b8c21cdba09ec43e2efaef1460097b6aaf15944f4d484110a03f698aae1373.scope 1: org.freedesktop.systemd1.NoSuchUnit: Unit cri-containerd-20b8c21cdba09ec43e2efaef1460097b6aaf15944f4d484110a03f698aae1373.scope not loaded.
```

For reference, these are logs from a working delete:
```
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer: containerID:8e6893f2119545fc26f511eee4e850eb2f9ca50a205e30fb40987440062a9475
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer fails: rpc error: code = Internal desc = destroy cgroups Caused by: 0: failed to stop unit cri-containerd-8e6893f2119545fc26f511eee4e850eb2f9ca50a205e30fb40987440062a9475.scope 1: org.freedesktop.systemd1.NoSuchUnit: Unit cri-containerd-8e6893f2119545fc26f511eee4e850eb2f9ca50a205e30fb40987440062a9475.scope not loaded.
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer: containerID:8489770b0b57e217e958498381261f7dfae2c8225808879619c6a3a20f3e6c90
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer fails: rpc error: code = Internal desc = destroy cgroups Caused by: 0: failed to stop unit cri-containerd-8489770b0b57e217e958498381261f7dfae2c8225808879619c6a3a20f3e6c90.scope 1: org.freedesktop.systemd1.NoSuchUnit: Unit cri-containerd-8489770b0b57e217e958498381261f7dfae2c8225808879619c6a3a20f3e6c90.scope not loaded.
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer: containerID:243765a0b8b2c34d226fb5d2cf39eb619db290aa7ed63ef3a677541f3ea6577a
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer fails: rpc error: code = Internal desc = destroy cgroups Caused by: 0: failed to stop unit cri-containerd-243765a0b8b2c34d226fb5d2cf39eb619db290aa7ed63ef3a677541f3ea6577a.scope 1: org.freedesktop.systemd1.NoSuchUnit: Unit cri-containerd-243765a0b8b2c34d226fb5d2cf39eb619db290aa7ed63ef3a677541f3ea6577a.scope not loaded.
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer: containerID:3fee5da9022310dd154f723018868e1fc3c2296789656a4f0e44a3dc2283c313
2023/07/19 09:58:59 [adaptor/proxy] RemoveContainer fails: rpc error: code = Internal desc = destroy cgroups Caused by: 0: failed to stop unit cri-containerd-3fee5da9022310dd154f723018868e1fc3c2296789656a4f0e44a3dc2283c313.scope 1: org.freedesktop.systemd1.NoSuchUnit: Unit cri-containerd-3fee5da9022310dd154f723018868e1fc3c2296789656a4f0e44a3dc2283c313.scope not loaded.
2023/07/19 09:58:59 [adaptor/proxy] DestroySandbox
2023/07/19 09:58:59 [adaptor/proxy] DestroySandbox fails: rpc error: code = Internal desc = No such file or directory (os error 2)
2023/07/19 09:58:59 [adaptor/proxy] shutting down socket forwarder
2023/07/19 09:59:40 [adaptor/cloud/azure] deleted VM successfully: podvm-nginx-pv-3fee5da9
2023/07/19 10:00:10 [adaptor/cloud/azure] deleted disk successfully: podvm-nginx-pv-3fee5da9-disk
2023/07/19 10:00:14 [adaptor/cloud/azure] deleted network interface successfully: podvm-nginx-pv-3fee5da9-net
2023/07/19 10:00:14 [adaptor/cloud] failed to release PeerPod pod to PeerPod mapping not found
2023/07/19 10:00:14 [tunneler/vxlan] Delete tc redirect filters on eth0 and eth0 in the network namespace /var/run/netns/cni-1ea9b076-8a92-ef33-29ff-18487df74691
2023/07/19 10:00:15 [tunneler/vxlan] Delete vxlan interface vxlan1 in the network namespace /var/run/netns/cni-1ea9b076-8a92-ef33-29ff-18487df74691
```

Deleting containers manually with crictl doesn't work:
A containerd-shim-kata-v2 process is stuck:
Killing the process allows containers to be killed:
I suspect it's some kind of timing issue. We used to have a cloud-init bug in the Azure podvm image that made the VM slower to start, but since switching to a podvm image with that bug fixed, I haven't managed to reproduce this issue.
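
In case it helps anyone triaging this: in the two CAA log excerpts above, the `RemoveContainer fails` errors show up in both the stuck and the working case, so they don't seem to be the distinguishing signal. What differs is that the working delete goes on to `DestroySandbox` and the cloud provider's `deleted VM successfully` line, while the stuck one never gets there. Below is a rough Python sketch that checks a log excerpt for those stages. The function names `teardown_stages` and `looks_stuck` are made up for illustration, and the matching is based only on the message text in the excerpts above, not on any documented CAA log format.

```python
import re

# Zero-width split point in front of every "YYYY/MM/DD HH:MM:SS [" prefix,
# so this works whether entries are one per line or run together.
ENTRY_START = re.compile(r"(?=\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \[)")
# Captures the component tag (e.g. "adaptor/proxy") and the message.
ENTRY = re.compile(
    r"\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \[([^\]]+)\] (.*)", re.DOTALL
)


def teardown_stages(log_text):
    """Report which teardown stages appear in a CAA log excerpt."""
    stages = {
        "remove_container": False,
        "destroy_sandbox": False,
        "vm_deleted": False,
    }
    for chunk in ENTRY_START.split(log_text):
        match = ENTRY.match(chunk.strip())
        if match is None:
            continue  # text before the first timestamp, fences, etc.
        component, message = match.groups()
        if component == "adaptor/proxy":
            if message.startswith("RemoveContainer:"):
                stages["remove_container"] = True
            elif message.startswith("DestroySandbox"):
                stages["destroy_sandbox"] = True
        elif component.startswith("adaptor/cloud/"):
            if message.startswith("deleted VM successfully"):
                stages["vm_deleted"] = True
    return stages


def looks_stuck(log_text):
    """Heuristic: containers were removed but the sandbox was never destroyed."""
    stages = teardown_stages(log_text)
    return stages["remove_container"] and not stages["destroy_sandbox"]
```

Feeding it the first excerpt should flag it as stuck and the second should not. Keep in mind it is only a heuristic over one pod's log window; a delete that is still in progress would look the same as a stuck one.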