containerd / containerd

An open and reliable container runtime
https://containerd.io
Apache License 2.0

Containerd restart failed with message "failed to recover state: failed to reserve container name xxx: name xxx is reserved for xxx" #7247

Open payall4u opened 2 years ago

payall4u commented 2 years ago

Description

We use kubelet with containerd. Restarting containerd fails because the CRI plugin finds a duplicate container name:

Aug 02 22:56:05 VM-0-29-centos containerd[36948]: time="2022-08-02T22:56:05.989055410+08:00" level=fatal msg="Failed to run CRI service" error="failed to recover state: failed to reserve container name \"kube-proxy_kube-proxy-m28fw_kube-system_0df69e5f-4355-4f99-bfd0-d1c6b2f935aa_0\": name \"kube-proxy_kube-proxy-m28fw_kube-system_0df69e5f-4355-4f99-bfd0-d1c6b2f935aa_0\" is reserved for \"73cc6d80cd6602e5ff53fd62db85cf09ecc8fe12b9effe753c404bf45750842a\""
Aug 02 22:56:05 VM-0-29-centos systemd[1]: containerd.service: Main process exited, code=exited, status=1/FAILURE

It's easy to reproduce: just fill up the disk.

On restart, the CRI plugin loads all existing containers. If a container fails to load, it is skipped rather than recovered. Kubelet then creates a replacement container/sandbox with the same name and restartCount, because kubelet reads the restartCount from an annotation on the existing container. On the next restart, loading the previously skipped container succeeds, so CRI finds two containers claiming the same name and exits fatally.

Steps to reproduce the issue

  1. Find an ordinary node with containerd as the runtime
  2. Fill the disk with a command such as dd if=/dev/zero of=file bs=1M count=1024
  3. Restart containerd; the log shows "Failed to load container xxx error=failed to checkpoint status to xxx/.tmp-status106398678: no space left on device"
  4. Kubelet will create new containers with the same names
  5. rm file
  6. systemctl restart containerd; containerd now fails with the fatal error above (these steps are consolidated into a script below)
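
A consolidated sketch of the reproduction, assuming a systemd-managed containerd on a Kubernetes node; the dd size is illustrative and may need repeating until the filesystem backing containerd is actually full:

```bash
# Fill the filesystem that backs containerd's state.
dd if=/dev/zero of=file bs=1M count=1024

# First restart: the CRI plugin fails to checkpoint container status
# (no space left on device) and skips the container instead of recovering it.
systemctl restart containerd
journalctl -u containerd | grep "Failed to load container"

# Kubelet now recreates the container under the same name and restartCount.
# Free the disk and restart again: loading the previously skipped container
# succeeds, the name is already reserved, and containerd exits fatally.
rm file
systemctl restart containerd
journalctl -u containerd | grep "failed to reserve container name"
```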

Describe the results you received and expected

Noop

What version of containerd are you using?

1.4.3

Any other relevant information

None

Show configuration if it is related to CRI plugin.

None

buccella commented 2 years ago

Looks similar to #6095

Goend commented 1 year ago

@payall4u Hi, may I ask whether a less destructive fix for this problem was ever found? Everything I've seen suggests wiping containerd's data.

cardyok commented 1 year ago

Since containerd stores persistent data in boltdb and relies on it during restart, data inconsistency is bound to happen if containers are created while the disk is full. I don't think there is anything containerd can do to accommodate the disk-full scenario.
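
For reference, a sketch of where this persistent state lives, assuming containerd's default root directory (the paths move if root is overridden in /etc/containerd/config.toml):

```bash
# The metadata boltdb that containerd consults during restart.
ls -lh /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db

# The CRI plugin's per-container status checkpoints; the ".tmp-status"
# writes that fail with "no space left on device" land here.
ls /var/lib/containerd/io.containerd.grpc.v1.cri/containers/
```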

Goend commented 1 year ago

@cardyok So your suggestion is that, after expanding the disk, we clean up the data to rebuild boltdb and the rest, is that right? I still feel this isn't a problem users should have to solve by wiping data; once the disk has been expanded, containerd ought to run normally.

payall4u commented 1 year ago

There is a trick:

Update the containerd config to disable the CRI plugin.

Then containerd can be restarted successfully. Use ctr -n k8s.io c ls to find the container holding the reserved name, and remove it.
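
A sketch of that sequence, assuming the default config at /etc/containerd/config.toml and the 1.x plugin name "cri"; the container ID below is the one named in the fatal log:

```bash
# 1. Disable the CRI plugin, then restart containerd.
#    In /etc/containerd/config.toml set:
#      disabled_plugins = ["cri"]
systemctl restart containerd

# 2. List containers in the k8s.io namespace and delete the one that
#    holds the reserved name (the fatal log prints its ID).
ctr -n k8s.io c ls
ctr -n k8s.io c rm 73cc6d80cd6602e5ff53fd62db85cf09ecc8fe12b9effe753c404bf45750842a

# 3. Re-enable the CRI plugin (drop "cri" from disabled_plugins) and
#    restart containerd one more time.
systemctl restart containerd
```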

Goend commented 1 year ago

@payall4u OK, I get it, thanks.