checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

Cannot checkpoint container: runc did not terminate successfully with mount error #2271

Closed liunan-ms closed 10 months ago

liunan-ms commented 10 months ago

Description:

Hello,

I'm trying to get container checkpoints working on an AKS node, but when I try to checkpoint a simple example container with the containerd client, I get a CRIU error.

Steps to reproduce the issue:

  1. Deploy a workload pod with this yaml file
  2. Get the container id in the workload pod: kubectl describe pod workload
  3. Copy the checkpoint utility to /host and change the root directory to /host, then run /checkpoint with container id:
    cp /checkpoint /host
    chroot /host /checkpoint ${container id}

Describe the results you received:

    Attempting to open containerd client connection...
    Successfully opened containerd client connection!
    Checkpointing container 8ea494484c96cabe56c5595fda1e3cf3555bf5ecbc1876c9680f0c7a2f60d00f...
    2023/09/27 17:05:31 /usr/bin/runc did not terminate successfully: exit status 1: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v2.task/k8s.io/8ea494484c96cabe56c5595fda1e3cf3555bf5ecbc1876c9680f0c7a2f60d00f/criu-dump.log: unknown

Describe the results you expected:

    Attempting to open containerd client connection...
    Successfully opened containerd client connection!
    Checkpointing container 8ea494484c96cabe56c5595fda1e3cf3555bf5ecbc1876c9680f0c7a2f60d00f...
    Checkpoint created

Additional information you deem important (e.g. issue happens only occasionally): /etc/criu/default.conf and /etc/criu/runc.conf don't exist.

CRIU logs and information:

CRIU full dump/restore logs:

[criu-dump.log](https://github.com/liunan-ms/Wormhole/blob/main/logs/criu-dump.log)
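For anyone triaging a similar failure, here is a minimal sketch of pulling the log path out of the runc error and listing the `Error` lines from a CRIU log. It assumes CRIU log lines look like `(timestamp) Error (file:line): message`, optionally with a PID after the timestamp. The helper names `criu_log_path` and `extract_errors` are hypothetical and not part of CRIU, runc, or containerd.

```python
import re

# Assumed CRIU log line shape: "(00.012345) Error (criu/file.c:123): message",
# optionally with "<pid>: " between the timestamp and "Error".
ERROR_LINE = re.compile(
    r"^\((?P<ts>\d+\.\d+)\)\s+(?:\d+:\s+)?Error\s+\((?P<src>[^)]+)\):\s*(?P<msg>.*)$"
)


def criu_log_path(runc_error):
    """Return the criu-dump.log path embedded in a runc error string, or None."""
    match = re.search(r"(/\S*criu-dump\.log)", runc_error)
    return match.group(1) if match else None


def extract_errors(log_text):
    """Return (timestamp, source location, message) for each Error line."""
    errors = []
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            errors.append((match["ts"], match["src"], match["msg"]))
    return errors
```

Typical use would be to pass the runc error text to `criu_log_path`, read that file on the node, and feed its contents to `extract_errors` to see which CRIU source location reported the failure.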

Output of `criu --version`:

```
Version: 3.18
```

Output of `criu check --all`:

```
Looks good.
```

Additional environment details:

containerd --version:

    containerd github.com/containerd/containerd 1.6.22 8165feabfdfe38c65b599c4993d227328c231fca

runc --version:

    runc version 1.1.9
    commit: ccaecfcbc907d70a7aa870a6650887b901b25b82
    spec: 1.0.2-dev
    go: go1.20.7
    libseccomp: 2.5.3

adrianreber commented 10 months ago

I am a bit confused about your report.

What is /host and what is /checkpoint? /host sounds like a path, but what is /checkpoint? You talk about "the checkpoint utility". Is that CRIU? But then you run /checkpoint ${container id}, which cannot work because CRIU does not know about container IDs. You also mention runc errors, which also means you are not calling CRIU directly.

Why do you use chroot? This is all very confusing and it is not clear what you are trying to do.

If you are trying to checkpoint a containerd container you should use the containerd interface to do it. How to do it is a question for containerd and not CRIU.

If you are trying to checkpoint a Kubernetes container you can take a look at: https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/

Maybe the first step would be to explain what you are trying to do and why you are doing it this way. From my point of view, what you are trying to do does not make much sense, but maybe I am just misunderstanding it.

What is AKS?

liunan-ms commented 10 months ago

Sorry for the confusion! I added the link to the checkpoint utility code, which uses the containerd API to checkpoint and restore containers from a source VM to a destination VM. I have a server that calls the checkpoint utility to checkpoint the container running in a separate pod. The server pod is mounted on /host, so I change the root directory to /host and run the checkpoint utility.

I figured out why CRIU has this mount error: both the server pod and the workload pod, in which the to-be-migrated container is running, are mounted on the same path, so there is a conflict when CRIU tries to checkpoint.

AKS is Azure Kubernetes Service, which I used to create my VMs.
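For readers hitting a similar mount conflict, here is a rough sketch of one way to look for it. It assumes the conflict shows up as two mounts stacked on the same mount point in `/proc/<pid>/mountinfo`; the thread does not show the actual mount table, so this is only a guess at how to check. The function names are hypothetical.

```python
from collections import defaultdict


def parse_mountinfo(text):
    """Parse /proc/<pid>/mountinfo text into dicts with the fields used below."""
    mounts = []
    for line in text.splitlines():
        # Optional fields end at a lone "-" separator; fstype and source follow it.
        left, sep, right = line.partition(" - ")
        if not sep:
            continue  # blank or malformed line
        fields = left.split()
        fs_fields = right.split()
        if len(fields) < 6 or len(fs_fields) < 2:
            continue
        mounts.append({
            "mount_id": int(fields[0]),
            "parent_id": int(fields[1]),
            "root": fields[3],
            "mount_point": fields[4],
            "fstype": fs_fields[0],
            "source": fs_fields[1],
        })
    return mounts


def shared_mount_points(mounts):
    """Return mount points that have more than one mount on them."""
    by_path = defaultdict(list)
    for mount in mounts:
        by_path[mount["mount_point"]].append(mount["mount_id"])
    return {path: ids for path, ids in by_path.items() if len(ids) > 1}
```

Reading `/proc/<pid>/mountinfo` for the container's init process and passing the text through both functions would list any path that carries more than one mount, which is a reasonable first thing to compare against the mount errors in criu-dump.log.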

Thanks for your reply and I will close this issue.