checkpoint-restore / criu

Checkpoint/Restore tool

How to run CRIU inside a container to dump/restore a process? #2076

Open cyyzero opened 1 year ago

cyyzero commented 1 year ago

Description

I'm trying to integrate CRIU and apptainer, which is a popular container runtime in high-performance computing environments. The initial approach is to exec CRIU inside the container via some built-in commands. I encountered an error while trying to do a dump. My intuition tells me it might have something to do with /dev being external bind mounted into the container.

The dumped process redirected its stdin to /dev/null, and /dev/ is mounted from host.

The CRIU dump command is `criu dump --unprivileged --tree $PID --images-dir $IMG_DIR --work-dir $WORK_DIR --shell-job -v4 --log-file dump.log`.

CRIU logs and information:

CRIU full dump/restore logs:

```
(00.038535) Dumping opened files (pid: 57)
(00.038576) ----------------------------------------
(00.038600) Sent msg to daemon 71 0 0
pie: 57: __fetched msg: 71 0 0
pie: 57: __sent ack msg: 71 71 0
pie: 57: Daemon waits for command
(00.038653) Wait for ack 71 on daemon socket
(00.038660) Fetched ack: 71 71 0
(00.038677) 57 fdinfo 0: pos: 0 flags: 100000/0
(00.038834) Error (criu/files-reg.c:1817): Can't lookup mount=62 for fd=0 path=/dev/null
(00.038841) ----------------------------------------
(00.038882) Error (criu/cr-dump.c:1675): Dump files (pid: 57) failed with -1
(00.038905) Waiting for 57 to trap
(00.038938) Daemon 57 exited trapping
(00.038947) Sent msg to daemon 3 0 0
pie: 57: __fetched msg: 3 0 0
pie: 57: 57: new_sp=0x7f6644491888 ip 0x7f664455e388
(00.106847) 57 was trapped
(00.107052) 57 was trapped
(00.107062) 57 (native) is going to execute the syscall 15, required is 15
(00.107213) 57 was stopped
(00.107540) Unlock network
(00.108080) Unfreezing tasks into 1
(00.108089) Unseizing 57 into 1
(00.108111) Unseizing 58 into 1
(00.108139) Error (criu/cr-dump.c:2099): Dumping FAILED.
```
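With `-v4` the dump log gets long, so it can help to pull out only the error lines. The following is a minimal Python sketch, assuming the `(timestamp) Error (file:line): message` layout seen in the log above. The function name `criu_errors` is made up for illustration and is not part of CRIU.

```python
import re

# Matches lines such as:
# (00.038834) Error (criu/files-reg.c:1817): Can't lookup mount=62 for fd=0 path=/dev/null
ERROR_RE = re.compile(
    r"^\((?P<time>\d+\.\d+)\) Error \((?P<file>[^:()]+):(?P<line>\d+)\): (?P<msg>.*)$"
)


def criu_errors(log_text):
    """Return (source_file, line_number, message) for each error line in a CRIU log."""
    errors = []
    for raw in log_text.splitlines():
        match = ERROR_RE.match(raw.strip())
        if match:
            errors.append((match["file"], int(match["line"]), match["msg"]))
    return errors


# A few lines copied from the log in this issue.
sample = """\
(00.038677) 57 fdinfo 0: pos: 0 flags: 100000/0
(00.038834) Error (criu/files-reg.c:1817): Can't lookup mount=62 for fd=0 path=/dev/null
(00.038882) Error (criu/cr-dump.c:1675): Dump files (pid: 57) failed with -1
"""

for source, line, message in criu_errors(sample):
    print(f"{source}:{line}: {message}")
```

The first error reported is usually the one to look at; here it is the failed mount lookup for `/dev/null`.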

Output of `criu --version`:

```
Version: 3.17
GitID: v3.17-117-g50db2be1a
```

Additional environment details:

host mountinfo:

```sh
$ cat /proc/self/mountinfo
53 60 0:27 / /mnt/wsl rw,relatime shared:1 - tmpfs none rw
54 60 0:29 / /usr/lib/wsl/drivers ro,nosuid,nodev,noatime - 9p drivers ro,dirsync,aname=drivers;fmask=222;dmask=222,mmap,access=client,msize=65536,trans=fd,rfd=7,wfd=7
58 60 0:33 / /usr/lib/wsl/lib rw,relatime - overlay none rw,lowerdir=/gpu_lib_packaged:/gpu_lib_inbox,upperdir=/gpu_lib/rw/upper,workdir=/gpu_lib/rw/work
60 44 8:32 / / rw,relatime - ext4 /dev/sdc rw,discard,errors=remount-ro,data=ordered
61 60 0:2 /init /init rw - rootfs rootfs rw,size=4020640k,nr_inodes=1005160
62 60 0:5 / /dev rw,nosuid,relatime - devtmpfs none rw,size=4020668k,nr_inodes=1005167,mode=755
63 60 0:20 / /sys rw,nosuid,nodev,noexec,noatime - sysfs sysfs rw
64 60 0:38 / /proc rw,nosuid,nodev,noexec,noatime - proc proc rw
65 62 0:39 / /dev/pts rw,nosuid,noexec,noatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
66 60 0:40 / /run rw,nosuid,nodev - tmpfs none rw,mode=755
67 66 0:41 / /run/lock rw,nosuid,nodev,noexec,noatime - tmpfs none rw
68 66 0:42 / /run/shm rw,nosuid,nodev,noatime - tmpfs none rw
69 66 0:43 / /run/user rw,nosuid,nodev,noexec,noatime - tmpfs none rw,mode=755
70 64 0:28 / /proc/sys/fs/binfmt_misc rw,relatime - binfmt_misc binfmt_misc rw
71 63 0:44 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,mode=755
72 71 0:45 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw
73 71 0:46 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset
74 71 0:47 / /sys/fs/cgroup/cpu rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu
75 71 0:48 / /sys/fs/cgroup/cpuacct rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuacct
76 71 0:49 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
77 71 0:26 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
78 71 0:50 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
79 71 0:51 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
80 71 0:52 / /sys/fs/cgroup/net_cls rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls
81 71 0:53 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event
82 71 0:54 / /sys/fs/cgroup/net_prio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_prio
83 71 0:55 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,hugetlb
84 71 0:56 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,pids
85 71 0:57 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,rdma
86 71 0:58 / /sys/fs/cgroup/misc rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,misc
131 60 0:59 / /mnt/c rw,noatime - 9p drvfs rw,dirsync,aname=drvfs;path=C:\;uid=1000;gid=1000;symlinkroot=/mnt/,mmap,access=client,msize=262144,trans=virtio
132 60 0:60 / /mnt/d rw,noatime - 9p drvfs rw,dirsync,aname=drvfs;path=D:\;uid=1000;gid=1000;symlinkroot=/mnt/,mmap,access=client,msize=262144,trans=virtio
133 60 8:32 /var/lib/docker /var/lib/docker rw,relatime shared:2 - ext4 /dev/sdc rw,discard,errors=remount-ro,data=ordered
```
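The error in the dump log refers to `mount=62`, and that ID can be looked up in the mountinfo above. Below is a minimal Python sketch that parses mountinfo text using the field layout documented in proc(5) and looks up one mount ID. The helper name `parse_mountinfo` and the choice of fields to keep are illustrative assumptions, not something CRIU provides.

```python
def parse_mountinfo(text):
    """Map mount ID -> selected fields of each /proc/<pid>/mountinfo line."""
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # blank or malformed line
        try:
            # Optional fields (e.g. "shared:1") start at index 6 and end at "-".
            sep = fields.index("-", 6)
        except ValueError:
            continue
        mounts[int(fields[0])] = {
            "parent": int(fields[1]),
            "root": fields[3],
            "mount_point": fields[4],
            "fstype": fields[sep + 1],
            "source": fields[sep + 2],
        }
    return mounts


# Three lines copied from the host mountinfo in this issue.
host = """\
60 44 8:32 / / rw,relatime - ext4 /dev/sdc rw,discard,errors=remount-ro,data=ordered
62 60 0:5 / /dev rw,nosuid,relatime - devtmpfs none rw,size=4020668k,nr_inodes=1005167,mode=755
65 62 0:39 / /dev/pts rw,nosuid,noexec,noatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
"""

mounts = parse_mountinfo(host)
print(mounts.get(62))
```

On the host, ID 62 is the devtmpfs mounted at `/dev`, which fits the suspicion that the bind-mounted `/dev` is involved.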

container mountinfo:

(Output is too long, so I upload a text file.) [container-mountinfo.txt](https://github.com/checkpoint-restore/criu/files/10578553/container-mountinfo.txt)

adrianreber commented 1 year ago

> My intuition tells me it might have something to do with /dev being external bind mounted into the container.

That sounds right. All mounts from the outside of the container need to be marked as external. Running CRIU in an OCI container (Docker/Podman) usually works without any additional parameters as all the mounts are usually set up correctly.

For runc/crun checkpointing, all external mount points into the container need to be part of the container configuration. Usually that is config.json. runc/crun marks all external mounts before calling CRIU and so CRIU knows about them.

First try, based on your information, would probably be to mark /tmp/rootfs-4249390345/root/dev as external. But there are a lot of those mounts from the outside. So if you have a way to ask apptainer for those mounts that would make it easier for you.
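As a rough illustration of marking mounts as external, here is a minimal Python sketch that turns a list of mount points into `criu dump` arguments. It assumes the `--external mnt[MOUNTPOINT]:NAME` form described in the criu(8) man page; the helper name and the `extN` identifiers are hypothetical, and the exact mount point CRIU expects depends on how the dumped tree sees its mounts, so check this against your CRIU version.

```python
def external_mount_args(mount_points):
    """Build `--external mnt[MOUNTPOINT]:NAME` arguments for `criu dump`."""
    args = []
    for index, mount_point in enumerate(mount_points):
        # Hypothetical identifier; the same name has to be mapped again on restore.
        name = f"ext{index}"
        args += ["--external", f"mnt[{mount_point}]:{name}"]
    return args


# The path suggested above; a real list would come from apptainer
# or from the container's mountinfo.
print(external_mount_args(["/tmp/rootfs-4249390345/root/dev"]))
```

The resulting list would be appended to the existing `criu dump` command line, with one `--external` pair per mount.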

wenhuizhang commented 1 year ago

@adrianreber would entering the namespace with `nsenter -t container_PID --net bash` and mounting from inside the namespace work?

adrianreber commented 1 year ago

> @adrianreber will go into the namespace using "nsenter -t container_PID --net bash", and mount from inside of the namespace work?

This just enters the network namespace, right? Not sure how that would help.

There are container engines (like Docker and Podman) which enable you to run CRIU inside of the container. Please take a look at those and see how they are set up. It would be nice if this also worked in apptainer, although I am not sure that is possible. It seems like apptainer exposes a lot of host directories into the container because that is important for MPI applications (at least that is what I remember about it; I have never used it myself). As already said, it might work with a very long list of directories marked as external, based on the mountinfo.

If your goal is to integrate CRIU into apptainer, I am not sure there is value in figuring out how to run CRIU in the container. There are multiple container runtimes out there which support checkpointing of containers: crun, lxc, runc, youki (partially). I think taking a look at those and seeing whether you can do the same in apptainer would be a good approach.

wenhuizhang commented 1 year ago

Thanks for the list of CRIU-compatible container runtimes, will stick to runc then. ;)

cyyzero commented 1 year ago

Thank you for your suggestion! Apptainer is a container runtime for multi-user scenarios, which follows from the way HPC clusters are used, so the integration needs to take security into account. Inspired by this report, I decided to run CRIU inside the container, limit its scope (with the PID namespace enabled), and use the --unprivileged option so that it does not run as root.

wenhuizhang commented 1 year ago

It seems that K8s migration with CRIU does not call `criu restore` through runc, but instead calls create and start in runc? Are there patches to runc in this repo that make the create and start path call `criu restore`?

adrianreber commented 1 year ago

> It seems like K8s with CRIU's migration did not call criu restore in runc, however it calls create and start in runc? Is there some patches to runc, to get the create and start process calls criu restore in this repo please?

There is something you misunderstood. On the runc level it is a normal restore.

wenhuizhang commented 1 year ago

> > It seems like K8s with CRIU's migration did not call criu restore in runc, however it calls create and start in runc? Is there some patches to runc, to get the create and start process calls criu restore in this repo please?
>
> There is something you misunderstood. On the runc level it is a normal restore.

Thanks so much for your instructions. Are there any hints about patches for this part at the containerd or K8s layer? I would like to learn how containerd restores the container (for example, how the fs-diff is applied to the rootfs and how the images are restored into the process context).

adrianreber commented 1 year ago

It is all part of CRI-O. No external patches necessary.

wenhuizhang commented 1 year ago

> It is all part of CRI-O. No external patches necessary.

Got it, thanks!

wenhuizhang commented 1 year ago

> It is all part of CRI-O. No external patches necessary.

I listened to your talk last week. Any hints on how to try out the PoC examples from the talk? Could you perhaps share the demo scripts in a test folder of the criu repo?

adrianreber commented 1 year ago

> > It is all part of CRI-O. No external patches necessary.
>
> listened to your talk last week, any hints on how to try out the POC examples in the talk please, maybe share the scripts for the demo in a test folder of criu repo please?

Not really relevant here, but there is nothing special. Just start a container in Kubernetes and for checkpointing please see:

https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/
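For orientation, that post triggers a checkpoint by sending a POST request to the kubelet's `/checkpoint/<namespace>/<pod>/<container>` endpoint. The Python sketch below only builds such a URL; the helper name, the default host and port, and the sample pod and container names are assumptions for illustration, and an actual request also needs kubelet credentials and the checkpoint feature enabled on the node.

```python
from urllib.parse import quote


def checkpoint_url(namespace, pod, container, host="localhost", port=10250):
    """Build the kubelet checkpoint URL for one container (to be sent as a POST)."""
    path = "/".join(quote(part, safe="") for part in (namespace, pod, container))
    return f"https://{host}:{port}/checkpoint/{path}"


# Hypothetical namespace, pod and container names.
print(checkpoint_url("default", "counters", "counter"))
```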

wenhuizhang commented 1 year ago

Thanks :)

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.