checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

Using CRIU with nested LXC containers #2426

Open alexfrolov opened 5 days ago

alexfrolov commented 5 days ago

Hi!

I am considering using CRIU for checkpoint/restore of nested LXC containers. Do I understand correctly that in this case CRIU should be called inside the parent container?

For example, as far as I understand, CRIU uses the mount namespace of the CRIU process (/proc/self/ns/mnt) to resolve external mount points for the target process, so CRIU won't be able to work from the host level if the target process is running in a nested container. Is that correct?

Thanks, Alex
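
A minimal sketch for checking the premise of the question above: whether the shell CRIU would be launched from shares a mount namespace with the target process. The helper name `same_ns` is made up for illustration. It only compares the namespace symlinks under /proc and says nothing about what CRIU does internally.

```shell
# same_ns LINK_A LINK_B
# Succeeds when both namespace symlinks point at the same namespace.
# Returns 2 if either link cannot be read (for example, no permission).
same_ns() {
    ns_a=$(readlink "$1") || return 2
    ns_b=$(readlink "$2") || return 2
    [ "$ns_a" = "$ns_b" ]
}

# Example (APP_PID is the target's PID as seen from this shell):
#   same_ns /proc/self/ns/mnt "/proc/$APP_PID/ns/mnt" && echo "same mount namespace"
```

Reading another process's /proc/PID/ns/mnt link usually requires root, so run it with the same privileges you would give CRIU.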

avagin commented 5 days ago

For LXC, the auto mode for external mounts will not work; you need to enumerate them manually. But you still need to call CRIU from the parent container to dump the target pid namespace properly. Otherwise, it will look like two nested pid namespaces.

I don't recommend using CRIU directly to dump/restore LXC containers. It should be easier to use the LXC tools for that.
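
To see the pid namespace nesting mentioned above from a given vantage point, one option is the NSpid line in /proc/PID/status (available since Linux 4.1). It lists the process's PID in each pid namespace, starting with the namespace of the procfs you are reading. The sketch below just counts those entries; `pidns_depth` is a hypothetical helper name.

```shell
# pidns_depth [STATUS_FILE]
# Prints how many pid namespaces the process is visible in, counted from
# the pid namespace of the procfs being read. Reads stdin if no file is given.
pidns_depth() {
    awk '$1 == "NSpid:" { print NF - 1 }' "$@"
}

# Example (APP_PID is the container init's PID as seen from this shell):
#   pidns_depth "/proc/$APP_PID/status"
```

For the init of a nested container, this would be expected to print a larger number on the host than inside the parent container, which is one way to confirm where you are running from.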

alexfrolov commented 4 days ago

Just to clarify about auto-detection of external mounts.

When I run a non-nested LXC container, it uses external mount points at least for the rootfs (besides the /proc entries, which extensively use fuse.lxcfs). For example, for a Xenial-based container, /proc/1/mountinfo looks like this:

root@u3:/home/ubuntu# cat /proc/1/mountinfo | grep master
690 619 253:1 /u3/rootfs / rw,relatime master:322 - ext4 /dev/mapper/dummy--vg-dummy--lv rw
696 697 0:34 / /sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime master:18 - fusectl fusectl rw
700 692 0:42 /proc/cpuinfo /proc/cpuinfo rw,nosuid,nodev,relatime master:154 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
701 692 0:42 /proc/diskstats /proc/diskstats rw,nosuid,nodev,relatime master:154 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
702 692 0:42 /proc/loadavg /proc/loadavg rw,nosuid,nodev,relatime master:154 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
703 692 0:42 /proc/meminfo /proc/meminfo rw,nosuid,nodev,relatime master:154 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
761 692 0:42 /proc/slabinfo /proc/slabinfo rw,nosuid,nodev,relatime master:154 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
762 692 0:42 /proc/stat /proc/stat rw,nosuid,nodev,relatime master:154 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
763 692 0:42 /proc/swaps /proc/swaps rw,nosuid,nodev,relatime master:154 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
909 692 0:42 /proc/uptime /proc/uptime rw,nosuid,nodev,relatime master:154 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
910 697 0:42 /sys/devices/system/cpu /sys/devices/system/cpu rw,nosuid,nodev,relatime master:154 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
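
The same listing can be condensed to just the fields that matter here. This is a small sketch, not part of CRIU: it assumes the standard mountinfo layout from proc(5), where optional fields such as master:N sit between the mount options and the "-" separator, and it prints the mount point, the master peer group, and the filesystem type. The name `list_slave_mounts` is made up.

```shell
# list_slave_mounts [MOUNTINFO_FILE]
# Prints "<mount point> <master:N> <fstype>" for every mount that has a
# master peer group. Reads stdin if no file is given.
list_slave_mounts() {
    awk '{
        master = ""
        # Optional fields start at field 7 and end at the "-" separator.
        for (i = 7; i <= NF && $i != "-"; i++)
            if ($i ~ /^master:/) master = $i
        # After the loop, field i is "-" and field i+1 is the fstype.
        if (master != "") print $5, master, $(i + 1)
    }' "$@"
}

# Example, run inside the container:
#   list_slave_mounts /proc/1/mountinfo
```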

So, this container can be easily dumped by the following command (APP_PID is the PID of /sbin/init in the host's pid ns):

sudo /home/ubuntu/criu/criu/criu dump \
    --tcp-established --file-locks --link-remap \
    --manage-cgroups=full \
    --ext-mount-map auto --enable-external-sharing --enable-external-masters \
    --enable-fs hugetlbfs --enable-fs tracefs \
    -D /tmp/checkpobint-u3 -o /tmp/checkpobint-u3/dump.log \
    --cgroup-root cpuset,cpu,io,memory,hugetlb,pids,rdma,misc:lxc.payload.u3 \
    -v4 \
    --ext-mount-map /sys/fs/fuse/connections:sys/fs/fuse/connections \
    -t $APP_PID --skip-in-flight \
    --freeze-cgroup /sys/fs/cgroup/lxc.payload.u3 --force-irmap

So AFAIU, the option --ext-mount-map auto looks pretty usable for LXC containers with shared parent mount points, or am I missing something? BTW, what is manual enumeration of the mount points?

I agree that calling criu directly for c/r of LXC containers is not the best thing to do, but I want to understand how things work here...

Thank you!

avagin commented 1 day ago

> So AFAIU, the option --ext-mount-map auto looks pretty usable for LXC containers with shared parent mount points, or am I missing something? BTW, what is manual enumeration of the mount points?

It works well until you need to restore. I don't remember the details; it has been a long time since I last used it.

> I agree that calling criu directly for c/r of LXC containers is not the best thing to do, but I want to understand how things work here...

When LXC starts a container, it creates a mount namespace for it and mounts the rootfs and a few other mounts that depend on the container configuration (fusefs proc, external volumes, etc.). For CRIU, all these mounts will be external. Only LXC knows how to properly mount them on restore. I don't know where this code is in LXC, but you can look at runc; it should be similar:
https://github.com/opencontainers/runc/blob/main/libcontainer/criu_linux.go#L108
https://github.com/opencontainers/runc/blob/main/libcontainer/criu_linux.go#L490
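
As a rough illustration of what manual enumeration could look like from the command line: the sketch below turns a list of external mounts into --ext-mount-map arguments. The dump-side form (mountpoint:key) matches the /sys/fs/fuse/connections mapping in the dump command earlier in the thread. The restore-side form (key:host path) is my understanding of the CRIU external bind mount documentation, so treat it as an assumption and verify it against your CRIU version. The helper name `ext_mount_args`, the convention of deriving the key by stripping the leading slash, and the input format are all made up for this example.

```shell
# ext_mount_args MODE
# Reads "<mount point in container> <path on host>" pairs from stdin and
# prints one --ext-mount-map argument pair per line.
#   MODE=dump    -> --ext-mount-map <mount point>:<key>
#   MODE=restore -> --ext-mount-map <key>:<path on host>   (assumed form)
ext_mount_args() {
    mode=$1
    while read -r mountpoint host_path; do
        [ -n "$mountpoint" ] || continue
        key=${mountpoint#/}    # key = mount point without the leading slash
        case $mode in
            dump)    printf '%s %s:%s\n' --ext-mount-map "$mountpoint" "$key" ;;
            restore) printf '%s %s:%s\n' --ext-mount-map "$key" "$host_path" ;;
            *)       return 1 ;;
        esac
    done
}

# Example:
#   echo "/sys/fs/fuse/connections /sys/fs/fuse/connections" | ext_mount_args dump
```

The rootfs itself is not covered by this sketch. On restore, whoever drives CRIU (LXC, or runc in the linked code) has to prepare the host-side paths before CRIU can bind them, which is the part only the container runtime knows how to do.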