runc/crun, cgroups and CRIU

adrianreber commented 2 years ago

I am currently looking at a problem concerning CRIU and OCI containers. My understanding so far is the following:

I am creating a checkpoint with manage_cgroups not set. This means opts.manage_cgroups should be CG_MODE_DEFAULT, which is defined as #define CG_MODE_DEFAULT (CG_MODE_SOFT).
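To make the mode values easier to follow, here is a minimal sketch of the flags in Python. The numeric values are an assumption that should be checked against criu/include/cr_options.h; only CG_MODE_SOFT = 0x4 is corroborated by the "cgroups restore mode 0x4" line in the restore log quoted later in this post. The helper names (effective_mode, mode_name, parse_restore_mode) are hypothetical and not part of CRIU.

```python
import re

# Assumed to mirror the CG_MODE_* defines in CRIU's cr_options.h.
CG_MODE_IGNORE = 0 << 0
CG_MODE_NONE = 1 << 0
CG_MODE_PROPS = 1 << 1
CG_MODE_SOFT = 1 << 2
CG_MODE_FULL = 1 << 3
CG_MODE_STRICT = 1 << 4
CG_MODE_DEFAULT = CG_MODE_SOFT

MODE_NAMES = {
    CG_MODE_IGNORE: "ignore",
    CG_MODE_NONE: "none",
    CG_MODE_PROPS: "props",
    CG_MODE_SOFT: "soft",
    CG_MODE_FULL: "full",
    CG_MODE_STRICT: "strict",
}

# Matches the hex value in a line such as
# "cg: Preparing cgroups yard (cgroups restore mode 0x4)".
RESTORE_MODE = re.compile(r"cgroups restore mode (0x[0-9a-fA-F]+)")


def effective_mode(requested=None):
    """Model 'manage_cgroups not set' (None) falling back to the default."""
    return CG_MODE_DEFAULT if requested is None else requested


def mode_name(mode):
    """Return a readable name for a mode value, or 'unknown'."""
    return MODE_NAMES.get(mode, "unknown")


def parse_restore_mode(log_line):
    """Extract the restore mode from a CRIU log line, or None if absent."""
    match = RESTORE_MODE.search(log_line)
    return int(match.group(1), 16) if match else None
```

With these assumed values, an unset mode resolves to "soft", which is consistent with the 0x4 reported in the restore log.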

When creating a checkpoint, CRIU still records the cgroup information of the process in the container.

My understanding is that this should not be necessary, because the runtime (crun at least) will move the process into the new cgroup it created after restore. I think this is the only right approach: in the case of OCI containers, CRIU should not touch the cgroup settings. A restored container will run in a cgroup newly created by the container runtime (crun/runc).

With #define CG_MODE_DEFAULT (CG_MODE_IGNORE) I still get a cgroup.img, and core-1.img still references cgroups via "cg_set": 2.

The restore fails with:

(00.003375)      1: cg: Move into 2
(00.003391)      1: cg: setting cgns prefix to /machine.slice/libpod-dd47c09e12569883f67d88a5da89cbd2e1c450b2f3803087ee72e3a062a05186.scope/container
(00.003415)      1: Error (criu/cgroup.c:1092): cg: Can't move 1 into unifie//machine.slice/libpod-dd47c09e12569883f67d88a5da89cbd2e1c450b2f3803087ee72e3a062a05186.scope/container/cgroup.procs (-1/-1): Bad file descriptor
(00.003427)      1: Error (criu/cgroup.c:1148): cg: couldn't set cgns prefix unifie//machine.slice/libpod-dd47c09e12569883f67d88a5da89cbd2e1c450b2f3803087ee72e3a062a05186.scope/container/cgroup.procs: Bad file descriptor
(00.003431)      1: Error (criu/cgroup.c:1171): cg: failed preparing cgns

So there is still a bug somewhere in the code, because unifie//machine.slice does not look correct.

Using CRIU's manage_cgroups mode results in CG_MODE_SOFT and the restore works, but it does strange things. First of all, I see in the logs:

(00.001357) cg: Preparing cgroups yard (cgroups restore mode 0x4)
(00.001593) cg: Opening .criu.cgyard.cifCa8 as cg yard
(00.001613) cg:         Making controller dir .criu.cgyard.cifCa8/unifie ()
(00.001707) cg: Determined cgroup dir unifie/machine.slice/libpod-30325b748276c463e9f5e8db0f98662915f7372f7585287dcae81c8cd4d75636.scope/container already exist
(00.001713) cg: Skip restoring properties on cgroup dir unifie/machine.slice/libpod-30325b748276c463e9f5e8db0f98662915f7372f7585287dcae81c8cd4d75636.scope/container

This again looks wrong: the paths look malformed, and CRIU is still referencing the old cgroup path even though the restored container has a different ID and the container runtime created a new cgroup for it.
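A quick way to confirm that the restore is touching the old cgroup is to scan the restore log for libpod scope IDs that do not belong to the restored container. The following is a rough sketch, assuming the cgroup paths in the log always contain a libpod-<id>.scope component as in the excerpts above; the function name stale_cgroup_ids is hypothetical.

```python
import re

# Captures the container ID from path components like "libpod-<hex id>.scope".
LIBPOD_SCOPE = re.compile(r"libpod-([0-9a-f]+)\.scope")


def stale_cgroup_ids(log_text, restored_id):
    """Return libpod IDs found in 'cg:' log lines that differ from restored_id."""
    stale = set()
    for line in log_text.splitlines():
        # Only cgroup-related messages are of interest here.
        if "cg:" not in line:
            continue
        for found in LIBPOD_SCOPE.findall(line):
            if found != restored_id:
                stale.add(found)
    return stale
```

The function takes the log contents and the full ID of the restored container. A non-empty result means CRIU is still working with cgroup paths of the checkpointed container instead of the one created by the runtime.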

To reproduce:

podman run -d quay.io/adrianreber/counter
podman container checkpoint --latest --export /tmp/dump.tar -R -k
podman container restore -i /tmp/dump.tar -n new -k

The restore log of the container named new shows the messages quoted above. The log can be located with podman inspect -l --format "{{.State.RestoreLog}}".

So this is both a bug report that CRIU's cgroup handling is not correct and a question: should CRIU completely ignore the cgroup settings when used in combination with crun/runc? crun/runc will create a new cgroup for the new container and move the processes into it. Currently it does not seem possible to tell CRIU to completely ignore cgroups, even with CG_MODE_IGNORE.

@mihalicyn @avagin any ideas, suggestions or comments?

avagin commented 2 years ago

@adrianreber have you looked at the runsc code? I think we have the --cgroup-root option, and it has to be set to the container's cgroup root.

adrianreber commented 2 years ago

I have a possible fix in #1800 (works for me and Podman)

checkpoint-restore / criu

runc/crun, cgroups and CRIU #1793