checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

Process exits after successful restoration in kubernetes pod #2353

Closed · muvaf closed this 4 months ago

muvaf commented 4 months ago

Description

We're using criu with CRI-O with the latest changes, but I'm also working on another approach that does the dump and restore inside the container that's run by Kubernetes using containerd. However, even though the restore log reports success, the process exits right after the restore. The program I prepared for reproducing this is a simple Go program that prints its PID and a counter value.
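
The exact program is in reproduce.zip and isn't shown here, but a minimal equivalent of the counter (matching the output seen later in the thread) looks roughly like:

```go
// counter: prints its PID and an incrementing count once per second, so a
// successful restore is visible as the count continuing where the dump stopped.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	count := 0
	for {
		fmt.Printf("My pid is %d\nCount is %d\n", os.Getpid(), count)
		count++
		time.Sleep(time.Second)
	}
}
```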

I have a working case as well and added the steps for it. The main difference is that if I start the process in a separate terminal instead of letting k8s start it as usual, the dump I get for that process keeps running after restoration. Running it in a separate terminal assigns an SID and PGID different from 1, as I confirmed by inspecting the produced images with crit. I had thought the problem might be an SID mismatch, but in the non-working case the criu restore is already running as part of SID 1. It feels like it's related to the SID and/or attaching the stdin/stdout file descriptors, but I couldn't be sure.
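
For reference, one way to check those values from the dump images is to decode the process tree image, which carries pid/pgid/sid for every dumped task (the directory is whatever `-D` pointed at; `/root/images` here is just an example):

```sh
crit decode -i /root/images/pstree.img --pretty
```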

Folder to run all these commands: reproduce.zip

Steps to reproduce non-working case:

Steps to reproduce working case:

Describe the results you received: In the non-working case, the restore command succeeds, as seen in the restore logs, but the process never starts and the container exits - no output from the counter process is seen.

Describe the results you expected:

I'd have expected the process to run after restoration and print the counter values when the pod is started with the criu restore command. The working case is not very useful because it doesn't attach the process's logs to Kubernetes' log stream, and failure/success isn't reported back either, since it runs in a different session.

Additional information you deem important (e.g. issue happens only occasionally):

Consistently happening.

CRIU logs and information:

dump.log restore.log

Output of `criu --version`:

```
Version: 3.19
```

Output of `criu check --all`:

```
Warn  (criu/kerndat.c:1593): CRIU was built without libnftables support
Warn  (criu/kerndat.c:1243): Can't keep kdat cache on non-tempfs
Error (criu/cr-check.c:1223): UFFD is not supported
Error (criu/cr-check.c:1223): UFFD is not supported
Warn  (criu/cr-check.c:1346): Nftables based locking requires libnftables and set concatenations support
Looks good but some kernel features are missing which, depending on your process tree, may cause dump or restore failure.
```

Additional environment details:

adrianreber commented 4 months ago

Thanks for the very detailed report. Not sure we can actually help you.

I am not totally sure what you are looking for, but one thing I seem to understand is that it probably will never work the way you want. It seems to be important to you that the console output of the restored process is captured by Kubernetes in the end. If you exec into the container and do a restore, the console will never be the one attached to the Kubernetes log. It will always be attached to the shell you used to exec in.

You mention CRI-O, and as one of the main authors of the checkpoint/restore support in Kubernetes and CRI-O, I am wondering why you do not use that.

I guess you need to run CRIU as the entrypoint in your restore container. That way you might get the console output in Kubernetes. Does that make some sense to you?

muvaf commented 4 months ago

> It seems to be important to you that the console output of the restored process is captured by Kubernetes in the end. If you exec into the container and do a restore, the console will never be the one attached to the Kubernetes log. It will always be attached to the shell you used to exec in.

Yes, exec'ing to run the restore command was just a way to get it working somehow. It's not just the logs, actually: capturing the exit code and not having to maintain an up-and-running kubectl exec are also things I'd like to achieve by having the criu restore be part of the args that Kubernetes executes.

> You mention CRI-O, and as one of the main authors of the checkpoint/restore support in Kubernetes and CRI-O, I am wondering why you do not use that.

I'm trying this approach because GKE doesn't have the option to use CRI-O, only containerd, and provides little to no support for talking directly to the kubelet on the nodes. With the provided YAMLs (thanks to the great documentation you've got!) I was able to overcome a couple of problems by setting securityContext.privileged to true and mounting ns_last_pid to make sure I get a high PID that won't collide during restore in another container.

> I guess you need to run CRIU as the entrypoint in your restore container. That way you might get the console output in Kubernetes. Does that make some sense to you?

Yes, this is effectively what I'm trying to achieve, but it results in an exit right after the restore command without anything getting printed. This is how I'm starting the pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: to-restore
spec:
  containers:
  - name: main
    image: muvaf/criu-in-pod:v1.1-restore
    imagePullPolicy: Always
    command:
    - /bin/sh
    args:
    - -c
    - "criu restore -D /root/images --shell-job -v4"
    # - "sleep 999999"
    securityContext:
      privileged: true
    resources:
      limits:
        cpu: "1"
        memory: "1Gi"
    volumeMounts:
    - mountPath: /proc/sys/kernel/ns_last_pid
      name: checkpoint-cap
  volumes:
  - name: checkpoint-cap
    hostPath:
      path: /proc/sys/kernel/ns_last_pid
      type: File
```

The confusing part for me is that the restore command does report success, as you can see in the attached log file. So I'm guessing the process dies right after criu starts it, but I couldn't figure out why, as it doesn't print anything at all. The SID and PGID difference is the only thing that caught my attention between the working and non-working cases.

adrianreber commented 4 months ago

Unrelated, but mounting /proc/sys/kernel/ns_last_pid seems unnecessary. According to the logs, clone3() is used and not ns_last_pid.

Not sure how you could find out why the process dies. You probably need to reproduce it locally; then you would be able to see why the process dies. Maybe use post-resume action scripts to collect data from the container about why the process died. Not sure.
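
For the record, such a script would look roughly like the sketch below (the script path and what it collects are just placeholders). CRIU runs the script at every hook with the CRTOOLS_SCRIPT_ACTION environment variable set to the hook name, and the script is passed with --action-script:

```sh
#!/bin/sh
# Hypothetical /root/collect.sh, passed as: criu restore ... --action-script /root/collect.sh
# CRTOOLS_SCRIPT_ACTION names the hook CRIU is currently running.
if [ "$CRTOOLS_SCRIPT_ACTION" = "post-resume" ]; then
    # Grab whatever might explain the early exit before the container goes away.
    ls -l /proc/[0-9]*/fd > /tmp/post-resume-fds.txt 2>&1
    dmesg | tail -n 50 > /tmp/post-resume-dmesg.txt 2>&1
fi
```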

muvaf commented 4 months ago

> Unrelated, but mounting /proc/sys/kernel/ns_last_pid seems unnecessary. According to the logs, clone3() is used and not ns_last_pid.

Thanks! I did that to make sure the process gets a high PID so that I don't have a PID collision in the restored container, but you're right, it's not necessary. Now I tried starting the counter with /bin/sh -c counter but changed the restore pod to start directly with criu restore, so the restored process got the same PID of 7 👍
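
For reference, the pre-seeding trick being dropped here was just a write to that file before starting the process that gets checkpointed (the value is illustrative; the next PID allocated in that pid namespace is one higher than what's written):

```sh
echo 9000 > /proc/sys/kernel/ns_last_pid
```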

> You probably need to reproduce it locally; then you would be able to see why the process dies.

I just reproduced it with a docker container as well, but then changed the program to write to a file instead of stdout and it worked! So I think what's going on is that the restored Go program exits the moment it tries to write to stdout. Specifically, when I make it write to /tmp/logs.txt, the restored program continues from where it left off, given that I make that file available in the restore container as well. However, when I tell it to write to /proc/self/fd/1, it exits right after restore. Somehow criu itself is able to write to /proc/self/fd/1, as it prints all of its logs there, but the restored program can't. Does that problem sound familiar? I'll give inheritance of external files a shot to see if passing /proc/self/fd/1 as an inherited file helps.

avagin commented 4 months ago

For criu, process stdout and stderr are just pipes. CRIU restores them like any other resources, but it knows only about one end of each pipe, so it closes the other end. Actually, these pipes are external resources and CRIU needs some help to restore them properly:

Before executing criu dump, let's save the inode numbers of the process's stdout and stderr pipes:

```
readlink /proc/9288/fd/1 > /tmp/images/stdout-pipe &&
readlink /proc/9288/fd/2 > /tmp/images/stderr-pipe &&
criu dump -t 9288 --shell-job -D /tmp/images -v4 --leave-stopped --evasive-devices --display-stats
```

On restore, we need to specify what file descriptors should be used for these pipes:

```
$ criu restore -o restore.log -D /tmp/images --shell-job -v4 \
    --inherit-fd fd[1]:$(cat /tmp/images/stdout-pipe) \
    --inherit-fd fd[2]:$(cat /tmp/images/stderr-pipe)
My pid is 9288
Count is 1723
My pid is 9288
Count is 1724
My pid is 9288
Count is 1725
```

More info about restoring external file descriptors can be found here: https://criu.org/Inheriting_FDs_on_restore

muvaf commented 4 months ago

Thanks @avagin! That's what I figured too by reading how runc does it here - they're doing it exactly like your example, saving the pipe targets to a JSON file in the dump folder and reading it back during restore. I implemented it as well, saw success, and was about to share my findings here.

Example list of file descriptors of a process running inside a container (managed by runc-containerd-docker).

```
$ crit x . fds
9239
          0: /dev/null
          1: pipe[205347]
          2: pipe[205348]
          3: EVENTPOLL.5
          4: pipe[208089]
          5: pipe[208089]
        cwd: /
       root: /
```

> For criu, process stdout and stderr are just pipes.

However, I don't think it's always that straightforward. When the process is started in detached mode, that's what happens, since the container manager is the middleman between the user and the output of the process. But if you run the process directly (for example with runc run mycontainer, or just from your terminal), it does get assigned TTYs.

```
$ crit x . fds
2558
          0: TTY.2
          1: TTY.2
          2: TTY.2
          3: EVENTPOLL.4
          4: pipe[33822]
          5: pipe[33822]
        cwd: /tmp
       root: /
```

When criu restores this, it runs a notify action called orphan-pts-master that's only available in swrk mode, and only if you talk to criu swrk over a Unix socket connection - not over a file - because that's the only way to pass a file descriptor using SCM_RIGHTS. I was deep into trying to figure out why I don't get that notify with go-criu, and I think that's because the fds were pipes all along, but also because go-criu uses the file directly and hence doesn't process the out-of-band data coming with the message. I'm planning to open a PR to make go-criu use a Unix connection like runc does and support that notify action as well. Though it turns out none of that was needed in my case; as you pointed out, the process gets pipes instead of TTYs, but I got to discover that way later 🙂
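
To make the SCM_RIGHTS point concrete, here's a minimal Go sketch of the piece that goes missing when the socket is read through a plain file: receiving a descriptor over a Unix socket along with its ancillary data. The names are illustrative; this is not go-criu's actual code.

```go
package ptsrecv

import (
	"fmt"
	"net"
	"os"

	"golang.org/x/sys/unix"
)

// recvFd receives one file descriptor sent over conn with SCM_RIGHTS.
// Reading the same socket through an *os.File would return only the data
// byte and silently drop the out-of-band (ancillary) message carrying the fd.
func recvFd(conn *net.UnixConn) (*os.File, error) {
	buf := make([]byte, 1)
	oob := make([]byte, unix.CmsgSpace(4)) // room for one 32-bit fd

	_, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
	if err != nil {
		return nil, err
	}
	msgs, err := unix.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return nil, err
	}
	if len(msgs) == 0 {
		return nil, fmt.Errorf("no control message received")
	}
	fds, err := unix.ParseUnixRights(&msgs[0])
	if err != nil {
		return nil, err
	}
	return os.NewFile(uintptr(fds[0]), "pts-master"), nil
}
```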

For future readers, another problem was that when criu restores but detaches itself from the restored process, the container runtime may kill the process. I'm not sure yet whether proper handling of reparenting in my PID 1 would have helped, but keeping criu restore as the process's parent fixed that sudden exit. In the end, I was able to dump and restore a simple counter in Kubernetes from within the container; the final restore command is sketched below.
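
Roughly, the restore container's command ended up looking like this (paths are illustrative, and the pipe targets are assumed to have been saved next to the images at dump time, as in the earlier example):

```sh
# No --restore-detached: criu stays the restored task's parent and keeps
# running, so the container's main command doesn't exit right after restore.
criu restore -D /root/images --shell-job -v4 \
    --inherit-fd "fd[1]:$(cat /root/images/stdout-pipe)" \
    --inherit-fd "fd[2]:$(cat /root/images/stderr-pipe)"
```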

I'm closing the issue as there is a way to achieve this goal without any changes to criu.