Closed: muvaf closed this issue 4 months ago.
Thanks for the very detailed report. Not sure we can actually help you.
I am not totally sure what you are looking for, but one thing which I seem to understand is that it probably will never work the way you want. It seems to be important to you that the console output of the restored process is captured by Kubernetes in the end. If you exec into the container and do a restore the console will never be the one attached to the Kubernetes log. It will always be attached to the shell you use to exec into.
You are mentioning CRI-O, and I, as one of the main authors of the checkpoint/restore support in Kubernetes and CRI-O, am wondering why you do not use that.
I guess you need to run CRIU as the entrypoint in your restore container. That way you might get the console output in Kubernetes. Does that make some sense to you?
> It seems to be important to you that the console output of the restored process is captured by Kubernetes in the end. If you exec into the container and do a restore the console will never be the one attached to the Kubernetes log. It will always be attached to the shell you use to exec into.
Yes, exec'ing to run the restore command was just a way to get it working somehow. It's not just logs, actually; capturing the exit code and not having to keep a `kubectl exec` session up and running are also things I'd like to achieve by having the `criu restore` command as part of the `args` that Kubernetes executes.
> You are mentioning CRI-O, and I, as one of the main authors of the checkpoint/restore support in Kubernetes and CRI-O, am wondering why you do not use that.
I'm trying this approach because GKE doesn't have the option to use CRI-O, only containerd, and provides little to no support for talking directly to the kubelet on the nodes. In the provided YAMLs (thanks to the great documentation you've got!) I was able to overcome a couple of problems by setting `securityContext.privileged` to `true` and mounting `ns_last_pid` to make sure I get a high PID that won't collide during restore in another container.
> I guess you need to run CRIU as the entrypoint in your restore container. That way you might get the console output in Kubernetes. Does that make some sense to you?
Yes, this is effectively what I'm trying to achieve, but it results in an exit right after the restore command without printing anything. This is how I'm starting the pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: to-restore
spec:
  containers:
  - name: main
    image: muvaf/criu-in-pod:v1.1-restore
    imagePullPolicy: Always
    command:
    - /bin/sh
    args:
    - -c
    - "criu restore -D /root/images --shell-job -v4"
    # - "sleep 999999"
    securityContext:
      privileged: true
    resources:
      limits:
        cpu: "1"
        memory: "1Gi"
    volumeMounts:
    - mountPath: /proc/sys/kernel/ns_last_pid
      name: checkpoint-cap
  volumes:
  - name: checkpoint-cap
    hostPath:
      path: /proc/sys/kernel/ns_last_pid
      type: File
```
The confusing part for me is that the restore command does report success, as you can see in the attached log file. So I'm guessing the process dies right after criu starts it, but I couldn't figure out why, as it doesn't print anything at all. The difference in SID and PGID is the only thing that caught my attention between the working and non-working cases.
Unrelated, but mounting `/proc/sys/kernel/ns_last_pid` seems unnecessary. According to the logs, `clone3()` is used and not ns_last_pid.
Not sure how you could find out why the process dies. You probably need to reproduce it locally and then you would be able to see why the process dies. Maybe use post-resume action scripts to collect data from the container about why the process died. Not sure.
> Unrelated, but mounting `/proc/sys/kernel/ns_last_pid` seems unnecessary. According to the logs, `clone3()` is used and not ns_last_pid.
Thanks! I did that to make sure the process gets a high PID so that I don't get a collision in the restored container, but you're right, it's not necessary. Now I tried starting the counter with `/bin/sh -c counter` but changed the restore pod to start directly with `criu restore`, so the restored process got the same PID of 7 👍
> You probably need to reproduce it locally and then you would be able to see why the process dies.
I just reproduced it with a docker container as well, but then I tried changing the program to write to a file instead of stdout, and it worked! So I think what's going on is that the restored Go program exits the moment it tries to write to stdout. Specifically, when I make it write to `/tmp/logs.txt`, the restored program continues from where it left off, given that I make that file available in that container as well. However, when I tell it to write to `/proc/self/fd/1`, it exits right after restore. Somehow, `criu` itself is able to write to `/proc/self/fd/1` as it prints all of its logs, but the restored program can't do that. Does that problem sound familiar? I'll give inheritance of external files a shot to see if it helps to pass `/proc/self/fd/1` as an inherited file.
For criu, process stdout and stderr are just pipes. CRIU restores them like any other resources, but it knows about only one end of each pipe, so it closes the other end. Actually, these pipes are external resources and CRIU needs some help to restore them properly.

Before executing `criu dump`, let's save the inode numbers of the process's stdout and stderr pipes:
```
readlink /proc/9288/fd/1 > /root/images/stdout-pipe &&
readlink /proc/9288/fd/2 > /root/images/stderr-pipe &&
criu dump -t 9288 --shell-job -D /tmp/images -v4 --leave-stopped --evasive-devices --display-stats
```
On restore, we need to specify what file descriptors should be used for these pipes:
```
$ criu restore -o restore.log -D /tmp/images --shell-job -v4 --inherit-fd fd[1]:$(cat /tmp/images/stdout-pipe) --inherit-fd fd[2]:$(cat /tmp/images/stderr-pipe)
My pid is 9288
Count is 1723
My pid is 9288
Count is 1724
My pid is 9288
Count is 1725
```
More info about restoring external file descriptors can be found here: https://criu.org/Inheriting_FDs_on_restore
Thanks @avagin ! That's what I figured too by reading how `runc` does it here: they do exactly what your example shows, saving the pipe information to a JSON file in the dump folder and reading it back during restore. I implemented it as well, saw it succeed, and was about to share my findings here.
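In rough form, that idea looks like the sketch below (this is not runc's actual code, and the `descriptors.json` file name is just a placeholder): before the dump, resolve where fds 1 and 2 of the target process point and store that next to the images; on restore, turn the stored targets into `--inherit-fd` arguments.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// fdTargets maps an fd number to the target of its /proc/<pid>/fd/<fd>
// symlink, e.g. 1 -> "pipe:[205347]".
type fdTargets map[int]string

// saveFds records where stdout and stderr of pid point before `criu dump`.
func saveFds(pid int, imageDir string) error {
	targets := fdTargets{}
	for _, fd := range []int{1, 2} {
		link, err := os.Readlink(fmt.Sprintf("/proc/%d/fd/%d", pid, fd))
		if err != nil {
			return err
		}
		targets[fd] = link
	}
	data, err := json.Marshal(targets)
	if err != nil {
		return err
	}
	return os.WriteFile(imageDir+"/descriptors.json", data, 0o600)
}

// inheritFdArgs turns the saved targets back into --inherit-fd flags for
// `criu restore`, so the dumped pipes are mapped onto the fds that criu
// inherits in the new container.
func inheritFdArgs(imageDir string) ([]string, error) {
	data, err := os.ReadFile(imageDir + "/descriptors.json")
	if err != nil {
		return nil, err
	}
	targets := fdTargets{}
	if err := json.Unmarshal(data, &targets); err != nil {
		return nil, err
	}
	var args []string
	for fd, target := range targets {
		args = append(args, "--inherit-fd", fmt.Sprintf("fd[%d]:%s", fd, target))
	}
	return args, nil
}

func main() {
	// Hypothetical usage: record the fds of PID 9288 before dumping, then
	// print the flags to append to the `criu restore` command line.
	if err := saveFds(9288, "/root/images"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	args, err := inheritFdArgs("/root/images")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(args)
}
```

Note that `saveFds` has to run where the target's `/proc/<pid>/fd` is visible, i.e. inside the same container as the dumped process.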
Example list of file descriptors of a process running inside a container (managed by runc-containerd-docker).
```
$ crit x . fds
9239
0: /dev/null
1: pipe[205347]
2: pipe[205348]
3: EVENTPOLL.5
4: pipe[208089]
5: pipe[208089]
cwd: /
root: /
```
> For criu, process stdout and stderr are just pipes.
However, I don't think it's as straightforward. When the process is started in detached mode, that's what happens, as the container manager gets to be the middleman between the user and the output of the process. But if you run the process directly (for example `runc run mycontainer`, or just from your terminal), it does get assigned TTYs.
```
$ crit x . fds
2558
0: TTY.2
1: TTY.2
2: TTY.2
3: EVENTPOLL.4
4: pipe[33822]
5: pipe[33822]
cwd: /tmp
root: /
```
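You can see the same distinction on a live process without dumping it first, by reading the `/proc/<pid>/fd` symlinks directly: pipes show up as `pipe:[inode]` and terminals as `/dev/pts/N` paths. A rough sketch (the PID is just an example):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	pid := 2558 // example PID; replace with the process you are inspecting
	dir := fmt.Sprintf("/proc/%d/fd", pid)
	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range entries {
		// Each entry is a symlink such as "pipe:[205347]" or "/dev/pts/0".
		target, err := os.Readlink(dir + "/" + e.Name())
		if err != nil {
			continue
		}
		fmt.Printf("%s: %s\n", e.Name(), target)
	}
}
```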
When `criu` restores this TTY setup, it runs a notify action called `orphan-pts-master` that's only available in swrk mode, and only if you talk to `criu swrk` through a Unix socket connection rather than a file, because that's the only way to pass a file descriptor using `SCM_RIGHTS`. I was deep into trying to figure out why I don't get that notify with go-criu, and I think that's because they were pipes all along, but also because go-criu uses the file directly and hence doesn't process the out-of-band data coming with the message. I'm planning to open a PR to make go-criu use a Unix connection like runc does and have that notify action as well. Though it turns out none of that was needed in my case; as you pointed out, it gets pipes instead of TTYs, but I got to discover that way later 🙂
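For reference, receiving a descriptor that way boils down to a plain `SCM_RIGHTS` receive over the Unix socket. A generic sketch (not go-criu's or runc's actual code), assuming `conn` is a `*net.UnixConn` already connected to `criu swrk`:

```go
package ptsrecv

import (
	"fmt"
	"net"
	"os"

	"golang.org/x/sys/unix"
)

// RecvFd reads one message from the socket and extracts the file descriptor
// passed alongside it as SCM_RIGHTS ancillary data. For the orphan-pts-master
// notification that descriptor would be the pts master.
func RecvFd(conn *net.UnixConn) (*os.File, error) {
	buf := make([]byte, 4096)
	oob := make([]byte, unix.CmsgSpace(4)) // room for one 32-bit fd

	_, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
	if err != nil {
		return nil, err
	}
	msgs, err := unix.ParseSocketControlMessage(oob[:oobn])
	if err != nil || len(msgs) == 0 {
		return nil, fmt.Errorf("no control message: %v", err)
	}
	fds, err := unix.ParseUnixRights(&msgs[0])
	if err != nil || len(fds) == 0 {
		return nil, fmt.Errorf("no fd in control message: %v", err)
	}
	return os.NewFile(uintptr(fds[0]), "pts-master"), nil
}
```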
For future readers, another problem was that when `criu` restores but then detaches itself from the restored process, the container runtime may kill the process. I'm not sure yet whether proper handling of reparenting in my PID 1 would have helped, but keeping the `criu restore` process as the restored process's parent fixed that sudden exit. In the end, I was able to dump and restore a simple counter in Kubernetes from within the container.
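If someone does want to go the PID 1 route instead of keeping `criu restore` as the parent, an init-style wrapper that launches the restore and reaps any reparented children might look roughly like the sketch below (an assumption about how it could be handled, not what was actually used in this issue):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"os/signal"

	"golang.org/x/sys/unix"
)

func main() {
	// Launch the real work (here: criu restore) as our child so that its
	// output stays attached to the container's stdout/stderr.
	cmd := exec.Command("criu", "restore", "-D", "/root/images", "--shell-job", "-v4")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// As PID 1 we become the parent of any process that gets reparented,
	// so reap children on every SIGCHLD to avoid zombies.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, unix.SIGCHLD)
	for range sigs {
		for {
			var status unix.WaitStatus
			pid, err := unix.Wait4(-1, &status, unix.WNOHANG, nil)
			if err != nil || pid <= 0 {
				break
			}
			// When the restore command itself exits, propagate its status.
			if pid == cmd.Process.Pid {
				os.Exit(status.ExitStatus())
			}
		}
	}
}
```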
I'm closing the issue as there is a way to achieve this goal without any changes to criu.
Description
We're using criu with CRI-O with the latest changes, but I'm working on another approach that does the dump and restore inside the container that's run by Kubernetes using containerd. However, even though success is reported in the restore logs, the process exits right after the restore. The program I prepared for reproducing this is a simple Go program that prints its PID and a counter value.
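The exact program is in the attached reproduce.zip; in spirit it is just a loop along these lines (a sketch, not the literal source):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	count := 0
	for {
		count++
		fmt.Printf("My pid is %d\n", os.Getpid())
		fmt.Printf("Count is %d\n", count)
		time.Sleep(time.Second)
	}
}
```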
I have a working case as well that I added the steps for. The main difference is that if I start the process in a separate terminal instead of letting k8s start it as usual, the dump I get for that process keeps running after the restoration. Running it in a separate terminal assigns an SID and PGID that are different from 1, as I confirmed by inspecting the produced images using `crit`. I had thought that maybe the problem is an SID mismatch, but in the non-working case the `criu restore` is already running as part of SID 1. It feels like it's about the SID and/or the attached file descriptors of stdin/stdout, but I couldn't be sure.

Folder to run all these commands: reproduce.zip
Steps to reproduce the non-working case:

1. `docker build --platform linux/amd64,linux/arm64 -t muvaf/criu-in-pod:v1.0 .`
2. `kubectl apply -f to-dump.yaml` (I used GKE):
   - Kubernetes: 1.29.0
   - containerd: 1.7.10
   - Kernel: 6.1.58
3. `kubectl logs to-dump`
4. `kubectl exec -it to-dump -- bash`
5. `mkdir /root/images && criu dump -t 9288 --shell-job -D /root/images -v4 --leave-stopped --evasive-devices --display-stats`
6. `crit x /root/images ps` - You'll see PID is 9288 but SID and PGID are 1.
7. `exit`
8. `rm -rf images/ && mkdir images/ && kubectl cp to-dump:/root/images/ images/`
9. `docker build --platform linux/amd64,linux/arm64 -t muvaf/criu-in-pod:v1.0-restore -f Dockerfile-restore .`
10. `kubectl apply -f to-restore.yaml`
11. `kubectl logs to-restore` and see it exits after the restore success message.

Steps to reproduce the working case:
1. Instead of the `counter` binary, I put `echo 9287 > /proc/sys/kernel/ns_last_pid && sleep 99999` in the `to-dump.yaml` file's `args` list.
2. `kubectl apply -f to-dump.yaml`
3. `kubectl logs to-dump` and see nothing, as it sleeps.
4. `kubectl exec -it to-dump -- bash`, start the `counter` binary manually, and keep that terminal tab open.
5. `kubectl exec -it to-dump -- bash` and `mkdir /root/images && criu dump -t $(pgrep counter) --shell-job -D /root/images -v4 --leave-stopped --evasive-devices --display-stats`
6. `crit x /root/images ps` and see that PID is 9297 but now PGID is 9297 and SID is 9289.
7. `exit`
8. `rm -rf images/ && mkdir images/ && kubectl cp to-dump:/root/images/ images/`
9. `docker build --platform linux/amd64,linux/arm64 -t muvaf/criu-in-pod:v1.1-restore -f Dockerfile-restore .`
10. Put `sleep 99999` in the `to-restore.yaml` file's `args` list instead of the restore command (after `-c`).
11. Update `to-restore.yaml` with the new tag (`v1.1-restore` instead of `v1.0-restore`).
12. `kubectl apply -f to-restore.yaml`
13. `kubectl exec -it to-restore -- bash`
14. `crit x /root/images ps` to check PID, PGID, and SID.
15. `criu restore -D /root/images --shell-job -v4`
Describe the results you received:

In the non-working case, the restore command succeeds, as seen in the restore logs, but the process never starts and the container exits: no log of the counter process is seen.
Describe the results you expected:

I'd have expected the process to run after restoration and print the counter values when the pod is started with the `criu restore` command. The working case is not very useful, as it doesn't attach the logs of the process to Kubernetes' log stream, and failure/success isn't reported back either since it runs in a different session.

Additional information you deem important (e.g. issue happens only occasionally):
Consistently happening.
CRIU logs and information:
dump.log restore.log
Output of `criu --version`:
```
Version: 3.19
```
Output of `criu check --all`:
```
Warn (criu/kerndat.c:1593): CRIU was built without libnftables support
Warn (criu/kerndat.c:1243): Can't keep kdat cache on non-tempfs
Error (criu/cr-check.c:1223): UFFD is not supported
Error (criu/cr-check.c:1223): UFFD is not supported
Warn (criu/cr-check.c:1346): Nftables based locking requires libnftables and set concatenations support
Looks good but some kernel features are missing which, depending on your process tree, may cause dump or restore failure.
```
Additional environment details:
- Kubernetes: 1.29.0 (GKE)
- containerd: 1.7.10
- criu: installed from `ppa:criu/ppa`
- Kernel: 6.1.58