Open Luap99 opened 1 month ago
See https://github.com/containers/automation_images/pull/387#issuecomment-2404942252 , in particular, the criu 4.0 update:
| | debian | prior-fedora | fedora | fedora-aws | rawhide |
|---|---|---|---|---|---|
| criu | 3.17.1-3 | 3.19-2 | 4.0-1 | 3.19-4 | 4.0-1 |
| | | | 3.19-6 ⇑ | | 3.19-7 ⇑ |
Reproducer:
$ sudo bin/podman run --rm --privileged --net=host --cgroupns=host -v /var/lib/containers -v $(pwd):/repo -w /repo -v /tmp:/tmp -it quay.io/libpod/fedora_podman:c20241010t105554z-f40f39d13 bash
[root@pholzing-fedora repo]# bin/podman run -d --name test quay.io/libpod/alpine:latest top
8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8
[root@pholzing-fedora repo]# bin/podman container checkpoint -P test
*** buffer overflow detected ***: terminated
2024-10-11T14:58:53.008984Z: CRIU feature checking failed -52. Please check CRIU logfile /var/lib/containers/storage/overlay-containers/8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8/userdata/dump.log
Error: `/usr/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8/userdata --pre-dump 8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8` failed: exit status 1
And the criu logfile was empty so nothing useful to see in there.
Using a normal fedora image as the base and then installing podman does not seem to reproduce the problem. I tried both criu-3.19-4 and criu-4.0-1, so there must be some magic in our special test image.
@adrianreber @rst0git Any ideas what could cause "*** buffer overflow detected ***: terminated"?
@Luap99 Would it be possible to confirm if the error appears with both runc and crun, or only with crun?
Well, this is fun: now I am no longer able to reproduce using the steps from above, so I cannot tell.
@Luap99 I was able to replicate the error locally with the following commands, and can confirm that it appears with both runc and crun:
cd ~/go/src/github.com/containers/podman
sudo podman run --rm --privileged --net=host --cgroupns=host -v /var/lib/containers -v $(pwd):/repo -w /repo -v /tmp:/tmp -it quay.io/libpod/fedora_podman:c20241010t105554z-f40f39d13 bash
# bin/podman run -d --name test quay.io/libpod/alpine:latest top
# bin/podman container checkpoint -P test
It looks like CRIU fails with the following error:
(00.124597) Putting tsock into pid 380229
(00.125016) Wait for parasite being daemonized...
(00.125031) Wait for ack 2 on daemon socket
(00.125271) Error (compel/src/lib/infect-rpc.c:44): Message reply from daemon is trimmed (12/0)
(00.125297) Error (compel/src/lib/infect.c:726): Can't switch parasite 380229 to daemon mode 0
(00.125323) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.125327) Error (compel/src/lib/ptrace.c:96): Can't poke 380229 @ 0x5573bb6df000 from 0x7ffef62e4418 sized 8
(00.125334) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.125337) Error (compel/src/lib/ptrace.c:100): Can't restore the original data with poke
(00.125341) Error (compel/src/lib/infect.c:637): Can't inject syscall blob (pid: 380229)
(00.125345) Warn (criu/parasite-syscall.c:439): Can't cure failed infection
(00.125349) Error (criu/cr-dump.c:1493): Can't infect (pid: 380229) with parasite
(00.125426) Unfreezing tasks into 1
(00.125431) Unseizing 380229 into 1
(00.125438) Error (compel/src/lib/infect.c:418): Unable to detach from 380229: No such process
(00.125451) Writing image inventory (version 1)
(00.125719) Error (criu/cr-dump.c:1905): Pre-dumping FAILED.
I also noticed that the message "*** buffer overflow detected ***" appears with crun but not with runc:
crun:
DEBU[0000] the args to checkpoint: /usr/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata --pre-dump 3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9
*** buffer overflow detected ***: terminated
2024-10-15T17:31:49.172489Z: CRIU feature checking failed -52. Please check CRIU logfile /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata/dump.log
Error: `/usr/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata --pre-dump 3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9` failed: exit status 1
DEBU[0000] Shutting down engines
INFO[0000] Received shutdown.Stop(), terminating! PID=37015
runc:
DEBU[0000] the args to checkpoint: /usr/bin/runc checkpoint --image-path /var/lib/containers/storage/overlay-containers/1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5/userdata --pre-dump 1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5
ERRO[0000] CRIU feature check failed
Error: `/usr/bin/runc checkpoint --image-path /var/lib/containers/storage/overlay-containers/1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5/userdata --pre-dump 1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5` failed: exit status 1
DEBU[0000] Shutting down engines
INFO[0000] Received shutdown.Stop(), terminating! PID=36877
@adrianreber Do you have any ideas what may cause crun and runc to fail with "CRIU feature checking failed"?
It is worth noting that criu check --feature mem_dirty_track shows "mem_dirty_track is supported", and the error disappears with the following change in Podman:
--- a/utils/utils.go
+++ b/utils/utils.go
@@ -39,7 +39,7 @@ func ExecCmdWithStdStreams(stdin io.Reader, stdout, stderr io.Writer, env []stri
 	cmd.Stdin = stdin
 	cmd.Stdout = stdout
 	cmd.Stderr = stderr
-	cmd.Env = env
+	// cmd.Env = env
 	err := cmd.Run()
 	if err != nil {
A friendly reminder that this issue had no activity for 30 days.
With the latest image update (https://github.com/containers/podman/pull/24227) checkpoint is broken inside the container test:
Both "podman checkpoint container with --pre-checkpoint" and "podman checkpoint container with --pre-checkpoint and export (migration)" fail the same way:
https://api.cirrus-ci.com/v1/artifact/task/5294903477927936/html/int-podman-fedora-40-root-container-sqlite.log.html
I don't have time to look into this, so I am just going to skip these tests. I am filing this issue so we can track it.