containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

CI: podman checkpoint container with --pre-checkpoint not working in container testing #24230

Open Luap99 opened 1 month ago

Luap99 commented 1 month ago

With the latest image update (https://github.com/containers/podman/pull/24227) checkpoint is broken inside the container test:

→ Enter [It] podman checkpoint container with --pre-checkpoint - [/var/tmp/go/src/github.com/containers/podman/test/e2e/checkpoint_test.go:969](https://github.com/containers/podman/blob/ee70c495901ce4865b8a61290700c027eabd7937/test/e2e/checkpoint_test.go#L969) @ 10/10/24 14:04:37.825
           # podman [options] run -d --network podman5 quay.io/libpod/alpine:latest top
           6d1f1d2b3d02e8d920b33038860e7bfdf077712b3f99389a1866be88393ab22c
           # podman [options] container checkpoint -P 6d1f1d2b3d02e8d920b33038860e7bfdf077712b3f99389a1866be88393ab22c
           *** buffer overflow detected ***: terminated
           CRIU feature checking failed -52.  Please check CRIU logfile /tmp/CI_Nlm2/podman-e2e-190218032/subtest-2996264589/root/overlay-containers/6d1f1d2b3d02e8d920b33038860e7bfdf077712b3f99389a1866be88393ab22c/userdata/dump.log
           Error: `/usr/bin/crun checkpoint --image-path /tmp/CI_Nlm2/podman-e2e-190218032/subtest-2996264589/root/overlay-containers/6d1f1d2b3d02e8d920b33038860e7bfdf077712b3f99389a1866be88393ab22c/userdata/pre-checkpoint --work-path /tmp/CI_Nlm2/podman-e2e-190218032/subtest-2996264589/root/overlay-containers/6d1f1d2b3d02e8d920b33038860e7bfdf077712b3f99389a1866be88393ab22c/userdata --pre-dump 6d1f1d2b3d02e8d920b33038860e7bfdf077712b3f99389a1866be88393ab22c` failed: exit status 1

           [FAILED] Command failed with exit status 125. See above for error message.

Both "podman checkpoint container with --pre-checkpoint" and "podman checkpoint container with --pre-checkpoint and export (migration)" fail the same way.

https://api.cirrus-ci.com/v1/artifact/task/5294903477927936/html/int-podman-fedora-40-root-container-sqlite.log.html

I don't have time to look into this, so I am just going to skip these tests. I am filing this issue so we can track it.

edsantiago commented 1 month ago

See https://github.com/containers/automation_images/pull/387#issuecomment-2404942252 , in particular, the criu 4.0 update:

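|      | debian   | prior-fedora | fedora   | fedora-aws | rawhide  |
|------|----------|--------------|----------|------------|----------|
| criu | 3.17.1-3 | 3.19-2       | 4.0-1    | 3.19-4     | 4.0-1    |
|      |          |              | 3.19-6 ⇑ |            | 3.19-7 ⇑ |
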
Luap99 commented 1 month ago

Reproducer:

$ sudo bin/podman run --rm --privileged --net=host --cgroupns=host -v /var/lib/containers -v $(pwd):/repo -w /repo -v /tmp:/tmp -it quay.io/libpod/fedora_podman:c20241010t105554z-f40f39d13 bash

[root@pholzing-fedora repo]# bin/podman run -d --name test quay.io/libpod/alpine:latest top
8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8
[root@pholzing-fedora repo]# bin/podman container checkpoint -P test
*** buffer overflow detected ***: terminated
2024-10-11T14:58:53.008984Z: CRIU feature checking failed -52.  Please check CRIU logfile /var/lib/containers/storage/overlay-containers/8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8/userdata/dump.log
Error: `/usr/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8/userdata --pre-dump 8a080765b0f5aed1138e6ffb0d6c1c04a48aee93cf96776ba7059b6e775e8be8` failed: exit status 1

And the criu logfile was empty so nothing useful to see in there.

Using a normal fedora image as the base and then installing podman does not seem to reproduce the problem. I tried both criu-3.19-4 and criu-4.0-1, so there must be some magic in our special test image.

@adrianreber @rst0git Any ideas what could cause `*** buffer overflow detected ***: terminated`?

rst0git commented 1 month ago

@Luap99 Would it be possible to confirm if the error appears with both runc and crun, or only with crun?

Luap99 commented 1 month ago

Well, this is fun: now I am no longer able to reproduce using the steps from above, so I cannot tell.

rst0git commented 1 month ago

@Luap99 I was able to replicate the error locally with the following commands, and confirmed that it appears with both runc and crun:

cd ~/go/src/github.com/containers/podman
sudo podman run --rm --privileged --net=host --cgroupns=host -v /var/lib/containers -v $(pwd):/repo -w /repo -v /tmp:/tmp -it quay.io/libpod/fedora_podman:c20241010t105554z-f40f39d13 bash

# bin/podman run -d --name test quay.io/libpod/alpine:latest top
# bin/podman container checkpoint -P test

It looks like CRIU fails with the following error:

(00.124597) Putting tsock into pid 380229
(00.125016) Wait for parasite being daemonized...
(00.125031) Wait for ack 2 on daemon socket
(00.125271) Error (compel/src/lib/infect-rpc.c:44): Message reply from daemon is trimmed (12/0)
(00.125297) Error (compel/src/lib/infect.c:726): Can't switch parasite 380229 to daemon mode 0
(00.125323) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.125327) Error (compel/src/lib/ptrace.c:96): Can't poke 380229 @ 0x5573bb6df000 from 0x7ffef62e4418 sized 8
(00.125334) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.125337) Error (compel/src/lib/ptrace.c:100): Can't restore the original data with poke
(00.125341) Error (compel/src/lib/infect.c:637): Can't inject syscall blob (pid: 380229)
(00.125345) Warn  (criu/parasite-syscall.c:439): Can't cure failed infection
(00.125349) Error (criu/cr-dump.c:1493): Can't infect (pid: 380229) with parasite
(00.125426) Unfreezing tasks into 1
(00.125431)     Unseizing 380229 into 1
(00.125438) Error (compel/src/lib/infect.c:418): Unable to detach from 380229: No such process
(00.125451) Writing image inventory (version 1)
(00.125719) Error (criu/cr-dump.c:1905): Pre-dumping FAILED.

dump.log
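
For anyone skimming a longer dump.log, here is a minimal sketch of pulling out only the `Error` and `Warn` lines. It assumes every such line has the `(timestamp) Level (file:line): message` shape seen in the excerpt above. `logEntry`, `criuLine` and `filterProblems` are names made up for this illustration; they are not part of CRIU or Podman.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// logEntry is one Error or Warn line from a CRIU log (hypothetical type).
type logEntry struct {
	Timestamp string
	Level     string
	Location  string
	Message   string
}

// criuLine matches lines such as
// "(00.125271) Error (compel/src/lib/infect-rpc.c:44): Message ...".
// The pattern is an assumption based on the excerpt in this thread.
var criuLine = regexp.MustCompile(`^\((\d+\.\d+)\)\s+(Error|Warn)\s+\(([^)]+)\):\s+(.*)$`)

// filterProblems returns the Error and Warn entries found in log, in order.
func filterProblems(log string) []logEntry {
	var entries []logEntry
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		m := criuLine.FindStringSubmatch(strings.TrimSpace(sc.Text()))
		if m == nil {
			continue // informational line, skip it
		}
		entries = append(entries, logEntry{
			Timestamp: m[1],
			Level:     m[2],
			Location:  m[3],
			Message:   m[4],
		})
	}
	return entries
}

func main() {
	sample := `(00.125016) Wait for parasite being daemonized...
(00.125271) Error (compel/src/lib/infect-rpc.c:44): Message reply from daemon is trimmed (12/0)
(00.125345) Warn  (criu/parasite-syscall.c:439): Can't cure failed infection
(00.125426) Unfreezing tasks into 1`
	for _, e := range filterProblems(sample) {
		fmt.Printf("%s %-5s %s: %s\n", e.Timestamp, e.Level, e.Location, e.Message)
	}
}
```

In the excerpt above, the first entry this returns is the "Message reply from daemon is trimmed" error, which comes before the `POKEDATA failed: No such process` errors.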

rst0git commented 1 month ago

I also noticed that the message *** buffer overflow detected *** appears with crun but not with runc:

crun:

DEBU[0000] the args to checkpoint: /usr/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata --pre-dump 3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9 
*** buffer overflow detected ***: terminated
2024-10-15T17:31:49.172489Z: CRIU feature checking failed -52.  Please check CRIU logfile /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata/dump.log
Error: `/usr/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9/userdata --pre-dump 3fbe9360c80bc925ff1f013624c2e31346448ddba08b8194d8f83749edec95c9` failed: exit status 1
DEBU[0000] Shutting down engines                        
INFO[0000] Received shutdown.Stop(), terminating!        PID=37015

runc:

DEBU[0000] the args to checkpoint: /usr/bin/runc checkpoint --image-path /var/lib/containers/storage/overlay-containers/1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5/userdata --pre-dump 1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5 
ERRO[0000] CRIU feature check failed                    
Error: `/usr/bin/runc checkpoint --image-path /var/lib/containers/storage/overlay-containers/1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5/userdata/pre-checkpoint --work-path /var/lib/containers/storage/overlay-containers/1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5/userdata --pre-dump 1a9049b53a4ddc54bff3f1bd18abd6e3f19c0c33ef43dac74dff1769ee479ee5` failed: exit status 1
DEBU[0000] Shutting down engines                        
INFO[0000] Received shutdown.Stop(), terminating!        PID=36877

rst0git commented 1 month ago

@adrianreber Do you have any ideas what may cause crun and runc to fail with "CRIU feature checking failed"?

It is worth noting that `criu check --feature mem_dirty_track` reports that mem_dirty_track is supported, and that the error disappears with the following change in Podman:

--- a/utils/utils.go
+++ b/utils/utils.go
@@ -39,7 +39,7 @@ func ExecCmdWithStdStreams(stdin io.Reader, stdout, stderr io.Writer, env []stri
        cmd.Stdin = stdin
        cmd.Stdout = stdout
        cmd.Stderr = stderr
-       cmd.Env = env
+       // cmd.Env = env

        err := cmd.Run()
        if err != nil {

github-actions[bot] commented 1 week ago

A friendly reminder that this issue had no activity for 30 days.