containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

podman container restore: SEGV on rawhide #12949

Closed edsantiago closed 2 years ago

edsantiago commented 2 years ago

Reproducible on the very first try:

# podman run -d quay.io/libpod/testimage:20210610 top
9ae1d60c30ae8a9ef6d03a98c65609c66722245be101a0466dfbdf2a8b344cb7
# podman container checkpoint 9ae
9ae1d60c30ae8a9ef6d03a98c65609c66722245be101a0466dfbdf2a8b344cb7
# podman container restore 9ae
Error: OCI runtime error: crun: CRIU restoring failed -52.  Please check CRIU logfile /var/lib/containers/storage/overlay-containers/9ae1d60c30ae8a9ef6d03a98c65609c66722245be101a0466dfbdf2a8b344cb7/userdata/restore.log
# cat that file
...
task_args->pid: 1
task_args->nr_threads: 1
task_args->clone_restore_fn: 0x11db0
task_args->thread_args: 0x25540
(00.152573) pie: 1: Switched to the restorer 1
(00.154294) Error (criu/cr-restore.c:1492): 21490 stopped by signal 11: Segmentation fault
(00.155560) mnt: Switching to new ns to clean ghosts
(00.156303) Error (criu/cr-restore.c:2447): Restoring FAILED.

criu-restore.log

podman-4.0.0-0.1.rc1.fc36.x86_64 criu-3.16.1-4.fc36.x86_64 5.17.0-0.rc0.20220112gitdaadb3bd0e8d.63.fc36.x86_64
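
In the restore.log excerpt above, the lines that matter are the ones tagged `Error (`. A minimal sketch for pulling only those lines out of a CRIU log follows. The function name `criu_errors` is hypothetical, and the sketch assumes errors are always tagged the way they are in this excerpt.

```shell
# Hypothetical helper: print only the error lines of a CRIU log file.
# Assumption: CRIU tags error lines with the literal string "Error (",
# as seen in the restore.log excerpt in this report.
criu_errors() {
    # -F treats the pattern as a fixed string, so "(" needs no escaping.
    grep -F 'Error (' "$1"
}
```

For the container in this report it would be called as `criu_errors /var/lib/containers/storage/overlay-containers/9ae1d60c30ae8a9ef6d03a98c65609c66722245be101a0466dfbdf2a8b344cb7/userdata/restore.log`.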

edsantiago commented 2 years ago

Log from a real CI run

@adrianreber @rst0git PTAL

edsantiago commented 2 years ago

@lsm5 FYI this is for the other two failures

rst0git commented 2 years ago

@edsantiago this is a known issue (https://github.com/checkpoint-restore/criu/issues/1696). There is a pull request for CRIU https://github.com/checkpoint-restore/criu/pull/1706 and a workaround in https://github.com/checkpoint-restore/criu/commit/d99def7dcfa938918368c91021f72a77f738bc61

edsantiago commented 2 years ago

@rst0git thank you

@containers/podman-maintainers the above-mentioned CRIU PR has been open for one month; CI is red, and there is no indication of when it will merge or when we'll get a new version after that. Should we disable checkpoint tests in rawhide for the next few months so we can pass gating tests?

rst0git commented 2 years ago

@edsantiago I opened a pull request for go-criu with the workaround mentioned above: https://github.com/checkpoint-restore/go-criu/pull/61

This should allow the tests in CI to pass.

edsantiago commented 2 years ago

@rst0git aha - the part that was not obvious to me is that checkpoint-restore/go-criu is vendored in podman (currently v5.3.0). Presumably, if/when your PR merges, podman can bump go.mod to vendor in a (tagged? untagged?) go-criu. Leaving this here for benefit of anyone else unfamiliar with criu and its integration in podman. (If I am mistaken in anything, please correct me!) Thanks again.

adrianreber commented 2 years ago

Can you try to export GLIBC_TUNABLES=glibc.pthread.rseq=0 to see if this makes the error go away?

edsantiago commented 2 years ago

@lsm5 ^^ might be worth a try. The place to do it is the test yaml, adding a new environment stanza, but the hard part is getting that in the right place and with the right indentation and all that yamly stuff. If you feel comfortable yamling, this might be an easy way to get tests passing (assuming it works). If you're like me, and would need to spend an hour moving the minuses and spaces, it might be more trouble than it's worth.

edsantiago commented 2 years ago

@adrianreber the suggested environment variable makes no difference that I can see:

# GLIBC_TUNABLES=glibc.pthread.rseq=0 bats /usr/share/podman/test/system/*checkpoint.bats
 ✗ podman checkpoint - basic test
...
   # podman container restore 55055c549b8ae0b6ecdc2f1b7c234dd239b9096f40bf3591c3b27464cb4080fa
   Error: OCI runtime error: crun: CRIU restoring failed -52.  Please check CRIU logfile /var/lib/containers/storage/overlay-containers/55055c549b8ae0b6ecdc2f1b7c234dd239b9096f40bf3591c3b27464cb4080fa/userdata/restore.log
 ✗ podman checkpoint --export, with volumes
   # podman container restore --import=/tmp/podman_bats.NX0UiV/c_Nl0hlYaSfs.tar.gz
   Error: OCI runtime error: crun: CRIU restoring failed -52.  Please check CRIU logfile /var/lib/containers/storage/overlay-containers/7fc7b146f81bcd918e685207cd7ac310658a8f5f3a926bd0ec03a7d887917b4b/userdata/restore.log

Luap99 commented 2 years ago

I think we call the OCI runtime with a cleared environment, so GLIBC_TUNABLES will not be set for crun/runc.
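
The effect described here can be illustrated without Podman at all. The sketch below is an analogy only and is not Podman's code: `env -i` stands in for whatever clears the environment, and `/bin/sh` stands in for the OCI runtime. The function names `inherited` and `cleared` are hypothetical.

```shell
# A child started normally inherits the variable from its parent.
inherited() {
    GLIBC_TUNABLES=glibc.pthread.rseq=0 sh -c 'echo "${GLIBC_TUNABLES:-unset}"'
}

# A child started through `env -i` gets an empty environment, so the
# variable set on the command line never reaches it.
cleared() {
    GLIBC_TUNABLES=glibc.pthread.rseq=0 env -i /bin/sh -c 'echo "${GLIBC_TUNABLES:-unset}"'
}

inherited   # prints: glibc.pthread.rseq=0
cleared     # prints: unset
```

This would explain why prefixing the `bats` invocation with the variable changes nothing: the variable is dropped before the runtime starts.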

mheon commented 2 years ago

Concur. I think there is a way to set environment variables for Conmon in containers.conf but I'm also not aware of us ever needing to use it in the last 2 years, so it might not even work.

mheon commented 2 years ago

(The field in containers.conf is conmon_env_vars=[])

mheon commented 2 years ago

Now, whether Conmon will call the OCI runtime with its full environment, I don't know - I only know that we clear environment before starting Conmon.

edsantiago commented 2 years ago

@mheon thank you! With this /etc/containers/containers.conf:

[engine]
conmon_env_vars = [ "GLIBC_TUNABLES=glibc.pthread.rseq=0" ]

...I get -70 instead of -52:

# podman container restore 3bfa
Error: OCI runtime error: crun: CRIU restoring failed -70.  Please check CRIU logfile /var/lib/containers/storage/overlay-containers/3bfa78a5d79fcec27e0c38a6e880375e85fcc2ee304f18ef45ca86d38c28068a/userdata/restore.log

This time, the pointed-to log file is empty (size zero), so there's nothing to attach. (And no, I'm not out of disk space.)

The bug is almost certainly in the handling of conmon_env_vars because:

rst0git commented 2 years ago

We've updated the Fedora Rawhide package for CRIU with support for rseq: https://koji.fedoraproject.org/koji/buildinfo?buildID=1911510

edsantiago commented 2 years ago

Thank you. I've confirmed that criu-3.16.1-6.fc36 fixes the problem and is now in fc36 stable.
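
For anyone wanting to check whether a host already has the fixed package, here is a rough sketch of a version comparison. The helper name `version_ge` is hypothetical, and it relies on `sort -V`, whose ordering only approximates RPM's own version comparison; treat the result as a hint rather than an authoritative answer.

```shell
# Hypothetical helper: succeed when version string $1 is at least $2.
# Assumption: `sort -V` ordering is close enough to RPM's comparison
# for simple version-release strings such as "3.16.1-6".
version_ge() {
    # If the smaller of the two strings is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

One possible use is `version_ge "$(rpm -q --qf '%{VERSION}-%{RELEASE}' criu)" 3.16.1-6 && echo "criu has the rseq fix"`, with the threshold taken from the build confirmed above.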