checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

Criu restore failed with segmentation fault inside container #2229

Closed indusai99 closed 11 months ago

indusai99 commented 1 year ago

Description:

I have a Fedora-based Docker environment, and I run the container in privileged mode. I installed the CRIU 3.15 packages in the container from the following RPM files: criu-3.15-3.fc34.x86_64.rpm and criu-libs-3.15-3.fc34.x86_64.rpm

I have written a C program and compiled it with gcc. dump_cli.txt

Dump: I dump the above program with the following command:
sudo criu dump -t -D logs -v4 -shell-job -R
Result: The dump is successful.

Restore: After the successful dump, I try to restore with the following command:
sudo criu restore -D logs -v4 --shell-job
Result: The restore fails with the following error:
(00.204853) Error (criu/cr-restore.c:1562): 206 killed by signal 11: Segmentation fault
(00.204903) Error (criu/cr-restore.c:2483): Restoring FAILED.
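When a dump or restore log is long, it can help to pull out only the `Error` and `Warn` lines. The following is a minimal sketch that assumes the line layout seen in the excerpts quoted in this thread (an optional timestamp in parentheses, a level, then `(file:line):` and a message). The names `LOG_LINE` and `find_problems` are hypothetical helpers, not part of CRIU.

```python
import re
from typing import Dict, List, Optional

# Pattern inferred from the log excerpts in this thread; it is an assumption,
# not a documented CRIU log format.
LOG_LINE = re.compile(
    r"^(?:\((?P<timestamp>\d+\.\d+)\)\s+)?"      # optional "(00.204853)"
    r"(?P<level>Error|Warn)\s+"                  # severity
    r"\((?P<source>[^:()]+):(?P<lineno>\d+)\):"  # "(criu/cr-restore.c:1562):"
    r"\s*(?P<message>.*)$"
)


def find_problems(log_text: str) -> List[Dict[str, Optional[str]]]:
    """Return one dict per Error/Warn line found in log_text."""
    problems = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match:
            problems.append(match.groupdict())
    return problems


sample = """\
(00.000054) File /run/criu/criu.kdat does not exist
(00.204853) Error (criu/cr-restore.c:1562): 206 killed by signal 11: Segmentation fault
(00.204903) Error (criu/cr-restore.c:2483): Restoring FAILED.
"""

for problem in find_problems(sample):
    print(problem["source"], problem["lineno"], problem["message"])
```

Lines that carry no `Error` or `Warn` level, such as the `criu.kdat` notice in the sample, are skipped.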

Criu Information inside docker:

Additional Information:

Please let me know if any info is required.

indusai99 commented 1 year ago

@adrianreber, I request your support to resolve the issue

mihalicyn commented 1 year ago

CRIU version 3.15 is too old and does not support rseq C/R. You need to use latest CRIU version.

adrianreber commented 1 year ago

That is an easy one. Especially as I just had to deal with a similar bug on RHEL.

New userspace (glibc) will use rseq on any kernel that provides it (rseq was added in Linux 4.18). CRIU needs at least kernel 5.13 to checkpoint such a process. You should not use new userspace with an older kernel in combination with CRIU. Using an older version of CRIU does not really help. I think you need at least kernel 5.13 or a backport of the corresponding ptrace call.

So your combination of kernel and userspace just doesn't work.
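The kernel requirement stated above can be checked with a few lines of code. This is a rough sketch, not a CRIU feature: the 5.13 threshold comes from the comment above, the function names are hypothetical, and a plain version comparison cannot detect a distribution kernel that carries a backport of the ptrace call. Inside a container the reported release is that of the host kernel.

```python
import platform
import re
from typing import Tuple

# Threshold taken from the comment above ("at least kernel 5.13").
MIN_KERNEL_FOR_RSEQ_CR = (5, 13)


def parse_kernel_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as '5.12.0-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_can_checkpoint_rseq(release: str) -> bool:
    """True if the release is at least 5.13 (backports are not detected)."""
    return parse_kernel_release(release) >= MIN_KERNEL_FOR_RSEQ_CR


if __name__ == "__main__":
    release = platform.release()  # same value as `uname -r`
    print(release, kernel_can_checkpoint_rseq(release))
```

For the kernel discussed in this issue, `kernel_can_checkpoint_rseq("5.12.0")` evaluates to `False`.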

indusai99 commented 1 year ago

Thanks @adrianreber. Any map available to check such compatibility information?

indusai99 commented 1 year ago

@adrianreber, so you are saying that CRIU 3.15 inside a Docker environment with kernel 5.12 is not supported, and that we need the latest CRIU version and kernel 5.13 or newer?

adrianreber commented 1 year ago

> Any map available to check such compatibility information?

I am not aware that something like this exists.

indusai99 commented 1 year ago

> CRIU version 3.15 is too old and does not support rseq C/R. You need to use latest CRIU version.

@mihalicyn, I downgraded CRIU from 3.18 to 3.15 because my kernel (5.12) is older and does not support ptrace_rseq_configuration. I am looking for a way to resolve the restore failure (segmentation fault) while using CRIU 3.15 with kernel 5.12 inside Docker.

adrianreber commented 1 year ago

> you are saying that CRIU 3.15 inside a Docker environment with kernel 5.12 is not supported, and that we need the latest CRIU version and kernel 5.13 or newer?

Exactly. If you have a new userspace that uses rseq.
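One way to guess whether a userspace is "new" in this sense is to look at the glibc version. The sketch below rests on an assumption that is not stated in this thread: glibc registers rseq by default starting with release 2.35 (and offers the `glibc.pthread.rseq` tunable to turn it off). The helper names are hypothetical, and the result is only a hint, since registration also depends on the kernel offering the rseq system call.

```python
import os
import re
from typing import Optional, Tuple

# Assumption: glibc enables rseq registration by default from 2.35 onwards.
RSEQ_DEFAULT_SINCE = (2, 35)


def parse_glibc_version(text: str) -> Optional[Tuple[int, int]]:
    """Parse a string such as 'glibc 2.35' into (2, 35)."""
    match = re.search(r"glibc\s+(\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def glibc_registers_rseq_by_default(text: str) -> bool:
    """True if the reported glibc version is 2.35 or newer."""
    version = parse_glibc_version(text)
    return version is not None and version >= RSEQ_DEFAULT_SINCE


if __name__ == "__main__":
    try:
        # On glibc systems this returns a string such as "glibc 2.35".
        reported = os.confstr("CS_GNU_LIBC_VERSION") or ""
    except (AttributeError, ValueError, OSError):
        reported = ""  # not a glibc system, or confstr is unavailable
    print(repr(reported), glibc_registers_rseq_by_default(reported))
```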

mihalicyn commented 1 year ago

> the reason I downgraded criu from 3.18 to 3.15 is my kernel version(5.12) is older. This won't support ptrace_rseq_configuration.

That approach is a bad idea, because the ptrace_rseq_configuration check is a precaution for you as a user. It prevents you from shooting yourself in the foot by producing an improper dump of the process state that will eventually fail with a segfault or something similar.

We never make incompatible changes in new CRIU releases that would prevent CRIU from working on older systems (userspace/kernel) without a serious reason. This is a good example of that approach: the latest CRIU version does not require the ptrace_rseq_configuration feature to be present all the time; it requires the feature only if your processes are using rseq (https://github.com/checkpoint-restore/criu/blob/53dd6ba74c4b8fed95d9c2292aae191b12c3977a/criu/pie/parasite.c#L325).

So a good policy is to always use the latest stable CRIU version. It should be compatible with all previous kernel versions and all previous userspace versions. If it is not, that is a CRIU bug and should be reported.

The problem you have hit is not about CRIU; it is a consequence of using a modern userspace with a kernel that is too old for it. In that case CRIU will never work. You need to either update your kernel or downgrade your userspace; those are the only options. I would suggest updating the kernel. It's easy and safe.
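The rule described in this comment boils down to a small truth table. The sketch below is an illustration of that rule only, not CRIU's actual code; the function name and both parameters are hypothetical.

```python
def dump_is_safe(process_uses_rseq: bool, kernel_has_ptrace_rseq: bool) -> bool:
    """The kernel feature matters only when the process actually uses rseq."""
    return (not process_uses_rseq) or kernel_has_ptrace_rseq


# Enumerate all four combinations.
for uses_rseq in (False, True):
    for has_feature in (False, True):
        verdict = "ok" if dump_is_safe(uses_rseq, has_feature) else "refuse to dump"
        print(f"uses_rseq={uses_rseq!s:5} kernel_feature={has_feature!s:5} -> {verdict}")
```

Only one combination is refused: a process that uses rseq on a kernel without the feature, which is the situation reported in this issue.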

indusai99 commented 1 year ago

Thanks for the info @mihalicyn

indusai99 commented 1 year ago

I have downgraded my userspace and installed CRIU 3.15 inside Docker. Now criu dump and restore are successful. But when I run sudo criu check in my Docker environment, I see the following errors:

Error (criu/util.c:631): exited, status=1
Error (criu/util.c:631): exited, status=1
Warn (criu/kerndat.c:877): Can't keep kdat cache on non-tempfs

I see the same errors in my dump.log and restore.log, even though dump and restore are successful.

(00.000054) File /run/criu/criu.kdat does not exist
(00.000066) sockets: Probing sock diag modules
(00.000094) sockets: Done probing
(00.000693) Error (criu/util.c:635): exited, status=1
(00.001291) Error (criu/util.c:635): exited, status=1

Any idea about this?

Note: I also upgraded my kernel to 6.2, installed CRIU 3.16 in Docker, and tried again. Dump and restore are successful there too, but I still see the same errors from criu check and in the dump and restore logs.
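The `Can't keep kdat cache on non-tempfs` warning quoted above suggests checking which filesystem backs the cache directory. The sketch below parses mount-table text in the `/proc/mounts` layout and reports the filesystem type for a path. It assumes the cache lives under `/run/criu`, as the `criu.kdat` path in the log implies. `fstype_for` is a hypothetical helper, and mount points containing escaped spaces are not handled.

```python
from typing import Optional


def fstype_for(path: str, mounts_text: str) -> Optional[str]:
    """Return the filesystem type of the longest mount point containing path."""
    best_len = -1
    best_type = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same point win, as it does in practice.
            if len(mount_point) >= best_len:
                best_len = len(mount_point)
                best_type = fstype
    return best_type


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as handle:
            print("/run/criu is on:", fstype_for("/run/criu", handle.read()))
    except OSError as error:
        print("could not read /proc/mounts:", error)
```

If the result is something other than `tmpfs` (for example `overlay` in a container without a tmpfs mounted on `/run`), that would be consistent with the warning.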


avagin commented 11 months ago

@indusai99 these errors have been suppressed by 4d67f67818.