checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

Restoration Fails with Open UDP Socket. Is There a Way to Ignore error and Proceed with Restoration? #2464

Open wjstk16 opened 3 months ago

wjstk16 commented 3 months ago

Hi all,

I am attempting to checkpoint and restore a container with an open UDP socket. Unlike TCP, UDP is a connectionless communication protocol and doesn't maintain a stateful connection. However, restoration might require retransmission due to data buffered in the kernel. My goal is for the UDP container to continue communication seamlessly after restoration, even if retransmissions occur. Unfortunately, I encounter the following error, which causes the restoration to fail.

I used Podman to create a UDP server container and a UDP client container, and the checkpoint/restore process works fine when performed on the same host machine. However, when I transfer the checkpointed tar file to a remote host and attempt to restore, I encounter the error mentioned below.

Here is the log output:

(00.108314) mnt: Switching to new ns to clean ghosts
(00.109452) net: Unlock network
(00.109496) Running network-unlock scripts
(00.109509)     RPC
(00.156697) pie: 1: seccomp: Restoring mode 1 flags 0x1 on tid 1 filter 0
(00.160377) pie: 1: seccomp: Restored mode 2 on tid 1
(00.160563) pie: 1: restoring lsm profile (current) changeprofile containers-default-0.44.4
(00.160707) pie: 1: Error (criu/pie/restorer.c:192): can't write lsm profile -2
(00.179980) pie: 1: Error (criu/pie/restorer.c:2168): BUG at criu/pie/restorer.c:2168
(00.180092) Error (compel/src/lib/infect.c:1612): Task 4148077 is in unexpected state: b7f
(00.180171) Error (compel/src/lib/infect.c:1618): Task stopped with 11: Segmentation fault
(00.180201) Error (criu/cr-restore.c:2469): Can't stop all tasks on rt_sigreturn
(00.180212) Error (criu/cr-restore.c:2530): Killing processes because of failure on restore.
The Network was unlocked so some data or a connection may have been lost.
(00.181450) Error (criu/mount.c:3689): mnt: Can't remove the directory /tmp/.criu.mntns.bVhQ14: No such file or directory
(00.181473) Error (criu/cr-restore.c:2557): Restoring FAILED.

I want to restore a process with an open UDP socket, even if the restoration is not as complete as it is for TCP and retransmission is necessary. Is there a way to ignore these errors and proceed with the restoration, similar to the --tcp-close option?

Attachments: criu.log

Any assistance would be greatly appreciated. Thanks.

adrianreber commented 3 months ago

The error you see has nothing to do with UDP. You do not provide much information about the systems you are using, but it seems you are using Ubuntu with AppArmor enabled. During restore CRIU tries to restore the AppArmor profile and it fails:

(00.160563) pie: 1: restoring lsm profile (current) changeprofile containers-default-0.44.4

I have never tested Podman with AppArmor, so I do not know if that works. I know it works in combination with SELinux, so you could retry it on Fedora/CentOS/RHEL. Or try to disable AppArmor.
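As a rough illustration of how the log points at AppArmor rather than UDP, the sketch below pulls the LSM profile name and the failing return value out of the two relevant log lines. It assumes the -2 is a negated errno, which is common for raw syscall return values but is not confirmed in this thread; the `diagnose` helper is a hypothetical name used only for this example.

```python
import errno
import re

# Sample lines copied from the log in this issue.
LOG = """\
(00.160563) pie: 1: restoring lsm profile (current) changeprofile containers-default-0.44.4
(00.160707) pie: 1: Error (criu/pie/restorer.c:192): can't write lsm profile -2
"""

PROFILE_RE = re.compile(r"restoring lsm profile \(\w+\) changeprofile (\S+)")
WRITE_ERR_RE = re.compile(r"can't write lsm profile (-?\d+)")


def diagnose(log_text):
    """Return (profile_name, errno_name); each is None if not found in the log."""
    profile = PROFILE_RE.search(log_text)
    err = WRITE_ERR_RE.search(log_text)
    profile_name = profile.group(1) if profile else None
    errno_name = None
    if err:
        # Assumption: the logged value is a negated errno.
        code = abs(int(err.group(1)))
        errno_name = errno.errorcode.get(code, "unknown")
    return profile_name, errno_name


print(diagnose(LOG))
```

Under that assumption, the value 2 maps to ENOENT ("No such file or directory"), which would be consistent with the named profile not being available on the restore host.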

avagin commented 3 months ago

@wjstk16 Before disabling AppArmor, you need to check that you have containers-default-0.44.4 on the remote machine. I think it hasn't been installed there, and that is the issue.
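One possible way to run that check on the remote machine is sketched below. It assumes securityfs is mounted at its usual location and that the loaded-profiles file lists one profile per line in the form "name (mode)"; reading it typically requires root. The helper names `parse_profiles` and `profile_loaded` are hypothetical and exist only for this example.

```python
from pathlib import Path

# Assumption: securityfs is mounted at the usual location.
PROFILES_FILE = Path("/sys/kernel/security/apparmor/profiles")


def parse_profiles(text):
    """Parse lines of the form 'name (mode)' into a set of profile names."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split off the trailing " (mode)" part, if present.
        name, sep, _mode = line.rpartition(" (")
        names.add(name if sep else line)
    return names


def profile_loaded(name, profiles_file=PROFILES_FILE):
    """Return True/False if the profile list is readable, None otherwise."""
    try:
        text = Path(profiles_file).read_text()
    except OSError:
        # File missing (AppArmor not enabled) or not readable (not root).
        return None
    return name in parse_profiles(text)


# Example call on the restore host:
# profile_loaded("containers-default-0.44.4")
```

A result of False on the restore host would support the explanation above; None only means the list could not be read, so it says nothing either way.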

github-actions[bot] commented 2 months ago

A friendly reminder that this issue had no activity for 30 days.