Open Tianyang-Zhang opened 1 year ago
A friendly reminder that this issue had no activity for 30 days.
Could anyone help with this issue?
did you set the SO_REUSEADDR option when you create the socket?
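For reference, a minimal sketch of what setting that option before `bind()` looks like. It is written in Python rather than in the language of the reporter's program, which is not shown in this issue. The helper name `make_listener` is hypothetical, and port 0 is used so the kernel picks a free port.

```python
import socket


def make_listener(host="127.0.0.1", port=0, reuse_addr=True):
    """Create a listening TCP socket, optionally setting SO_REUSEADDR before bind().

    Hypothetical helper for illustration only; it is not part of CRIU or of the
    reporter's program.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if reuse_addr:
        # The option has to be set before bind() to influence the bind check.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(16)
    return sock


def has_reuse_addr(sock):
    """Read the option back; the kernel reports a non-zero value when it is set."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
```

Calling `has_reuse_addr()` on the socket in the original program (or the equivalent `getsockopt()` call) shows whether the option was set at dump time.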
Looking into the code, I see one problem with SO_REUSEPORT, but it does not seem to be related to this issue...
See how in `post_open_inet_sk()` we say "/* SO_REUSEADDR is set for all sockets */", meaning that CRIU first restores all sockets with SO_REUSEADDR and SO_REUSEPORT set for CRIU's own needs, and only then restores them to their original dumped state.
But as you can also see in `post_open_inet_sk()`, TCP connections are handled differently: restoring their options is delayed to `prepare_tcp_socks()`, where we only care about the addr reuse part and not about port reuse.
But in your logs I don't see a message "pie: Turning repair off for" which would indicate that the code flow passed the above.
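To illustrate why a leftover SO_REUSEPORT would matter, here is a small sketch of plain Linux socket behavior (kernel 3.9 or later), independent of CRIU. The helper names `bind_listener` and `can_share_port` are hypothetical. It only shows that two listeners can share a port when both set SO_REUSEPORT; it does not claim to explain why only the most recently restored process receives messages.

```python
import errno
import socket


def bind_listener(port, reuse_port):
    """Bind and listen on 127.0.0.1:port, optionally with SO_REUSEPORT set."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse_port:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind(("127.0.0.1", port))
        sock.listen(8)
    except OSError:
        sock.close()
        raise
    return sock


def can_share_port(reuse_port):
    """Return True if a second listener can bind the port of a live first one."""
    first = bind_listener(0, reuse_port)  # port 0: let the kernel choose
    port = first.getsockname()[1]
    try:
        second = bind_listener(port, reuse_port)
    except OSError as exc:
        if exc.errno != errno.EADDRINUSE:
            raise
        return False
    finally:
        first.close()
    second.close()
    return True
```

On Linux, `can_share_port(False)` is expected to hit "address already in use", while `can_share_port(True)` is expected to succeed because both sockets carry the option.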
Description
I have a simple program that opens a TCP socket, binds it to address 0.0.0.0 and port 5050, and listens. If I checkpoint it, kill the original process (if the original process is not killed, the behavior is different, as described below), and restore the same snapshot multiple times into different PID namespaces, all of the restored processes end up bound to address 0.0.0.0 and port 5050. But only the most recently restored process can receive messages through the socket. If I kill the last-restored process, the second-to-last restored one becomes able to receive messages.
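The program itself is not attached to this issue, so the following is only a sketch of a comparable listener, assuming nothing beyond what the description states (a TCP socket bound to 0.0.0.0:5050 that listens and receives messages). The function names `open_listener` and `receive_one` are hypothetical, and whether the original program sets any socket options is unknown, so none are set here.

```python
import socket


def open_listener(host="0.0.0.0", port=5050):
    """Open a TCP socket, bind it to host:port, and start listening."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))
    sock.listen(8)
    return sock


def receive_one(listener):
    """Accept a single connection and return the peer address and first chunk of data."""
    conn, peer = listener.accept()
    with conn:
        data = conn.recv(4096)
    return peer, data
```

Calling `open_listener()` with its defaults and then looping over `receive_one()` gives a long-running process similar to the one that was checkpointed.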
If the original process is not killed, all restores fail with the `address already in use` error, which is expected. However, if the original process is killed, the restore succeeds.
I tried to trace the cause. During the restore, there is a call to `close_service_fd(TRANSPORT_FD_OFF)` near the end of `cr-restore.c::sigreturn_restore()`. If one of the CRIU restore processes restores the socket but then sleeps before calling `close_service_fd(TRANSPORT_FD_OFF)`, all other restores fail with `addr in use`. After `close_service_fd(TRANSPORT_FD_OFF)` is called, the address and port are somehow free to bind in another CRIU restore process (but binding still fails if attempted outside of CRIU). I haven't figured out how `close_service_fd(TRANSPORT_FD_OFF)` is related to the socket bind.
Steps to reproduce the issue:
sudo criu dump --tree <pid> --images-dir <dir> --leave-running --shell-job --tcp-close
sudo unshare --pid --fork --mount-proc
Describe the results you received: All of the restores succeed (only one was expected to succeed), but only the last one can receive messages through that TCP socket.
Describe the results you expected: Only the first restore succeeds, and all other restores fail with the `address already in use` error.
Additional information you deem important (e.g. issue happens only occasionally): If the original process is kept running, the behavior is correct: all restores to a new PID namespace fail with the `addr in use` error.
CRIU logs and information:
CRIU full dump/restore logs:
```
[dump_log.txt](https://github.com/checkpoint-restore/criu/files/11193484/dump_log.txt)
Restore log when restoring to a new PID namespace and the original process is still alive, fails with `addr in use` as expected:
[restore_to_PID-NS_log_when_original_proc_alive.txt](https://github.com/checkpoint-restore/criu/files/11193495/no_kill_restore_log.txt)
Restore logs when restoring 2 times, to the host PID namespace and to a new PID namespace, both succeed (expected the 2nd to fail):
[kill_original_proc_and_res_to_host_PID-NS.txt](https://github.com/checkpoint-restore/criu/files/11193502/res_to_host.txt)
[kill_original_proc_and_res_to_new_PID-NS.txt](https://github.com/checkpoint-restore/criu/files/11193503/res_to_new.txt)
```
Output of `criu --version`:
```
Version: 3.17.1
GitID: v3.17.1
```
Output of `criu check --all`:
```
./criu/criu check --all
Warn (criu/cr-check.c:1231): clone3() with set_tid not supported
Error (criu/cr-check.c:1273): Time namespaces are not supported
Error (criu/cr-check.c:1283): IFLA_NEW_IFINDEX isn't supported
Warn (criu/cr-check.c:1300): Pidfd store requires pidfd_open syscall which is not supported
Warn (criu/cr-check.c:1334): Nftables based locking requires libnftables and set concatenations support
Warn (criu/cr-check.c:804): ptrace(PTRACE_GET_RSEQ_CONFIGURATION) isn't supported. C/R of processes which are using rseq() won't work.
Looks good but some kernel features are missing which, depending on your process tree, may cause dump or restore failure.
```
Additional environment details: OS:
Kernel: