checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

Question: Is it okay to simply close a connection in the TCP_CLOSE state upon restore? #2286

Open aegiryy opened 9 months ago

aegiryy commented 9 months ago

Description

We hit the following error when restoring a TCP connection:

(00.430332)    718: inet:       Restore: family AF_INET6   type SOCK_STREAM    proto IPPROTO_TCP      port 34622 state TCP_CLOSE        src_addr ::ffff:10.4.104.194
(00.430342)    718: tcp: Restoring TCP connection
(00.430347)    718: tcp: Restoring TCP connection id 17c8 ino 84d1d
(00.430375)    718: Error (soccr/soccr.c:501): Can't bind inet socket back (0.0.0.0): Cannot assign requested address
(00.430389)    718: Error (criu/files.c:1213): Unable to open fd=2053 id=0x17c8

What we observed is that the connection is already in the TCP_CLOSE state. So, instead of restoring it, can we simply close such connections on restore? Would that leave any side effects noticeable to the restored application?

diff --git a/criu/sk-inet.c b/criu/sk-inet.c
index db11cfb64..1fe52321e 100644
--- a/criu/sk-inet.c
+++ b/criu/sk-inet.c
@@ -859,6 +859,13 @@ static int open_inet_sk(struct file_desc *d, int *new_fd)
                }

                mutex_lock(&ii->port->reuseaddr_lock);
+               if (ie->state == TCP_CLOSE) {
+                       if (shutdown(sk, SHUT_RDWR) && errno != ENOTCONN) {
+                               pr_perror("Unable to shutdown the socket id %x ino %x", ii->ie->id, ii->ie->ino);
+                       } else {
+                               pr_info("TCP_CLOSE connection is forcefully closed\n");
+                       }
+               }
                if (restore_one_tcp(sk, ii)) {
                        mutex_unlock(&ii->port->reuseaddr_lock);
                        goto err;
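
For readers unfamiliar with the "Cannot assign requested address" message in the log above: it is the text for `EADDRNOTAVAIL`, which `bind()` commonly returns when the requested address is not configured on any local interface. The sketch below reproduces that errno in isolation. `try_bind_ipv4` is a hypothetical helper, not CRIU or libsoccr code, and `192.0.2.1` (from the TEST-NET-1 documentation range) is assumed not to be assigned to the host. It does not explain why the failing address is printed as 0.0.0.0 in this issue.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of CRIU): try to bind a TCP socket to the
 * given IPv4 address with an ephemeral port.
 * Returns 0 on success, or the errno value of the step that failed.
 */
int try_bind_ipv4(const char *addr_text)
{
    struct sockaddr_in addr;
    int sk, err = 0;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = 0; /* let the kernel choose a port */
    if (inet_pton(AF_INET, addr_text, &addr.sin_addr) != 1)
        return EINVAL;

    sk = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (sk < 0)
        return errno;

    /* Save errno before close() has a chance to overwrite it. */
    if (bind(sk, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        err = errno;

    close(sk);
    return err;
}
```

On a host with default settings (no `ip_nonlocal_bind`), calling `try_bind_ipv4("192.0.2.1")` is expected to return `EADDRNOTAVAIL`.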

CRIU logs and information:

Output of `criu --version`:

```
$ criu --version
Version: 3.17
GitID: v3.17-5-gea24496ac
```

Output of `criu check --all`:

```
Warn (criu/kerndat.c:1453): CRIU was built without libnftables support
Warn (criu/cr-check.c:1334): Nftables based locking requires libnftables and set concatenations support
Looks good but some kernel features are missing which, depending on your process tree, may cause dump or restore failure.
```

$ uname -r
5.15.0-1047-aws

rst0git commented 8 months ago

@aegiryy Have you tried using the --tcp-close option?

https://github.com/checkpoint-restore/criu/blob/711775f401c8a4078fb38444a87d440f3fb1cb96/criu/sk-tcp.c#L458-L459

aegiryy commented 8 months ago

@rst0git I am aware of this option, but my understanding is that it closes all TCP connections regardless of their state. In my scenario, I want to keep all the active TCP connections, because they all go through the loopback device (127.0.0.1). For the connections in the TCP_CLOSE state, we are fine with closing them upon restore, hence the question.

aegiryy commented 8 months ago

Based on the comment below, it seems to be a bug that we even try to call bind(...) when restoring a TCP connection in the TCP_CLOSE state:

https://github.com/checkpoint-restore/criu/blob/711775f401c8a4078fb38444a87d440f3fb1cb96/criu/sk-inet.c#L195-L197

https://github.com/checkpoint-restore/criu/blob/711775f401c8a4078fb38444a87d440f3fb1cb96/soccr/soccr.c#L498-L501

Can you confirm if we expect a closed connection to ever go through bind on restore?

avagin commented 7 months ago

> Can you confirm if we expect a closed connection to ever go through bind on restore?

Yes, it is expected. getsockname returns the bound address, and it works for closed sockets too, so we need to restore it.
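
The behavior described here can be checked outside CRIU with a minimal sketch like the one below. It binds a TCP socket without ever connecting it, confirms via `TCP_INFO` that the kernel reports the `TCP_CLOSE` state, and then reads the local address back with `getsockname()`. `closed_socket_bound_port` is a hypothetical helper written for illustration, not a CRIU function, and the sketch assumes Linux with glibc.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of CRIU): bind a TCP socket to addr_text
 * with an ephemeral port and never connect it.
 * Returns the bound port (host byte order) if the socket is in TCP_CLOSE
 * and getsockname() still reports the address we bound to, -1 otherwise.
 */
int closed_socket_bound_port(const char *addr_text)
{
    struct sockaddr_in want, got;
    socklen_t got_len = sizeof(got);
    struct tcp_info info;
    socklen_t info_len = sizeof(info);
    int port = -1;
    int sk;

    memset(&want, 0, sizeof(want));
    want.sin_family = AF_INET;
    want.sin_port = 0; /* let the kernel choose a port */
    if (inet_pton(AF_INET, addr_text, &want.sin_addr) != 1)
        return -1;

    sk = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (sk < 0)
        return -1;

    if (bind(sk, (struct sockaddr *)&want, sizeof(want)) < 0)
        goto done;

    /* A bound but never connected socket should be in TCP_CLOSE. */
    memset(&info, 0, sizeof(info));
    if (getsockopt(sk, IPPROTO_TCP, TCP_INFO, &info, &info_len) < 0)
        goto done;
    if (info.tcpi_state != TCP_CLOSE)
        goto done;

    /* The local address is still visible to the application. */
    memset(&got, 0, sizeof(got));
    if (getsockname(sk, (struct sockaddr *)&got, &got_len) < 0)
        goto done;
    if (got.sin_addr.s_addr == want.sin_addr.s_addr)
        port = ntohs(got.sin_port);

done:
    close(sk);
    return port;
}
```

Calling `closed_socket_bound_port("127.0.0.1")` on a host with the loopback interface up should return a positive port. That is the application-visible state a restore would lose if the bind step were skipped.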

(00.430332)    718: inet:       Restore: family AF_INET6   type SOCK_STREAM    proto IPPROTO_TCP      port 34622 state TCP_CLOSE        src_addr ::ffff:10.4.104.194
(00.430342)    718: tcp: Restoring TCP connection
(00.430347)    718: tcp: Restoring TCP connection id 17c8 ino 84d1d
(00.430375)    718: Error (soccr/soccr.c:501): Can't bind inet socket back (0.0.0.0): Cannot assign requested address
(00.430389)    718: Error (criu/files.c:1213): Unable to open fd=2053 id=0x17c8

Here we see an AF_INET6 socket, but the address 0.0.0.0 is IPv4. It may be a dual-stack socket. It is also the ANY address, which looks odd in this case.
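
As a side note on the dual-stack guess: the `src_addr ::ffff:10.4.104.194` in the log is written in the IPv4-mapped IPv6 form, which is how an AF_INET6 socket represents an IPv4 endpoint. The sketch below shows one way to recognize that form and extract the embedded IPv4 address. `v4mapped_to_ipv4` is a hypothetical helper for illustration only. It does not reflect how CRIU or libsoccr handles these addresses.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not part of CRIU): if text is an IPv4-mapped IPv6
 * address (::ffff:a.b.c.d), write the embedded IPv4 address in dotted
 * form to out.
 * Returns 1 if mapped, 0 if it is a valid but non-mapped IPv6 address,
 * -1 on parse or formatting errors.
 */
int v4mapped_to_ipv4(const char *text, char *out, socklen_t out_len)
{
    struct in6_addr a6;
    struct in_addr a4;

    if (inet_pton(AF_INET6, text, &a6) != 1)
        return -1;
    if (!IN6_IS_ADDR_V4MAPPED(&a6))
        return 0;

    /* The IPv4 address occupies the last 4 bytes of the 16-byte address. */
    memcpy(&a4, &a6.s6_addr[12], sizeof(a4));
    if (inet_ntop(AF_INET, &a4, out, out_len) == NULL)
        return -1;
    return 1;
}
```

With a buffer of `INET_ADDRSTRLEN` bytes, passing the address from the log yields the dotted IPv4 form, while a plain IPv6 address such as `::1` is reported as not mapped.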

avagin commented 7 months ago

@aegiryy do you have any clue how this socket has been created?
