checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

Is restore of established socket possible after `leave-running` closed it? #1120

Closed acidghost closed 4 years ago

acidghost commented 4 years ago

I have the following scenario:

The last part is giving me trouble. Restoring the client socket works fine, but the CRIU RPC reply to my restore request for the server gives me the following errors:

(00.001127)   2232: Error (soccr/soccr.c:529): Can't connect inet socket back: Cannot assign requested address
(00.001137)   2232: Error (criu/files.c:1211): Unable to open fd=4 id=0x14

Full CRIU restore log: server_restore.log

TCP packet capture: criu-pcap.tar.gz

Is what I'm trying to do even possible? If so, what am I doing wrong?

Please let me know if you need clarification or further info.

adrianreber commented 4 years ago

Bit confused by your wording. When you write "I fork-exec into an FTP server", what does that mean?

Looking at the packet capture it seems your server is running on port 2200, right? You only use 127.0.0.1 for your test, right?

Not an expert when it comes to TCP connections, but are the involved ports free when you run the restore?

acidghost commented 4 years ago

> Bit confused by your wording. When you write "I fork-exec into an FTP server", what does that mean?

From my process, I spawn a new process which is an FTP server. I'm coding in Rust, which I think uses the fork-exec model to spawn a new process with Command::spawn.
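Roughly, the spawn looks like this (just a sketch; the binary name and flags are placeholders, not my actual server invocation):

use std::process::{Child, Command};

// Sketch of the fork-exec described above: Command::spawn() forks and execs
// the child without waiting for it to finish.
fn spawn_ftp_server() -> std::io::Result<Child> {
    Command::new("ftp-server")                    // hypothetical server binary
        .args(["--listen", "127.0.0.1:2200"])     // assumed flag for the port used in my test
        .spawn()                                  // fork-exec; returns a handle to the child
}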

> Looking at the packet capture it seems your server is running on port 2200, right? You only use 127.0.0.1 for your test, right?

Yes, running on 2200 and localhost.

> Not an expert when it comes to TCP connections, but are the involved ports free when you run the restore?

❯ netstat | grep 2200
tcp        0      1 localhost:2200          localhost:37514         FIN_WAIT1
tcp        0      1 localhost:37514         localhost:2200          FIN_WAIT1

Indeed, it looks like the sockets are still there in FIN_WAIT1 state. Although when I try running my test program in succession, it runs fine and is able to open connections on port 2200 (albeit unable to restore). Here's netstat output after I run my program 3 times:

❯ netstat | grep 2200
tcp        0      1 localhost:2200          localhost:37534         FIN_WAIT1
tcp        0      1 localhost:2200          localhost:37540         FIN_WAIT1
tcp        0      1 localhost:2200          localhost:37536         FIN_WAIT1
tcp        0      1 localhost:37536         localhost:2200          FIN_WAIT1
tcp        0      1 localhost:37540         localhost:2200          FIN_WAIT1
tcp        0      1 localhost:37534         localhost:2200          FIN_WAIT1
adrianreber commented 4 years ago

Can you restore it after the FIN_WAIT1 state is gone?

acidghost commented 4 years ago

I could try to restore the server, but the client socket is only in memory.

First restore attempt: restore.log. Second attempt: restore2.log.

I think it repaired the socket successfully, but failed at the end when unlocking the network via iptables. Nonetheless, the server is running again: I can connect with netcat and see the open port with netstat.

What can I do to release the socket as soon as I dump the last image (w/o leave-running)? The FIN_WAIT1 state persists for 10-30 seconds, and I cannot just wait for it to pass because my application is performance-critical.

adrianreber commented 4 years ago

The iptables failure should not be a real problem. It is reported, but as far as I know it should not be fatal.

> What can I do to release the socket as soon as I dump the last image (w/o leave-running)? The FIN_WAIT1 state persists for 10-30 seconds, and I cannot just wait for it to pass because my application is performance-critical.

I cannot answer that. Maybe someone else has an idea.

avagin commented 4 years ago

> What can I do to release the socket as soon as I dump the last image (w/o leave-running)? The FIN_WAIT1 state persists for 10-30 seconds, and I cannot just wait for it to pass because my application is performance-critical.

You can restore it in a new network namespace. "unshare -n" or "ip netns" can be used to create one.
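For example, something along these lines (only an illustration using the CLI; the image directory is a placeholder, and in your case the restore is driven over RPC instead):

use std::process::{Command, ExitStatus};

// Sketch of the suggestion above: run the restore inside a fresh network
// namespace so the leftover FIN_WAIT1 sockets in the original namespace
// cannot clash with the restored connection.
fn restore_in_new_netns(images_dir: &str) -> std::io::Result<ExitStatus> {
    Command::new("unshare")
        .args(["--net", "--", "criu", "restore", "-D", images_dir, "--tcp-established"])
        .status()                                 // run criu and wait for it to exit
}

Note that a fresh namespace starts with the loopback interface down, so you may need something like "ip link set lo up" inside it before a 127.0.0.1 connection can be restored.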

acidghost commented 4 years ago

I had some operations in the wrong order, but mostly I had to wait a bit before closing the connection, to give the kernel time to free the sockets properly before restoring the server.
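For reference, a simple way to absorb that delay is to retry the restore with a short sleep (restore_server() below is just a placeholder for my CRIU RPC restore request, not the exact code I ended up with):

use std::{thread, time::Duration};

// Rough sketch of the "wait a bit" part: retry the restore with a short
// delay until the kernel has released the old sockets.
fn restore_with_retry(mut restore_server: impl FnMut() -> bool) -> bool {
    for _ in 0..50 {
        if restore_server() {
            return true;                          // restore succeeded
        }
        thread::sleep(Duration::from_millis(100)); // give the kernel time to free the sockets
    }
    false                                         // still failing after ~5 seconds
}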