@muvaf Any luck here? Running into the same issue
@muvaf could you show /proc/pid/maps for the target process?
I think it hits the MAX_RW_COUNT (0x7ffff000) limit. The length of the target vma is 0x104fff3c000. CRIU reads 8 bytes per page, so it is 0x827ff9e0 bytes.
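The arithmetic above can be checked with a short sketch. It assumes 4 KiB pages and takes the VMA bounds from the `filling VMA 738000c4000-83d00000000` log line in the issue description. The helper names `pagemap_bytes` and `reads_needed` are made up for illustration and are not CRIU functions.

```c
#include <assert.h>
#include <stdint.h>

/* Linux caps a single read()/write() at MAX_RW_COUNT = INT_MAX & PAGE_MASK,
 * which is 0x7ffff000 with 4 KiB pages (assumed here). */
#define MAX_RW_COUNT_4K 0x7ffff000ULL
/* /proc/<pid>/pagemap holds one 64-bit entry per virtual page. */
#define PAGEMAP_ENTRY_SIZE 8ULL

/* Bytes of pagemap data that describe the VMA [start, end). */
static uint64_t pagemap_bytes(uint64_t start, uint64_t end, uint64_t page_size)
{
	assert(page_size != 0 && end >= start);
	return (end - start) / page_size * PAGEMAP_ENTRY_SIZE;
}

/* Number of read() calls needed if each call returns at most MAX_RW_COUNT. */
static uint64_t reads_needed(uint64_t bytes)
{
	return (bytes + MAX_RW_COUNT_4K - 1) / MAX_RW_COUNT_4K;
}
```

For this VMA, `pagemap_bytes(0x738000c4000, 0x83d00000000, 4096)` gives 0x827ff9e0, which is larger than 0x7ffff000, so a single `read()` cannot return the whole range and at least two reads are needed.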
@muvaf could you try out https://github.com/avagin/criu/commit/9405da090c93ef100e3fb0c7da2646cdb9e27fc1? It should fix the problem.
By the way, for such large dummy mappings, the pagemap file interface works slowly. Recently, the new PAGEMAP_SCAN ioctl was merged into the mainline kernel, and its support was implemented in CRIU (https://github.com/checkpoint-restore/criu/pull/2292). With these changes, CRIU handles huge dummy mappings much faster.
@avagin That patch did make the error go away and I was able to take the checkpoint. Thank you! However, I wasn't able to validate that the checkpoint was taken correctly. The restore command fails with the following error even though `--tcp-close` is used:
criu restore --images-dir /checkpoint --tcp-established --file-locks --evasive-devices --tcp-close --manage-cgroups=ignore -v4 --log-file restore.log --inherit-fd fd[1]:pipe:[1687037] --inherit-fd fd[2]:pipe:[1687038] --external mnt[zoneinfo]:/usr/share/zoneinfo --external mnt[null]:/dev/null --external mnt[random]:/dev/random --external mnt[urandom]:/dev/urandom --external mnt[tty]:/dev/tty --external mnt[zero]:/dev/zero --external mnt[full]:/dev/full
Error (criu/sk-inet.c:1029): inet: Can't bind inet socket (id 778): Cannot assign requested address
@lukejmann I needed to add `--file-locks` and make sure that some of the folders Chrome creates under `/tmp` are available on the target as well. To build @avagin's patch, you need to clone the repository, run `make docker-build`, and copy the `/criu/criu/criu` file from that image, provided the dependencies listed here are in place where you run the commands.
@muvaf Well, they are udp sockets :)
(00.289251) 9344: inet: Restore: family AF_INET type SOCK_DGRAM proto IPPROTO_UDP port 56915 state TCP_ESTABLISHED src_addr 10.140.3.135
It is a good question what we should do with them... @muvaf What behavior do you expect?
@avagin Huh, I didn't realize that. I think, for starters, having a `--udp-close` flag similar to the TCP one would unlock some of the use cases where both sides of the socket are designed to be resilient to reconnections.
Going further may not be feasible because of the same issue TCP has with IP address changes, so at least we'd give users an escape hatch if they really have to change the IP address.
@avagin FWIW, if you can give me a pointer, I can try to get a PR going to add the `--udp-close` flag.
@muvaf I am skeptical about the idea of "--udp-close." There is a significant difference between TCP and UDP. TCP is connection-oriented, and the situation where a connection is interrupted is entirely normal and must be handled in the code. UDP, on the other hand, is connectionless. Therefore, applications may be caught off guard if "send" or "recv" return errors.
You can try out the following patch to see how your workload handles closed UDP sockets after restore:
diff --git a/criu/sk-inet.c b/criu/sk-inet.c
index a6a767c73..eda08f971 100644
--- a/criu/sk-inet.c
+++ b/criu/sk-inet.c
@@ -901,6 +901,13 @@ static int open_inet_sk(struct file_desc *d, int *new_fd)
goto done;
}
+ if (ie->proto == IPPROTO_UDP) {
+ if (shutdown(fd, SHUT_RDWR) && errno != ENOTCONN) {
+ pr_perror("Unable to shutdown the socket id %x ino %x", ii->ie->id, ii->ie->ino);
+ }
+ goto done2;
+ }
+
if (ie->src_port) {
if (inet_bind(sk, ii))
goto err;
@@ -952,7 +959,7 @@ done:
}
}
}
-
+done2:
*new_fd = sk;
return 1;
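To see what `shutdown()` does to a UDP socket outside of CRIU, a small standalone sketch may help. It assumes Linux and a working loopback interface. The helper `udp_shutdown` and the peer port 9 are arbitrary choices for illustration, and no datagram is actually sent.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Creates a UDP socket, optionally connects it to 127.0.0.1:9, and calls
 * shutdown(SHUT_RDWR) on it. The errno of shutdown() (0 on success) is
 * stored in *shutdown_errno. Returns 0 if the socket could be set up,
 * -1 if socket() or connect() failed.
 */
static int udp_shutdown(int do_connect, int *shutdown_errno)
{
	struct sockaddr_in peer;
	int sk;

	assert(shutdown_errno != NULL);
	*shutdown_errno = 0;

	sk = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
	if (sk < 0)
		return -1;

	if (do_connect) {
		memset(&peer, 0, sizeof(peer));
		peer.sin_family = AF_INET;
		peer.sin_port = htons(9);
		peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
		if (connect(sk, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
			close(sk);
			return -1;
		}
	}

	if (shutdown(sk, SHUT_RDWR) < 0)
		*shutdown_errno = errno;

	close(sk);
	return 0;
}
```

On Linux, the unconnected case is expected to report `ENOTCONN`, which matches the errno the patch tolerates, while the connected case is expected to succeed. How an application reacts to later `send` or `recv` calls on such a socket is the open question raised above and is not covered by this sketch.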
For connected UDP sockets, it might be a good idea to skip binding to the local address. When CRIU calls "connect" to restore the destination address and port, the kernel binds the socket to a suitable source address and a "random" port. I believe this should work in many cases. Could you please try the following patch, which implements this behavior?
diff --git a/criu/sk-inet.c b/criu/sk-inet.c
index a6a767c73..9bb1d04d4 100644
--- a/criu/sk-inet.c
+++ b/criu/sk-inet.c
@@ -900,8 +900,7 @@ static int open_inet_sk(struct file_desc *d, int *new_fd)
goto done;
}
-
- if (ie->src_port) {
+ if (ie->proto != IPPROTO_UDP && ie->src_port) {
if (inet_bind(sk, ii))
goto err;
}
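The implicit bind that this patch relies on can be observed with a standalone sketch. It assumes Linux and a working loopback interface. The helper `udp_autobind` is hypothetical and not part of CRIU, and the peer is always 127.0.0.1 for simplicity.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Connects an unbound UDP socket to 127.0.0.1:<peer_port> and stores the
 * local address chosen by the kernel in *local.
 * Returns 0 on success, -1 if any system call failed.
 */
static int udp_autobind(uint16_t peer_port, struct sockaddr_in *local)
{
	struct sockaddr_in peer;
	socklen_t len = sizeof(*local);
	int sk, ret = -1;

	assert(local != NULL);

	sk = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
	if (sk < 0)
		return -1;

	memset(&peer, 0, sizeof(peer));
	peer.sin_family = AF_INET;
	peer.sin_port = htons(peer_port);
	peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	/* No bind() call: connect() makes the kernel pick the source address
	 * and an ephemeral source port. */
	if (connect(sk, (struct sockaddr *)&peer, sizeof(peer)) == 0 &&
	    getsockname(sk, (struct sockaddr *)local, &len) == 0)
		ret = 0;

	close(sk);
	return ret;
}
```

After a successful call, `local->sin_port` holds a nonzero ephemeral port rather than the port recorded at dump time (56915 in the log above). A peer that identifies the sender by its source port would therefore see the restored process as a new sender.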
Description
Taking a checkpoint of `chrome` fails with the following error:
filling VMA 738000c4000-83d00000000 (1094712560K)
`1094712560K` sounds too big?

Steps to reproduce the issue:
The dump happens inside a Kubernetes pod and the image is proprietary. I can create a new image if the problem turns out not to be straightforward and requires a full reproduction to debug.
I used the following command to dump:
The process tree was like the following:
`crit` commands like `crit x . fds` also fail because the dump is not complete.

Describe the results you received:
Got error during dump:
Describe the results you expected:
Expected success.
Additional information you deem important (e.g. issue happens only occasionally):
Happens consistently.
CRIU logs and information:
dump.log
Output of `criu --version`:
Version: 3.19 (gitid 0)
Output of `criu check --all`:
```
Warn (criu/kerndat.c:1285): Can't keep kdat cache on non-tempfs
Error (criu/cr-check.c:1223): UFFD is not supported
Error (criu/cr-check.c:1223): UFFD is not supported
Looks good but some kernel features are missing which, depending on your process tree, may cause dump or restore failure.
```
Additional environment details:
It's running inside a Kubernetes pod where the container runtime is containerd and the node architecture is amd64.