Wosch96 opened 2 years ago
@Snorch any ideas what might be going on?
Just FYI: I'm running singularity-ce version 3.9.5, the newest one. Did the checkpointing work with older versions of singularity?
Cheers.
I am not aware that CRIU ever worked with singularity. No one spoke to us about singularity.
For crun and runc we have integration of CRIU directly in the container runtime. Checkpointing without the help of the container runtime is always challenging.
Yeah, I heard about that. I'm doing this as my own project to get CRIU running with singularity. The author of this issue tried the same and found a way to solve it. The error in my case is a little different and could be related to the newer version of singularity.
Error (criu/files-reg.c:1629): Can't lookup mount=39 for fd=-3 path=/usr/local/libexec/singularity/bin/starter
This looks like a file mapping on detached mount (known problem).
E.g. I can reproduce the same error with a simple bash script in Virtuozzo container:
CT-2b5b6c67-d666-4950-abd0-8c0ceca03d96 /# cat prepare_detached.sh
mount -t tmpfs tmpfs /mnt/
touch /mnt/test
setsid sleep 1000 &>/dev/null </mnt/test &
umount -l /mnt
CT-2b5b6c67-d666-4950-abd0-8c0ceca03d96 /# bash prepare_detached.sh
CT-2b5b6c67-d666-4950-abd0-8c0ceca03d96 /# logout
exited from CT 2b5b6c67-d666-4950-abd0-8c0ceca03d96
[root@silo ~]# vzctl suspend testct
Setting up checkpoint...
(00.396621) Error (criu/files-reg.c:2195): Can't lookup mount=638 sdev=174 for fd=0 path=/test
(00.396632) Error (criu/cr-dump.c:1868): Dump files (pid: 248003) failed with -1
(00.416616) Error (criu/cr-dump.c:2311): Dumping FAILED.
Failed to checkpoint the Container
All dump files and logs were saved to /vz/private/2b5b6c67-d666-4950-abd0-8c0ceca03d96/dump/Dump.fail
Checkpointing failed
So if the mount on which a process has an open file was lazy-umounted, criu just can't c/r this process unless the file is closed.
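One way to spot this condition before dumping is to compare each fd's mount id against the task's mountinfo; a minimal sketch, assuming the formats of /proc/PID/fdinfo and /proc/PID/mountinfo (PID is a placeholder, and the held-open fd is just an example):

```shell
# Sketch: list regular-file fds whose mount id no longer appears in the
# task's mountinfo, i.e. files held open on a lazy-umounted mount.
PID=${PID:-$$}                # placeholder: pid of the task you want to dump
DETACHED=""
exec 9< "/proc/$PID/status"   # example: hold one regular file open
for fd in /proc/$PID/fd/*; do
    n=${fd##*/}
    [ -f "$fd" ] || continue                    # skip pipes, sockets, ttys
    [ -e "/proc/$PID/fdinfo/$n" ] || continue   # fd may have gone away
    id=$(awk '$1 == "mnt_id:" { print $2 }' "/proc/$PID/fdinfo/$n")
    grep -q "^$id " "/proc/$PID/mountinfo" || DETACHED="$DETACHED fd$n(mnt:$id)"
done
echo "detached fds:${DETACHED:-none}"
```

Any fd this prints would trigger the same "Can't lookup mount" error on dump.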
upd:
Another option is that your file mapping can be external (a file outside of the container); in that case a proper --external file[] + --inherit-fd should be provided by the container environment.
@Snorch So I should be able to checkpoint the container with these two options? In my case there's not really a problem with a file, or am I wrong there? Can I use --external file[] + --inherit-fd to bypass the mount lookup problem? The other external mount map options do their job correctly.
Thank you for the help.
@Wosch96 please see examples and explanations about how and when --external file[] + --inherit-fd can be used to handle external files in this article: https://criu.org/Inheriting_FDs_on_restore
In simple words, a file is external if it was not opened/mmaped inside the container. This way criu can't find it inside the container and can't automatically restore it. In this situation the container manager, which should know all the files it puts into the container from the host, would be able to tell criu where to take this file from on the host on restore (via the mentioned options).
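In sketch form, assuming the workflow from the criu.org article (the pid, the example fd, and the elided arguments are placeholders; the mount id comes from /proc/PID/fdinfo/FD and the inode from stat):

```shell
# Sketch, not a verified recipe: mark the file external on dump, then hand
# it back on restore, per criu.org/Inheriting_FDs_on_restore.
PID=${PID:-$$}                 # placeholder: pid of the task to dump
exec 3< "/proc/$PID/status"    # example open fd; use the real fd number instead
FD=${FD:-3}
MNT_ID=$(awk '$1 == "mnt_id:" { print $2 }' "/proc/$PID/fdinfo/$FD")
INO=$(stat -L -c %i "/proc/$PID/fd/$FD")
echo "criu dump    -t $PID ... --external file[$MNT_ID:$INO]"
echo "criu restore ... --inherit-fd fd[$FD]:file[$MNT_ID:$INO]"
```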
In my environment, I don't have a container manager: I build a singularity container from scratch with a definition file and then try to checkpoint the application that I'm running inside this container. Therefore I'm not really sure how to use external files and --inherit-fd in my case. Could you clarify that? I read the article about inheriting FDs on restore, but it's my first time using CRIU, so I don't get the exact usage for my case.
In my case:
My dump file tells me about this file /usr/local/libexec/singularity/bin/starter, for which no mount lookup is possible.
So when dumping, I would use a command like this: --external file[39:inode]?
But I'm not sure how to find the mount id and the inode. I assume mount=39 in my dump.log tells me the mount id is 39?
When I restore the process, I should use something like this: --inherit-fd fd[0]:file[39:...]?
As I said before, I'm not really sure about the information that is needed to set the --inherit-fd.
I appreciate the help.
@Wosch96 If you need to dump only one application and don't need to dump the container itself, it can be easier to run criu from the container and avoid all these problems with mounts and external fds.
@avagin Yes, I already thought so, but for my project I want to test container dumping. So I would still like to try it that way, if only for research purposes. Any advice?
@Wosch96 I recommend to look at how C/R is implemented in runc: https://github.com/opencontainers/runc/blob/main/libcontainer/container_linux.go#L767-L1894
The most complicated part is how to handle external resources (mounts, file descriptors, etc.).
A little update from my side: I got the dump running. I solved it like in this issue. But I'm still having a problem when restoring.
After running this command:
strace -o strace.log -s 256 -f criu restore -o restore.log -v4 -D ./ --shell-job --root /home/node2/container/criu_checkpoints/criu_container_namespace --ext-mount-map /etc/resolv.conf:/home/node2/container/criu_checkpoints/criu_container_namespace/etc --ext-mount-map /etc/hosts:/home/node2/container/criu_checkpoints/criu_container_namespace/etc --ext-mount-map /etc/hostname:/home/node2/container/criu_checkpoints/criu_container_namespace/etc --ext-mount-map /var/tmp:/home/node2/container/criu_checkpoints/criu_container_namespace/var --ext-mount-map /tmp:/home/node2/container/criu_checkpoints/criu_container_namespace/tmp --ext-mount-map /root:/home/node2/container/criu_checkpoints/criu_container_namespace/root --ext-mount-map /etc/localtime:/home/node2/container/criu_checkpoints/criu_container_namespace/etc --ext-mount-map /sys:/home/node2/container/criu_checkpoints/criu_container_namespace/sys --ext-mount-map /proc:/home/node2/container/criu_checkpoints/criu_container_namespace/proc/ --ext-mount-map /dev:/home/node2/container/criu_checkpoints/criu_container_namespace/dev --ext-mount-map /dev/hugepages:/home/node2/container/criu_checkpoints/criu_container_namespace/dev --ext-mount-map /dev/mqueue:/home/node2/container/criu_checkpoints/criu_container_namespace/dev --ext-mount-map /dev/pts:/home/node2/container/criu_checkpoints/criu_container_namespace/dev --ext-mount-map /dev/shm:/home/node2/container/criu_checkpoints/criu_container_namespace/dev --ext-mount-map /etc/group:/home/node2/container/criu_checkpoints/criu_container_namespace/etc --ext-mount-map /etc/passwd:/home/node2/container/criu_checkpoints/criu_container_namespace/etc --ext-mount-map /home/node2:/home/node2/container/criu_checkpoints/criu_container_namespace/home --ext-mount-map /proc/sys/fs/binfmt_misc:/home/node2/container/criu_checkpoints/criu_container_namespace/proc --ext-mount-map /usr/share/zoneinfo/UTC:/home/node2/container/criu_checkpoints/criu_container_namespace/usr
The strace file shows this error.
1272 write(127, "(00.025032) 1272: Opening 0x00000000400000-0x00000000401000 0000000000000000 (41) vma\n", 88) = 88
1272 openat(120, "home/node2/container/matMult", O_RDONLY) = -1 ENOENT (No such file or directory)
1272 write(127, "(00.025092) 1272: Error (criu/files-reg.c:2143): Can't open file home/node2/container/matMult on restore: No such file or directory\n", 134) = 134
1272 write(127, "(00.025113) 1272: Error (criu/files-reg.c:2086): Can't open file home/node2/container/matMult: No such file or directory\n", 123) = 123
1272 write(127, "(00.025131) 1272: Error (criu/mem.c:1349): - Can't open vma\n", 63) = 63
I've fixed all the external mount problems, but I don't know where to put this file so that CRIU can find it. Should this also be solved with an external mount? The application file matMult is at /home/node2/container/matMult.
Should i copy it to the root path? /home/node2/container/criu_checkpoints/criu_container_namespace/home/node2/container/matMult
Logs: dump.log restore.log strace.log
From how I understand the dump.log, the original container had /home/node2 mounted at /home/node2 in the container. For the restore it seems you are mounting /home/node2/container/criu_checkpoints at /home/node2, and now it seems your binary is not there.
Should i copy it to the root path? /home/node2/container/criu_checkpoints/criu_container_namespace/home/node2/container/matMult
To fix the error it would be enough to just copy /home/node2/container/matMult to /home/node2/container/criu_checkpoints/node2/container/matMult ... (if I am not confused by your paths).
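The idea can be sketched with throwaway directories (the real paths, shown in the comments, are taken from this thread; an empty file stands in for the actual binary):

```shell
# Sketch: CRIU opens the recorded path relative to --root, so the binary
# must exist at $ROOT/<path-from-the-error>. Real paths from the thread:
#   ROOT=/home/node2/container/criu_checkpoints/criu_container_namespace
#   cp /home/node2/container/matMult "$ROOT/home/node2/container/matMult"
ROOT=$(mktemp -d)                    # throwaway stand-in for the restore --root
REL="home/node2/container/matMult"   # relative path from the restore error
mkdir -p "$ROOT/$(dirname "$REL")"
: > "$ROOT/$REL"                     # stand-in for copying the real binary
```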
@adrianreber Just to specify the dump command, I used this:
criu dump -o dump.log -v4 -t 1581 -D ./ --shell-job --ext-mount-map /etc/resolv.conf:/etc/resolv.conf --ext-mount-map /etc/hosts:/etc/hosts --ext-mount-map /etc/hostname:/etc/hostname --ext-mount-map /var/tmp:/var/tmp --ext-mount-map /tmp:/tmp --ext-mount-map /root:/root --ext-mount-map /etc/localtime:/etc/localtime --ext-mount-map /tmp:/tmp --ext-mount-map /sys:/sys --ext-mount-map /proc:/proc --ext-mount-map /dev:/dev --ext-mount-map /dev/hugepages:/dev/hugepages --ext-mount-map /dev/mqueue:/dev/mqueue --ext-mount-map /dev/pts:/dev/pts --ext-mount-map /dev/shm:/dev/shm --ext-mount-map /etc/group:/etc/group --ext-mount-map /etc/passwd:/etc/passwd --ext-mount-map /home/node2:/home/node2 --ext-mount-map /proc/sys/fs/binfmt_misc:/proc/sys/fs/binfmt_misc --ext-mount-map /usr/share/zoneinfo/UTC:/usr/share/zoneinfo/UTC
So the correct restore command would look something like this?
criu restore -o restore.log -v4 -D ./ --shell-job --root /home/node2/container --ext-mount-map /etc/resolv.conf:/criu_checkpoints/criu_container_namespace/etc --ext-mount-map /etc/hosts:/criu_checkpoints/criu_container_namespace/etc --ext-mount-map /etc/hostname:/criu_checkpoints/criu_container_namespace/etc --ext-mount-map /var/tmp:/criu_checkpoints/criu_container_namespace/var
For minimal purposes, of course. The other external mounts are still missing.
So the correct restore command would look something like this?
Does it work if you try it? :wink:
I am confused, but shouldn't you be using it during restore like this: --ext-mount-map /etc/resolv.conf:/etc/resolv.conf? I am only using it via --external, so maybe the older --ext-mount-map has different semantics. Not sure.
I would say you should tell CRIU to mount exactly the same directories at the same location during restore. I would expect that --ext-mount-map is the same during checkpoint and restore in your case, where the mountpoints in the container have exactly the same names as their external locations, assuming you actually want to mount the same directories and files as during checkpointing.
@adrianreber I tried your approach of setting the --ext-mount-map options the same as during dumping. First I shortened the dump command, as some of the external mounts were not needed.
criu dump -o dump.log -v4 -t 1268 -D ./ --shell-job --ext-mount-map /etc/group:/etc/group --ext-mount-map /etc/passwd:/etc/passwd --ext-mount-map /etc/resolv.conf:/etc/resolv.conf --ext-mount-map /var/tmp:/var/tmp --ext-mount-map /tmp:/tmp --ext-mount-map /home/node2:/home/node2 --ext-mount-map /proc/sys/fs/binfmt_misc:/proc/sys/fs/binfmt_misc --ext-mount-map /proc:/proc --ext-mount-map /etc/hosts:/etc/hosts --ext-mount-map /usr/share/zoneinfo/UTC:/usr/share/zoneinfo/UTC --ext-mount-map /dev/mqueue:/dev/mqueue --ext-mount-map /dev/hugepages:/dev/hugepages --ext-mount-map /dev/pts:/dev/pts --ext-mount-map /dev/shm:/dev/shm --ext-mount-map /dev:/dev
Then I used this restore command:
strace -o strace.log -s 256 -f criu restore -o restore.log -v4 -D ./ --shell-job --root ./ --ext-mount-map /etc/group:/etc/group --ext-mount-map /etc/passwd:/etc/passwd --ext-mount-map /etc/resolv.conf:/etc/resolv.conf --ext-mount-map /var/tmp:/var/tmp --ext-mount-map /tmp:/tmp --ext-mount-map /home/node2:/home/node2 --ext-mount-map /proc/sys/fs/binfmt_misc:/proc/sys/fs/binfmt_misc --ext-mount-map /proc:/proc --ext-mount-map /etc/hosts:/etc/hosts --ext-mount-map /usr/share/zoneinfo/UTC:/usr/share/zoneinfo/UTC --ext-mount-map /dev/mqueue:/dev/mqueue --ext-mount-map /dev/hugepages:/dev/hugepages --ext-mount-map /dev/pts:/dev/pts --ext-mount-map /dev/shm:/dev/shm --ext-mount-map /dev:/dev
I'm not sure about the root path. What should I use there? In this case the root path is /home/node2/container/criu_checkpoints; that's where I run the command. The container root path is /home/node2/container.
I'm getting this error in the strace:
1268 write(4, "(00.014934) 1268: mnt: \tMounting unsupported @/tmp/.criu.mntns.PvR0Tp/9-0000000000/usr/share/zoneinfo/UTC (0)\n", 112) = 112
1268 write(4, "(00.014952) 1268: mnt: \tBind /usr/share/zoneinfo/UTC to /tmp/.criu.mntns.PvR0Tp/9-0000000000/usr/share/zoneinfo/UTC\n", 118) = 118
1268 mount("/usr/share/zoneinfo/UTC", "/tmp/.criu.mntns.PvR0Tp/9-0000000000/usr/share/zoneinfo/UTC", NULL, MS_BIND, NULL) = -1 ENOENT (No such file or directory)
I set this external mount, so I can't explain this error.
Please show /proc/PID/mountinfo from one of the processes in the container.
Here it is:
205 152 0:44 / / rw,nodev,relatime unbindable - overlay overlay ro,seclabel,lowerdir=/usr/local/var/singularity/mnt/session/overlay-lowerdir:/usr/local/var/singularity/mnt/session/rootfs
209 205 0:5 / /dev rw,nosuid master:107 - devtmpfs devtmpfs rw,seclabel,size=495600k,nr_inodes=123900,mode=755
210 209 0:18 / /dev/shm rw,nosuid,nodev master:110 - tmpfs tmpfs rw,seclabel
211 209 0:12 / /dev/pts rw,nosuid,noexec,relatime master:113 - devpts devpts rw,seclabel,gid=5,mode=620,ptmxmode=000
212 209 0:36 / /dev/hugepages rw,relatime master:114 - hugetlbfs hugetlbfs rw,seclabel
213 209 0:14 / /dev/mqueue rw,relatime master:115 - mqueue mqueue rw,seclabel
214 205 253:0 /usr/share/zoneinfo/Europe/Berlin /usr/share/zoneinfo/UTC rw,nosuid,nodev,relatime master:104 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
215 205 253:0 /etc/hosts /etc/hosts rw,nosuid,nodev,relatime master:104 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
216 205 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:116 - proc proc rw
217 216 0:35 / /proc/sys/fs/binfmt_misc rw,relatime master:117 - autofs systemd-1 rw,fd=22,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=12388
218 205 0:17 / /sys rw,nosuid,nodev,relatime - sysfs sysfs rw,seclabel
220 205 253:0 /home/node2 /home/node2 rw,nosuid,nodev,relatime master:104 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
221 205 253:0 /tmp /tmp rw,nosuid,nodev,relatime master:104 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
222 221 0:40 / /tmp/.criu.mntns.bw8pi9 rw,relatime master:140 - tmpfs none rw,seclabel
223 221 0:41 / /tmp/.criu.mntns.fnVLvz rw,relatime master:141 - tmpfs none rw,seclabel
224 221 0:42 / /tmp/.criu.mntns.nNajIA rw,relatime master:142 - tmpfs none rw,seclabel
225 221 0:43 / /tmp/.criu.mntns.PvR0Tp rw,relatime master:143 - tmpfs none rw,seclabel
226 205 253:0 /var/tmp /var/tmp rw,nosuid,nodev,relatime master:104 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
227 205 0:39 /etc/resolv.conf /etc/resolv.conf rw,nosuid,relatime master:144 - tmpfs tmpfs rw,seclabel,size=16384k,uid=1000,gid=1000
228 205 0:39 /etc/passwd /etc/passwd rw,nosuid,relatime master:144 - tmpfs tmpfs rw,seclabel,size=16384k,uid=1000,gid=1000
229 205 0:39 /etc/group /etc/group rw,nosuid,relatime master:144 - tmpfs tmpfs rw,seclabel,size=16384k,uid=1000,gid=1000
This is the PID of the process I want to checkpoint inside the container.
Is this mount of /usr/share/zoneinfo/Europe/Berlin on /usr/share/zoneinfo/UTC the problem?
This is the PID of the process I want to checkpoint inside the container. Is this mount of /usr/share/zoneinfo/Europe/Berlin on /usr/share/zoneinfo/UTC the problem?
Could be. Try to mount /usr/share/zoneinfo/Europe/Berlin on /usr/share/zoneinfo/UTC as it was done during checkpointing.
I used this command with restore: --ext-mount-map /usr/share/zoneinfo/UTC:/usr/share/zoneinfo/Europe/Berlin
But still a problem:
1268 write(4, "(00.014159) 1268: mnt: \tBind /usr/share/zoneinfo/Europe/Berlin to /tmp/.criu.mntns.CND5wH/9-0000000000/usr/share/zoneinfo/UTC\n", 128) = 128
1268 mount("/usr/share/zoneinfo/Europe/Berlin", "/tmp/.criu.mntns.CND5wH/9-0000000000/usr/share/zoneinfo/UTC", NULL, MS_BIND, NULL) = -1 ENOENT (No such file or directory)
1268 write(4, "(00.014215) 1268: Error (criu/mount.c:2263): mnt: Can't mount at /tmp/.criu.mntns.CND5wH/9-0000000000/usr/share/zoneinfo/UTC: No such file or directory\n", 154) = 154
1268 statfs("/tmp/.criu.mntns.CND5wH/9-0000000000/usr/share/zoneinfo/UTC", 0x7ffe70c33b70) = -1 ENOENT (No such file or directory)
1268 write(4, "(00.014257) 1268: Error (criu/mount.c:2518): mnt: Unable to statfs /tmp/.criu.mntns.CND5wH/9-0000000000/usr/share/zoneinfo/UTC: No such file or directory\n", 156) = 156
This sounds like a problem I had to deal with a couple of times while integrating CRIU into container runtimes/engines.
Historically, container runtimes/engines always just create the destination mount point. In this case it sounds like /usr/share/zoneinfo/UTC does not exist and is probably created automatically by singularity during container create/run.
During restore you have to do the same thing that happens during create/run. If it is part of the container runtime/engine it can be solved: you now need to create /usr/share/zoneinfo/UTC before CRIU runs. But if it is a nested mount point it might be tricky. Not sure if action-scripts can help here.
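That pre-restore step can be sketched as follows, with a temp directory standing in for the real --root and the mount point list taken from the commands in this thread:

```shell
# Sketch: destination mount points must exist inside the restore root
# before criu restore tries to bind-mount onto them.
ROOT=${ROOT:-$(mktemp -d)}   # the directory passed to criu restore --root
# directory mount points
for d in sys proc dev tmp var/tmp home/node2; do
    mkdir -p "$ROOT/$d"
done
# file mount points: bind targets of files must themselves be files
for f in etc/resolv.conf etc/hosts usr/share/zoneinfo/UTC; do
    mkdir -p "$ROOT/$(dirname "$f")"
    : > "$ROOT/$f"
done
```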
You probably need to handle the root directory of your container during checkpoint. It seems to be an overlay directory:
205 152 0:44 / / rw,nodev,relatime unbindable - overlay overlay ro,seclabel,lowerdir=/usr/local/var/singularity/mnt/session/overlay-lowerdir:/usr/local/var/singularity/mnt/session/rootfs
Maybe you also need an --ext-mount-map for your root directory.
@adrianreber But /usr/share/zoneinfo/UTC already exists in the container, as the container works in user space. I call the CRIU dump after the container is set up and has /usr/share/zoneinfo/UTC.
I don't understand why /usr/share/zoneinfo/UTC can't be found; it exists in the container and on my host system.
What should I set as the root directory in this case?
My current understanding is that CRIU tries to mount all external mounts as they were previously. It fails to mount /usr/share/zoneinfo/UTC because the destination file usr/share/zoneinfo/UTC, relative to the container root, does not exist. You need to provide a root directory which has the mountpoint usr/share/zoneinfo/UTC. Right now I do not remember if this needs --root or the correct entry for --ext-mount-map; I would try --root first. Do you have a usr/share/zoneinfo/UTC at the location you specify with --root?
I gave my root directory as the root path.
strace -o strace.log -s 256 -f criu restore -o restore.log -v4 -D ./ --shell-job --root / --ext-mount-map /etc/group:/etc/group --ext-mount-map /etc/passwd:/etc/passwd --ext-mount-map /etc/resolv.conf:/etc/resolv.conf --ext-mount-map /var/tmp:/var/tmp --ext-mount-map /tmp:/tmp --ext-mount-map /home/node2:/home/node2 --ext-mount-map /proc/sys/fs/binfmt_misc:/proc/sys/fs/binfmt_misc --ext-mount-map /proc:/proc --ext-mount-map /etc/hosts:/etc/hosts --ext-mount-map /dev/mqueue:/dev/mqueue --ext-mount-map /dev/hugepages:/dev/hugepages --ext-mount-map /dev/pts:/dev/pts --ext-mount-map /dev/shm:/dev/shm --ext-mount-map /dev:/dev --ext-mount-map /etc/localtime:/etc/localtime
Now another error appears.
mount("sysfs", "/tmp/.criu.mntns.1Sf3Gw/9-0000000000/sys", "sysfs", MS_NOSUID|MS_NODEV|MS_RELATIME, "seclabel") = -1 EBUSY (Device or resource busy)
1300 write(4, "(00.018182) 1300: Error (criu/mount.c:1979): mnt: Unable to mount sysfs /tmp/.criu.mntns.1Sf3Gw/9-0000000000/sys (id=198): Device or resource busy\n", 149) = 149
1300 write(4, "(00.018203) 1300: Error (criu/mount.c:2044): mnt: Can't mount at /tmp/.criu.mntns.1Sf3Gw/9-0000000000/sys: Device or resource busy\n", 133) = 133
1300 write(4, "(00.018221) 1300: mnt: Start with 0:/tmp/.criu.mntns.1Sf3Gw\n", 62) = 62
1300 umount2("/tmp/cr-tmpfs.JncB1G", MNT_DETACH) = 0
I don't know how to handle sysfs... any ideas? Also an external mapping?
I hope there is an end in sight.
Giving your host's root as --root does not sound correct. EBUSY happens if there is already a /sys mounted, which is true if you use your host's root as --root. This sounds potentially dangerous.
From the original messages it seems like Singularity uses overlayfs for the root file system of the container. You should use that.
Anyway, as we told you at the beginning of this issue, you should try to integrate this into Singularity and not do it manually, because now you need to recreate the steps that Singularity does to create the container's root file system.
You can just use a random directory, copy the content from the overlayfs to that directory, and use it as --root.
-D / looks wrong. You need to specify the directory where the checkpoint files are.
I tried another restore command with the overlay as the root directory and the external command: strace -o strace.log -s 256 -f criu restore -o restore.log -v4 -D ./ --shell-job --root /usr/local/var/singularity/mnt/session/final --external mnt[/etc/group]:/usr/local/var/singularity/mnt/session/final/etc --external mnt[/etc/passwd]:/usr/local/var/singularity/mnt/session/final/etc --external mnt[/etc/resolv.conf]:/usr/local/var/singularity/mnt/session/final/etc --external mnt[/var/tmp]:/usr/local/var/singularity/mnt/session/final/var --external mnt[/tmp]:/usr/local/var/singularity/mnt/session/final/tmp --external mnt[/home/node2]:/usr/local/var/singularity/mnt/session/final/home --external mnt[/proc/sys/fs/binfmt_misc]:/usr/local/var/singularity/mnt/session/final/proc --external mnt[/proc]:/usr/local/var/singularity/mnt/session/final/proc --external mnt[/etc/hosts]:/usr/local/var/singularity/mnt/session/final/etc --external mnt[/dev/mqueue]:/usr/local/var/singularity/mnt/session/final/dev --external mnt[/dev/hugepages]:/usr/local/var/singularity/mnt/session/final/dev --external mnt[/dev/pts]:/usr/local/var/singularity/mnt/session/final/dev --external mnt[/dev/shm]:/usr/local/var/singularity/mnt/session/final/dev --external mnt[/dev]:/usr/local/var/singularity/mnt/session/final/dev --external mnt[/etc/localtime]:/usr/local/var/singularity/mnt/session/final/etc
Sorry, the -D ./ was there; I had pasted the wrong command.
I'm running again into the sysfs mnt problem.
write(4, "(00.015814) 1278: mnt: \tMounting sysfs @/tmp/.criu.mntns.66aMK9/9-0000000000/sys (0)\n", 87) = 87
1278 mount("sysfs", "/tmp/.criu.mntns.66aMK9/9-0000000000/sys", "sysfs", MS_NOSUID|MS_NODEV|MS_RELATIME, "seclabel") = -1 ENOENT (No such file or directory)
1278 write(4, "(00.015901) 1278: Error (criu/mount.c:1979): mnt: Unable to mount sysfs /tmp/.criu.mntns.66aMK9/9-0000000000/sys (id=198): No such file or directory\n", 151) = 151
1278 write(4, "(00.015921) 1278: Error (criu/mount.c:2044): mnt: Can't mount at /tmp/.criu.mntns.66aMK9/9-0000000000/sys: No such file or directory\n", 135) = 135
1278 write(4, "(00.015939) 1278: mnt: Start with 0:/tmp/.criu.mntns.66aMK9\n", 62) = 62
Infos: restore.log strace.log
Does /usr/local/var/singularity/mnt/session/final have a sys directory?
The creation of the sys folder fixed it. Now the hopefully last problem...
1278 mmap(NULL, 8520, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fde08e0c000
1278 munmap(0x7fde08e0c000, 8520) = 0
1278 write(127, "(00.030746) 1278: Error (criu/files-reg.c:2104): File home/node2/container/matMult has bad mode 0100755 (expect 0100775)\n", 123) = 123
1278 write(127, "(00.030765) 1278: Error (criu/mem.c:1349): - Can't open vma\n", 63) = 63
The matMult is a compiled application.
A chmod should solve that.
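For reference, the dump recorded mode 0775 while the file on disk had 0755, so something like this (a temp file stands in for the real binary path from the error):

```shell
# Sketch: restore the mode CRIU recorded at dump time (0775 per the error).
F=${F:-$(mktemp)}   # stand-in for /home/node2/container/matMult
chmod 0775 "$F"
stat -c %a "$F"     # → 775
```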
Solved, but again an error appears.
(00.028160) Error (criu/cr-restore.c:1931): Can't attach to 1278: Operation not permitted
(00.028222) pie: 1278: seccomp: mode 0 on tid 1278
(00.028999) Error (criu/cr-restore.c:1986): Can't interrupt the 1278 task: No such process
(00.029019) Error (criu/cr-restore.c:2372): Can't catch all tasks
(00.029036) Error (criu/cr-restore.c:2420): Killing processes because of failure on restore. The Network was unlocked so some data or a connection may have been lost.
(00.029867) Error (criu/mount.c:3385): mnt: Can't remove the directory /tmp/.criu.mntns.0XDuDp: No such file or directory
(00.029904) Error (criu/cr-restore.c:2447): Restoring FAILED.
Before that error, it seems that ptrace is the problem.
1406 ptrace(PTRACE_SEIZE, 1398, NULL, 0) = -1 EPERM (Operation not permitted)
1406 write(4, "(00.033756) Error (criu/cr-restore.c:1931): Can't attach to 1398 : Operation not permitted\n", 90) = 90
Try it without strace. strace is already attached to the processes via ptrace, so CRIU cannot PTRACE_SEIZE them and gets EPERM.
There we go. It works 😄. It was a bit of a no-brainer there at the end.
@adrianreber Incredible, thank you for your help. Sorry for the time you had to spend on me... 😅
I checkpointed a singularity container that runs a little program. Nothing complex with MPI or anything. As info for any readers.
There we go. It works 😄. It was a bit of a no-brainer there at the end.
Nice.
@adrianreber Incredible, thank you for your help. Sorry for the time you had to spend on me... 😅
We were making progress with each step so it always felt like it might work in the end.
I checkpointed a singularity container that runs a little program. Nothing complex with MPI or anything. As an info for any readers.
and you restored it, right? Would you be willing to document the commands used in our wiki (criu.org). How you started the container, how you checkpointed and how you restored it. Maybe easier to find than buried here in the ticket. If you have a chance to document it that would be great.
and you restored it, right? Would you be willing to document the commands used in our wiki (criu.org). How you started the container, how you checkpointed and how you restored it. Maybe easier to find than buried here in the ticket. If you have a chance to document it that would be great.
After your time-consuming help, of course I can document that. 😄
Should I obtain a user account to edit the wiki?
Or is it maintained with github?
Yes, please create an account. I am not sure who has to approve it, but so far it usually happens fast.
@Snorch do you know who needs to approve wiki accounts?
do you know who needs to approve wiki accounts?
@kolyshkin was doing that at the time I registered. But I'm a bit unsure.
A friendly reminder that this issue had no activity for 30 days.
@Wosch96 I found this while attempting to get CRIU to dump/restore a singularity container. Were you ever able to write up the commands used on the wiki? I searched but couldn't find anything.
If you still have any record of how this was done, perhaps you could post it here?
Hello guys,
I'm trying to checkpoint and restore a singularity container with criu. I get an error when dumping the container and maybe you could help me out. I'm running criu with the following command when trying to dump the container.
criu dump -o dump.log -v4 -t 7209 -D ./ --ext-mount-map /etc/resolv.conf:/etc/resolv.conf --ext-mount-map /etc/hosts:/etc/hosts --ext-mount-map /etc/hostname:/etc/hostname --ext-mount-map /var/tmp:/var/tmp --ext-mount-map /tmp:/tmp --ext-mount-map /root:/root --ext-mount-map /etc/localtime:/etc/localtime --ext-mount-map /tmp:/tmp --ext-mount-map /sys:/sys --ext-mount-map /proc:/proc --ext-mount-map /dev:/dev --ext-mount-map /dev/hugepages:/dev/hugepages --ext-mount-map /dev/mqueue:/dev/mqueue --ext-mount-map /dev/pts:/dev/pts --ext-mount-map /dev/shm:/dev/shm
This is the error that occurs in the dump.log.
(00.002619) Error (criu/files-reg.c:1629): Can't lookup mount=39 for fd=-3 path=/usr/local/libexec/singularity/bin/starter
(00.002629) Error (criu/cr-dump.c:1262): Collect mappings (pid: 7209) failed with -1
Here is the whole dump.log.
Thank you in advance.