Open ezerk opened 4 days ago
Driver Version: 550.127.05
@ezerk Would you be able to update your driver version to 555 or 560?
The following readme file provides more information: https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/cuda
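A quick way to check whether a host driver meets the version suggested above is to compare its major version. This is only a sketch: the helper name `driver_at_least` is made up for illustration, and the threshold of 555 is taken from the suggestion in this thread, not from an authoritative compatibility table.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the driver's major version is >= the minimum.
# Usage: driver_at_least <driver_version> <min_major>
driver_at_least() {
    local version="$1" min_major="$2"
    local major="${version%%.*}"   # "550.127.05" -> "550"
    [ "$major" -ge "$min_major" ]
}

# Example (requires nvidia-smi on the host):
#   ver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
#   driver_at_least "$ver" 555 && echo "driver OK" || echo "driver too old"
```

With the versions mentioned in this issue, `550.127.05` would be reported as too old, while `555.42.06` and `560.35.03` would pass.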
@rst0git - thanks for the super fast reply, much appreciated! I will give it a try and update.
I upgraded the driver on the host twice, to 555 and then to 560, and I am still getting the same error (tried with both criu v3.19 and v4).
See driver details:
nvidia-driver.x86_64 3:555.42.06-1.el9 @cuda-rhel9-x86_64
nvidia-driver.x86_64 3:560.35.03-1.el9 @cuda-rhel9-x86_64
(00.018191) Error (compel/src/lib/infect.c:713): Unable to connect a transport socket: Permission denied
(00.018200) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018202) Error (compel/src/lib/ptrace.c:96): Can't poke 192578 @ 0x55887f472000 from 0x7ffc2ad59a58 sized 8
(00.018204) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018206) Error (compel/src/lib/ptrace.c:100): Can't restore the original data with poke
(00.018207) Error (compel/src/lib/infect.c:637): Can't inject syscall blob (pid: 192578)
(00.018209) Warn (criu/parasite-syscall.c:439): Can't cure failed infection
The error above is unrelated. It occurs because SELinux is enabled and prevents the CRIU parasite code from writing to the log file descriptor.
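To check whether a given dump log shows this SELinux-related symptom, something like the following could be used. It is a rough sketch: `selinux_mode` and `has_tsock_denial` are hypothetical helper names, and the match string is simply the compel error message quoted above.

```bash
#!/usr/bin/env bash
# Print the SELinux mode, or "unavailable" if getenforce is not installed.
selinux_mode() {
    if command -v getenforce >/dev/null 2>&1; then
        getenforce
    else
        echo "unavailable"
    fi
}

# Hypothetical helper: does the log file contain the compel
# "transport socket ... Permission denied" error quoted above?
has_tsock_denial() {
    grep -q 'Unable to connect a transport socket: Permission denied' "$1"
}

# Example:
#   selinux_mode
#   has_tsock_denial dump.log && echo "looks like the SELinux-related error"
# If the audit tools are installed, recent denials can also be listed with:
#   sudo ausearch -m avc -ts recent
```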
trying to use the underlying criu command directly
I would not recommend this approach because specifying all required CRIU options would be very challenging.
@ezerk Would you be able to try the following with CRIU v4.0 and CUDA plugin?
sudo mkdir -p /etc/criu
echo -e "tcp-established\nghost-limit=100M\ntimeout=300" | sudo tee /etc/criu/runc.conf
sudo sed -i 's/#runtime = "crun"/runtime = "runc"/' /usr/share/containers/containers.conf
sudo podman run -d --name cuda-counter --device nvidia.com/gpu=all --security-opt=label=disable \
    quay.io/radostin/cuda-counter
sudo podman logs -l
sudo podman container checkpoint -l -e /tmp/test.tar
sudo podman rm -f cuda-counter
sudo podman container restore -i /tmp/test.tar
sudo podman logs -l
It worked!
Changing /usr/share/containers/containers.conf to set runtime = "runc" was required.
The flag --security-opt=label=disable on podman run is also mandatory.
The changes to /etc/criu/runc.conf do not seem to be mandatory.
I will update later about my experience with more complex applications. Many thanks.
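To confirm which runtime a containers.conf file selects after an edit like the `sed` command above, a small check along these lines may help. This is a sketch only: `configured_runtime` is a hypothetical helper, and it inspects a single file, whereas podman may merge several configuration files (`podman info` reports the runtime actually in effect).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of an uncommented `runtime = "..."`
# line in a containers.conf-style file (the last match is printed).
configured_runtime() {
    sed -n 's/^[[:space:]]*runtime[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | tail -n 1
}

# Example:
#   configured_runtime /usr/share/containers/containers.conf
```

A commented-out `#runtime = "crun"` line produces no output, so an empty result means the file does not set the runtime explicitly.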
Hi, my end goal is to checkpoint containers that use a GPU during the CI phase and later deploy them using Kubernetes in order to reduce pod warm-up time.
So far I have experimented locally and was able to dump a GPU process (even a relatively complex one, using the workaround suggested in https://github.com/NVIDIA/cuda-checkpoint/issues/4),
but when I run the app as a container, the dump fails at an early stage.
Steps to reproduce:
1. Follow the provided example code in this repo, counter.cu, and wrap it in a container using this Dockerfile. Note: I'm using podman since it provides the option to checkpoint to an image, which I hope to use later on.
2. From the src folder, build the image: `sudo podman build -t counter .`
3. Run the container: `sudo podman run -p 10000:10000/udp --gpus=all --name=counter counter`
4. `nvidia-smi` (on the host machine) shows the PID as expected: PID: 192578 /app/counter
5. `sudo cuda-checkpoint --toggle --pid $PID` completes successfully.
6. `nvidia-smi` shows that the process is offloaded from the GPU; output: `No running processes found`, as expected.
7. `sudo podman container checkpoint counter --create-image=counter_chechpoint` fails (using other flags provided by podman checkpoint resulted in the same error):

(00.007387) Error (criu/mount.c:757): mnt: 637:./usr/lib/firmware/nvidia/550.127.05/gsp_tu10x.bin doesn't have a proper root mount
(00.007403) net: Unlock network
(00.007407) Running network-unlock scripts
(00.024091) Unfreezing tasks into 1
(00.024103) Unseizing 192578 into 1
(00.024161) Error (criu/cr-dump.c:2111): Dumping FAILED.
(00.017847) Putting tsock into pid 192578
(00.018191) Error (compel/src/lib/infect.c:713): Unable to connect a transport socket: Permission denied
(00.018200) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018202) Error (compel/src/lib/ptrace.c:96): Can't poke 192578 @ 0x55887f472000 from 0x7ffc2ad59a58 sized 8
(00.018204) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018206) Error (compel/src/lib/ptrace.c:100): Can't restore the original data with poke
(00.018207) Error (compel/src/lib/infect.c:637): Can't inject syscall blob (pid: 192578)
(00.018209) Warn (criu/parasite-syscall.c:439): Can't cure failed infection
(00.018215) Error (criu/cr-dump.c:1610): Can't infect (pid: 192578) with parasite
(00.018276) net: Unlock network
(00.018279) Running network-unlock scripts
(00.034182) Unfreezing tasks into 1
(00.034194) Unseizing 192578 into 1
(00.034200) Error (compel/src/lib/infect.c:418): Unable to detach from 192578: No such process
(00.034239) Error (criu/cr-dump.c:2111): Dumping FAILED.
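The PID lookup and toggle steps from the reproduction above could be scripted roughly as follows. This is a sketch under assumptions: `gpu_pid_for` is a hypothetical helper, and it expects `pid, process_name` CSV lines on stdin, which is the shape `nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader` is expected to print.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "pid, process_name" CSV lines on stdin and print
# the PID of the first process whose name matches the argument exactly.
gpu_pid_for() {
    awk -F', *' -v name="$1" '$2 == name { print $1; exit }'
}

# Example (on the host, assuming the container process is /app/counter):
#   PID="$(nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader \
#          | gpu_pid_for /app/counter)"
#   sudo cuda-checkpoint --toggle --pid "$PID"
```

If no matching process is found, the helper prints nothing, so the caller should check that `$PID` is non-empty before running `cuda-checkpoint`.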
$ criu_ORIG --version
Version: 3.19

criu v4 compiled version

```bash
$ criu --version
Version: 4.0
GitID: v4.0-23-gf6baf8143
```

```bash
$ sudo criu check --all
Warn (criu/cr-check.c:1348): Nftables based locking requires libnftables and set concatenations support
Error (criu/cr-check.c:1553): unmatched dev:ino 0:38:9 (expected 0:39:9)
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.
```

CentOS host details:

```bash
$ uname -mor
5.14.0-522.el9.x86_64 x86_64 GNU/Linux
$ cat /etc/system-release
CentOS Stream release 9
```