NVIDIA / cuda-checkpoint

CUDA checkpoint and restore utility

Failed to checkpoint dump container using GPU - Unable to connect a transport socket: Permission denied #19


ezerk commented 4 days ago

Hi, my end goal is to checkpoint containers that use a GPU during the CI phase and later deploy them with Kubernetes, in order to reduce pod warm-up time.

So far I have been experimenting locally and was able to dump a GPU process (even a relatively complex one, using the workaround suggested in https://github.com/NVIDIA/cuda-checkpoint/issues/4),

but when I run the app as a container, it fails at an early stage of the dump.

Steps to reproduce:

Following the example code provided in this repo (counter.cu), I wrap it in a container using this Dockerfile:

```dockerfile
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 as builder
COPY counter.cu .
RUN nvcc counter.cu -o /tmp/counter

FROM nvidia/cuda:12.4.1-base-ubuntu22.04 as main
WORKDIR /app
COPY --from=builder /tmp/counter /app/counter
EXPOSE 10000
ENTRYPOINT ["/app/counter"]
```

I'm using podman since it provides the option to checkpoint to an image, which I hope to use later on.
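
For context, here is a rough sketch of building, starting (with GPU access via CDI), and checkpointing such a container with podman; the image tag, the CDI device name `nvidia.com/gpu=all`, and the archive path are placeholders, not the exact commands from this report:

```bash
# Placeholder sketch: build the image, start it with GPU access via CDI, then
# attempt a checkpoint to a tar archive. Names and paths are illustrative only.
podman build -t cuda-counter .
sudo podman run -d --name cuda-counter --device nvidia.com/gpu=all cuda-counter
sudo podman container checkpoint -l -e /tmp/counter-checkpoint.tar
```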

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   74C    P0             32W /   70W |     103MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    192578      C   /app/counter                                  100MiB |
+-----------------------------------------------------------------------------------------+
```

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   74C    P0             33W /   70W |       1MiB /  15360MiB |      7%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

- checkpointing the running container fails early in the dump; the criu log ends with:

```
(00.007387) Error (criu/mount.c:757): mnt: 637:./usr/lib/firmware/nvidia/550.127.05/gsp_tu10x.bin doesn't have a proper root mount
(00.007403) net: Unlock network
(00.007407) Running network-unlock scripts
(00.024091) Unfreezing tasks into 1
(00.024103) Unseizing 192578 into 1
(00.024161) Error (criu/cr-dump.c:2111): Dumping FAILED.
```

- trying to use the underlying `criu` command directly (`sudo criu dump --shell-job --images-dir dump --external 'mnt[]:sm' -vvvv -o dump2.log`), I get further (thanks to the `--external 'mnt[]:sm'` flag), but it still fails, this time with:

```
(00.017847) Putting tsock into pid 192578
(00.018191) Error (compel/src/lib/infect.c:713): Unable to connect a transport socket: Permission denied
(00.018200) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018202) Error (compel/src/lib/ptrace.c:96): Can't poke 192578 @ 0x55887f472000 from 0x7ffc2ad59a58 sized 8
(00.018204) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018206) Error (compel/src/lib/ptrace.c:100): Can't restore the original data with poke
(00.018207) Error (compel/src/lib/infect.c:637): Can't inject syscall blob (pid: 192578)
(00.018209) Warn (criu/parasite-syscall.c:439): Can't cure failed infection
(00.018215) Error (criu/cr-dump.c:1610): Can't infect (pid: 192578) with parasite
(00.018276) net: Unlock network
(00.018279) Running network-unlock scripts
(00.034182) Unfreezing tasks into 1
(00.034194) Unseizing 192578 into 1
(00.034200) Error (compel/src/lib/infect.c:418): Unable to detach from 192578: No such process
(00.034239) Error (criu/cr-dump.c:2111): Dumping FAILED.
```

Note: the same error reproduces with both the rpm-installed criu (criu-3.19-1.el9.x86_64) and with criu v4.0 compiled from the [latest commit](https://github.com/checkpoint-restore/criu/tree/f6baf8143b8b6490c7fca5d7a9cf948b5f5ed02c) of the criu-dev branch.

complete criu log files:
- [criu_dump_v4.log](https://github.com/user-attachments/files/17752803/criu_dump_v4.log)
- [criu_dump_v3.19.log](https://github.com/user-attachments/files/17752802/criu_dump_v3.19.log)

##### Spec
<details>
<summary>criu v3.19 (rpm version)</summary>

```bash
$ criu_ORIG --version
Version: 3.19
```

```bash
$ sudo criu_ORIG check --all
Warn  (criu/cr-check.c:1346): Nftables based locking requires libnftables and set concatenations support
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.
```
</details>

<details>
<summary>criu v4 (compiled version)</summary>

```bash
$ criu --version
Version: 4.0
GitID: v4.0-23-gf6baf8143
```

```bash
$ sudo criu check --all
Warn (criu/cr-check.c:1348): Nftables based locking requires libnftables and set concatenations support
Error (criu/cr-check.c:1553): unmatched dev:ino 0:38:9 (expected 0:39:9)
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.
```
</details>

<details>
<summary>CentOS host details</summary>

```bash
$ uname -mor
5.14.0-522.el9.x86_64 x86_64 GNU/Linux
$ cat /etc/system-release
CentOS Stream release 9
```
</details>
rst0git commented 4 days ago

> Driver Version: 550.127.05

@ezerk Would you be able to update your driver version to 555 or 560?

The following readme file provides more information: https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/cuda
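
For anyone following along, here is a hedged sketch of building CRIU v4.0 from the criu-dev branch together with its CUDA plugin; build prerequisites, make targets, and install paths vary, so treat the README linked above as authoritative:

```bash
# Hedged sketch, not verified on this host: build CRIU (criu-dev) plus the CUDA plugin.
git clone --branch criu-dev https://github.com/checkpoint-restore/criu.git
cd criu
make                     # build criu itself
make -C plugins/cuda     # build the CUDA plugin (exact target name is an assumption)
sudo make install
criu --version
# The plugin drives GPU checkpoint/restore through the cuda-checkpoint utility,
# so that binary must be reachable in $PATH when criu runs.
```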

ezerk commented 4 days ago

@rst0git - thanks for the super fast reply, much appreciated! I will give it a try and update.

ezerk commented 8 hours ago

I upgraded the driver on the host twice, to 555 and then to 560, and I'm still getting the same error (tried with both criu v3.19 and v4).

see driver details

nvidia-driver.x86_64  3:555.42.06-1.el9  @cuda-rhel9-x86_64

```
dnf module install nvidia-driver:555-open

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
```

I've tried running images based on `nvidia/cuda:12.6.2-base-ubuntu24.04` or `nvidia/cuda:12.5.0-base-ubuntu24.04`.
nvidia-driver.x86_64  3:560.35.03-1.el9  @cuda-rhel9-x86_64

```
dnf module install nvidia-driver:560-open

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
```
rst0git commented 7 hours ago

```
(00.018191) Error (compel/src/lib/infect.c:713): Unable to connect a transport socket: Permission denied
(00.018200) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018202) Error (compel/src/lib/ptrace.c:96): Can't poke 192578 @ 0x55887f472000 from 0x7ffc2ad59a58 sized 8
(00.018204) Error (compel/src/lib/ptrace.c:73): POKEDATA failed: No such process
(00.018206) Error (compel/src/lib/ptrace.c:100): Can't restore the original data with poke
(00.018207) Error (compel/src/lib/infect.c:637): Can't inject syscall blob (pid: 192578)
(00.018209) Warn  (criu/parasite-syscall.c:439): Can't cure failed infection
```

The error above is unrelated. It occurs because SELinux is enabled and prevents the CRIU parasite code from writing to the log file descriptor.
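
As an aside (not part of the original exchange), two standard checks on a CentOS host can confirm that SELinux is what is blocking the parasite:

```bash
getenforce                        # "Enforcing" means an SELinux policy is being applied
sudo ausearch -m avc -ts recent   # recent AVC denials; look for entries involving the criu/container process
# The workaround used below is podman's --security-opt=label=disable, which runs the
# container without an SELinux label instead of relaxing SELinux for the whole host.
```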

> trying to use the underlying `criu` command directly

I would not recommend this approach because specifying all required CRIU options would be very challenging.

@ezerk Would you be able to try the following with CRIU v4.0 and CUDA plugin?

```bash
mkdir -p /etc/criu
echo -e "tcp-established\nghost-limit=100M\ntimeout=300" | sudo tee /etc/criu/runc.conf
sed -i 's/#runtime = "crun"/runtime = "runc"/' /usr/share/containers/containers.conf

sudo podman run -d --name cuda-counter --device nvidia.com/gpu=all --security-opt=label=disable \
        quay.io/radostin/cuda-counter

sudo podman logs -l
sudo podman container checkpoint -l -e /tmp/test.tar
podman rm -f cuda-counter
sudo podman container restore -i /tmp/test.tar
sudo podman logs -l
```
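
For reference, after those commands the two modified files should contain roughly the following (derived from the `echo` and `sed` lines above):

```bash
$ cat /etc/criu/runc.conf
tcp-established
ghost-limit=100M
timeout=300

$ grep '^runtime' /usr/share/containers/containers.conf
runtime = "runc"
```
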
ezerk commented 4 hours ago

It worked! Changing `runtime = "runc"` in /usr/share/containers/containers.conf did it. The `--security-opt=label=disable` flag on `podman run` is also mandatory.
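
One possible sanity check (assuming a reasonably recent podman) that the runtime switch actually took effect:

```bash
podman info --format '{{.Host.OCIRuntime.Name}}'   # expected to print "runc" after editing containers.conf
```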

The changes to /etc/criu/runc.conf do not seem to be mandatory.

I will update later on about my experience with more complex applications. Many thanks!