NVIDIA / cuda-checkpoint

CUDA checkpoint and restore utility

Cannot checkpoint container: /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 1 #16

Open fscomfs opened 2 weeks ago

fscomfs commented 2 weeks ago

Description

Kubernetes 1.28, containerd 2.0

I want to create a container checkpoint by calling the kubelet checkpoint API with curl.

Steps to reproduce the issue:

curl -sk -X POST "https://127.0.0.1:10250/checkpoint/default/gpu-base-02/gpu-base-02" --key /etc/kubernetes/pki/apiserver-kubelet-client.key --cacert /etc/kubernetes/pki/ca.crt --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt

Describe the results you received:

I want the curl request to the kubelet checkpoint endpoint to create a container checkpoint.

Describe the results you expected:

What actually happens is that an error occurs, showing:

checkpointing of default/gpu-base-02/gpu-base-02 failed (rpc error: code = Unknown desc = checkpointing container "208a82339ddc590e460b89912304f56ad64924f89a959f982b17aeb6ab0c2aa8" failed: /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 1: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v2.task/k8s.io/208a82339ddc590e460b89912304f56ad64924f89a959f982b17aeb6ab0c2aa8/criu-dump.log: unknown)

Additional information you deem important (e.g. issue happens only occasionally):

CRIU logs and information:

CRIU full dump/restore logs:

(00.011105) mnt: Inspecting sharing on 1494 shared_id 0 master_id 0 (@./proc/sys)
(00.011109) mnt: Inspecting sharing on 1493 shared_id 0 master_id 0 (@./proc/irq)
(00.011113) mnt: Inspecting sharing on 1492 shared_id 0 master_id 0 (@./proc/fs)
(00.011116) mnt: Inspecting sharing on 1491 shared_id 0 master_id 0 (@./proc/bus)
(00.011120) mnt: Inspecting sharing on 1611 shared_id 0 master_id 13 (@./proc/driver/nvidia/gpus/0000:b1:00.0)
(00.011124) Error (criu/mount.c:1088): mnt: Mount 1611 ./proc/driver/nvidia/gpus/0000:b1:00.0 (master_id: 13 shared_id: 0) has unreachable sharing. Try --enable-external-masters.
(00.011142) net: Unlock network
(00.011146) Running network-unlock scripts
(00.011149) RPC
(00.072541) Unfreezing tasks into 1
(00.072552) Unseizing 1641382 into 1
(00.072562) Unseizing 1641424 into 1
(00.072568) Unseizing 1641533 into 1
(00.072580) Unseizing 1641475 into 1
(00.072586) Unseizing 1641500 into 1
(00.072599) Unseizing 2157578 into 1
(00.072632) Error (criu/cr-dump.c:2093): Dumping FAILED.
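Most of a CRIU dump log is informational, so it can help to filter it down to the failing entries before reading further. The following is a minimal sketch, assuming that errors are tagged with the literal text "Error (" as in the excerpt above; the helper name criu_errors is hypothetical and not part of CRIU.

```shell
# Hypothetical helper: print only the error entries from a CRIU log.
# Assumption: error entries contain the literal text "Error (",
# as they do in the log excerpt in this issue.
criu_errors() {
    # "$@" lets the caller pass a log file; with no argument grep reads stdin.
    grep -F 'Error (' "$@"
}
```

It could be called as `criu_errors <path to criu-dump.log>`, using the log path reported in the checkpoint error message. On a log like this one, it would surface the criu/mount.c:1088 and criu/cr-dump.c:2093 entries.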

Output of criu --version:

Version: 3.18

Output of criu check --all:

Looks good.

Additional environment details:

rst0git commented 2 weeks ago

> Version: 3.18

Hi @fscomfs, you would need to install the latest version of CRIU (v4.0) to enable support for CUDA checkpointing.
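A quick way to confirm whether a node meets this requirement is to compare the installed version against 4.0. The following is a rough sketch, not an official check: it assumes `criu --version` prints a line of the form "Version: X.Y" (as shown in the report above) and that GNU `sort -V` is available. The function names version_ge and criu_version are hypothetical.

```shell
# Hypothetical helper: succeed if version $1 is at least version $2.
# Assumption: GNU sort with -V (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: read `criu --version` output on stdin and print
# the number from the first "Version: X.Y" line.
criu_version() {
    awk '/^Version:/ { print $2; exit }'
}

# Only run the check when criu is actually installed on this machine.
if command -v criu >/dev/null 2>&1; then
    installed=$(criu --version | criu_version)
    if version_ge "$installed" 4.0; then
        echo "CRIU $installed meets the 4.0 requirement"
    else
        echo "CRIU $installed is older than 4.0"
    fi
fi
```

With the version from the report, `version_ge 3.18 4.0` fails, which matches the advice to upgrade.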

> containerd 2.0

So far we have tested GPU checkpointing only with CRI-O. Would you be able to use CRI-O instead of containerd? Note that containerd doesn't currently support container restore (https://github.com/containerd/containerd/pull/10365).

Adrian (@adrianreber) and I are working on enabling GPU checkpointing support with Kubernetes. You can reach out to us in the Kubernetes Slack if you have any questions :)