NVIDIA / cuda-checkpoint

CUDA checkpoint and restore utility

Error on restoring application in docker containers with partial GPU passthrough #12

Open alexfrolov opened 2 months ago

alexfrolov commented 2 months ago

Hi!

I have been testing cuda-checkpoint for checkpoint/restore (CR) inside Docker containers and found that when only a subset of the NVIDIA devices is passed to the container, cuda-checkpoint fails to restore the application, whereas it works perfectly well when --gpus all is specified. It seems that the driver does not support this scenario. Is it possible to add this functionality?

Best, Alex

alexndrfrolov@cuda-cr-v100:~$ sudo docker run --rm -ti --gpus '"device=0"' -v /home/alexndrfrolov/cuda-checkpoint:/cuda-checkpoint nvcr.io/nvidia/cuda:12.5.1-runtime-ubuntu22.04 bash

==========
== CUDA ==
==========

CUDA Version 12.5.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

root@e7c03f85752c:/# /cuda-checkpoint/src/counter_hot &
[1] 29
root@e7c03f85752c:/# /cuda-checkpoint/bin/x86_64_Linux/cuda-checkpoint --pid 29 --action lock
root@e7c03f85752c:/# /cuda-checkpoint/bin/x86_64_Linux/cuda-checkpoint --pid 29 --action checkpoint
root@e7c03f85752c:/# /cuda-checkpoint/bin/x86_64_Linux/cuda-checkpoint --pid 29 --action restore
Could not restore on process ID 29: "OS call failed or operation not supported on this OS"
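
The three commands above can be wrapped in a small helper that stops at the first failing step, which makes it easier to compare a run with --gpus all against a run with a single device. This is only a sketch: the function name cr_cycle and the CUDA_CHECKPOINT variable are made up for illustration, and only the --pid and --action flags shown in the session above are used.

```shell
#!/usr/bin/env bash
# Sketch: run the lock -> checkpoint -> restore sequence from the session
# above and report the first step that fails.
# CUDA_CHECKPOINT is a hypothetical variable naming the binary to run;
# the default is the path used inside the container in this report.
CUDA_CHECKPOINT="${CUDA_CHECKPOINT:-/cuda-checkpoint/bin/x86_64_Linux/cuda-checkpoint}"

cr_cycle() {
    local pid="$1" action
    for action in lock checkpoint restore; do
        # Only the --pid and --action flags seen in the report are used.
        if ! "$CUDA_CHECKPOINT" --pid "$pid" --action "$action"; then
            echo "step failed: $action" >&2
            return 1
        fi
    done
    echo "all steps succeeded for pid $pid"
}
```

Inside the container, after starting counter_hot in the background, this could be called as cr_cycle 29 (or with whatever PID the shell reports).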

alexfrolov commented 1 month ago

Hi!

Some more information on that issue:

Base OS: Ubuntu 22.04.4 LTS
Kernel version: 5.15.0-118-generic
Drivers: 555.42.06, 560.28.03
Docker image: nvcr.io/nvidia/cuda:12.5.1-devel-ubuntu22.04
Server: Docker Engine - Community
 Engine:
  Version:          27.1.1
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.12
  Git commit:       cc13f95
  Built:            Tue Jul 23 19:57:01 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.19
  GitCommit:        2bf793ef6dc9a18e00cb12efb64355c2c9d5eb41
 runc:
  Version:          1.1.13
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0

jesus-ramos commented 1 month ago

We're looking to add support for this in a future release.