NVIDIA / libnvidia-container

NVIDIA container runtime library
Apache License 2.0

Trouble Running NVIDIA GPU Containers on Custom Yocto-Based Distro on HPE Server with NVIDIA A40 GPU #257

Open · Nauman3S opened this issue 3 months ago

Nauman3S commented 3 months ago

I'm experiencing difficulties running NVIDIA GPU containers on a custom Yocto-based distribution tailored for an HPE server equipped with an NVIDIA A40 GPU. Despite having set up a custom meta-nvidia layer (mickledore branch), which includes recipes for NVIDIA drivers, libnvidia-container, libtirpc, and nvidia-container-toolkit (based on meta-tegra's recipes-containers layer at OE4T/meta-tegra), I encounter errors when attempting to run containers that utilize the GPU.

Distro Details:

Distro: poky
Included Recipes and Layers: containerd, virtualization layers, NVIDIA drivers and kernel modules, systemd, kernel headers, etc.

Issue Reproduction Steps:

Configuring the container runtime:

sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
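For reference, nvidia-ctk runtime configure --runtime=containerd typically registers an nvidia runtime in /etc/containerd/config.toml along these lines (illustrative sketch; exact keys vary with the containerd version):

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"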

Pulling images for testing:

sudo ctr images pull docker.io/nvidia/cuda:12.0.0-runtime-ubuntu20.04
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-runtime-ubi8
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-base-ubuntu20.04
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-base-ubi8

Running a container with GPU:

sudo ctr run --rm --gpus 0 --runtime io.containerd.runc.v1 --privileged docker.io/nvidia/cuda:12.0.0-runtime-ubuntu20.04 test nvidia-smi

Error Message:

ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1: unknown

This error persists across all pulled NVIDIA images (non-Ubuntu-based images show the same error, but with /sbin/ldconfig instead of /sbin/ldconfig.real). However, non-GPU containers (e.g., docker.io/macabees/neofetch:latest) work without issues.
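As a quick check outside of containerd, the CLI can be run directly with debug output sent to the terminal (the same -k -d /dev/tty flags used for the info command further down); this exercises driver and ldcache discovery, though not the in-container ldconfig run itself:

sudo nvidia-container-cli -k -d /dev/tty list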

Further Details:

Running ldconfig -p shows 264 libs found, including various NVIDIA libraries, and running ldconfig by itself produces no errors.
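For completeness, checks along these lines confirm what /sbin/ldconfig actually is on the host (on minimal Yocto images it is not necessarily the same glibc binary Ubuntu ships; paths as in the error above):

file /sbin/ldconfig
ls -l /sbin/ldconfig /sbin/ldconfig.real
/sbin/ldconfig --version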

Output of uname -a: Linux intel-corei7-64-02 6.1.38-intel-pk-standard #1 SMP PREEMPT_DYNAMIC Thu Jul 13 04:53:52 UTC 2023 x86_64 GNU/Linux

Output from sudo nvidia-container-cli -k -d /dev/tty info includes warnings about missing libraries and compat32 libraries, although nvidia-smi shows the GPU is recognized correctly.

Attempted Solutions:

- Verifying that all NVIDIA driver and toolkit components are correctly installed.
- Ensuring the ldconfig cache is current and includes paths to the NVIDIA libraries, and that /sbin/ldconfig.real is a symlink to /sbin/ldconfig (see the sketch below).
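A sketch of those verification commands, assuming the paths named above:

ls -l /sbin/ldconfig.real          # expected to point at /sbin/ldconfig
sudo ldconfig                      # rebuild /etc/ld.so.cache
ldconfig -p | grep -i nvidia | head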

Despite these efforts, the error persists, and GPU containers fail to start. I'm seeking advice on resolving this ldcache and container initialization error to run NVIDIA GPU containers on this custom Yocto-based distribution.

elezar commented 3 months ago

@Nauman3S which version of the NVIDIA Container Toolkit are you using?

The issue is most likely that the ldconfig entry in the /etc/nvidia-container-runtime/config.toml file is incorrect. We should be resolving the proper ldconfig path and only using ldconfig.real if it actually exists (e.g. on Ubuntu systems).

Could you also provide the content of the config file?

Nauman3S commented 3 months ago

@elezar my /etc/nvidia-container-runtime/config.toml file:

disable-require = false
[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
debug = "/var/log/nvidia-container-toolkit.log"
ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = true
#user = "root:video"
ldconfig = "/sbin/ldconfig"
#alpha-merge-visible-devices-envvars = false

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
log-level = "debug"

# Specify the runtimes to consider. This list is processed in order and the PATH
# searched for matching executables unless the entry is an absolute path.
runtimes = [
        "runc",
]

mode = "auto"

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

I also tried it with ldconfig = "@/sbin/ldconfig" and ldconfig = "@/sbin/ldconfig.real", but the result is the same. And nvidia-ctk -v outputs:

nvidia-ctk -v 
NVIDIA Container Toolkit CLI version 1.14.6
commit: 4668c511de4b311c96bc3dd0310bff40b75083bd
elezar commented 3 months ago

Could you replace

ldconfig = "/sbin/ldconfig"

with

ldconfig = "@/sbin/ldconfig"

The @ indicates that the path is on the host -- meaning that the ldconfig on the host is used to update the ldcache in the container.
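For illustration, the two forms side by side in /etc/nvidia-container-runtime/config.toml:

ldconfig = "@/sbin/ldconfig"    # '@' prefix: the host's ldconfig binary is used
# ldconfig = "/sbin/ldconfig"   # no prefix: the path is resolved inside the container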

Nauman3S commented 3 months ago

Thank you. I already changed it to use the @ prefix, but I still get the same error. Is there any way to get more verbose logs from the CLI?

elezar commented 3 months ago

@Nauman3S unfortunately there aren't too many logs available for that specific part of the code. One thing you could try is checking whether the error persists with the v1.15.0-rc.4 release candidate; we have made some changes to the normalization of the ldconfig path there. It is still recommended to use @/sbin/ldconfig in the config file.
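As a side note, the config.toml posted above already routes debug output to log files; after a failed run these may contain more detail than the hook's stderr (paths taken from that config):

sudo tail -n 50 /var/log/nvidia-container-toolkit.log
sudo tail -n 50 /var/log/nvidia-container-runtime.log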

I will have a look to see if there's anything obvious that's amiss with how things are being handled.