Trouble Running NVIDIA GPU Containers, `ldconfig failed`

Nauman3S commented 3 months ago

I'm unable to run NVIDIA GPU containers: any container that uses the GPU fails to start with an ldconfig error.

Issue Reproduction Steps:

Configuring the container runtime:

sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
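To confirm that the configure step took effect, one could check the containerd config for an NVIDIA runtime entry. This is a minimal sketch: it assumes the default config path /etc/containerd/config.toml and that the entry's table name contains `runtimes.nvidia`. The helper name `has_nvidia_runtime` and the `CONTAINERD_CONFIG` variable are hypothetical, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given containerd config file mentions
# an NVIDIA runtime table. A missing or unreadable file counts as "not found".
has_nvidia_runtime() {
    grep -q 'runtimes\.nvidia' "$1" 2>/dev/null
}

# Assumption: default containerd config location; override via CONTAINERD_CONFIG.
if has_nvidia_runtime "${CONTAINERD_CONFIG:-/etc/containerd/config.toml}"; then
    echo "nvidia runtime entry found"
else
    echo "nvidia runtime entry not found"
fi
```

This only checks that the entry exists in the file; it does not show whether a given ctr invocation actually uses that runtime.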

Pulling images for testing:

sudo ctr images pull docker.io/nvidia/cuda:12.0.0-runtime-ubuntu20.04
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-runtime-ubi8
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-base-ubuntu20.04
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-base-ubi8

Running a container with GPU:

sudo ctr run --rm --gpus 0 --runtime io.containerd.runc.v1 --privileged docker.io/nvidia/cuda:12.0.0-runtime-ubuntu20.04 test nvidia-smi

Error Message:

ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1: unknown

This error persists across all pulled NVIDIA images (non-Ubuntu-based images show the same error, but with /sbin/ldconfig instead of /sbin/ldconfig.real). However, non-GPU containers (e.g., docker.io/macabees/neofetch:latest) work without issues.

Further Details:

Running ldconfig -p lists 264 libraries, including various NVIDIA libraries, and running ldconfig itself reports no errors.

Output from sudo nvidia-container-cli -k -d /dev/tty info includes warnings about missing libraries and compat32 libraries, although nvidia-smi shows the GPU is recognized correctly.
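The library check described above can be made repeatable with a small filter over `ldconfig -p` output. This is a rough sketch: the helper name `count_nvidia_libs` is hypothetical, and matching on the strings "nvidia" or "cuda" is an assumption about how the relevant libraries are named.

```shell
#!/bin/sh
# Hypothetical helper: counts lines on stdin that look like NVIDIA/CUDA
# library entries (case-insensitive match on "nvidia" or "cuda").
# grep -c exits non-zero when the count is 0, so "|| true" keeps status 0.
count_nvidia_libs() {
    grep -c -i -E 'nvidia|cuda' || true
}

# Only query the linker cache if ldconfig is on PATH for the current user.
if command -v ldconfig >/dev/null 2>&1; then
    ldconfig -p | count_nvidia_libs
fi
```

A non-zero count only shows that the host cache lists such libraries; it does not explain why ldconfig fails when invoked by the container hook.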

Attempted Solutions:

Verified that all NVIDIA driver and toolkit components are correctly installed.

Ensured that the ldconfig cache is current and includes paths to the NVIDIA libraries, and that /sbin/ldconfig.real is a symlink to /sbin/ldconfig.
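The symlink arrangement mentioned above can be inspected with a short script. This is a sketch for illustration only; the helper name `describe_path` is hypothetical, and the two paths are the ones named in the error messages.

```shell
#!/bin/sh
# Hypothetical helper: reports whether a path is a symlink (and its target),
# a regular file, or neither.
describe_path() {
    if [ -L "$1" ]; then
        echo "symlink -> $(readlink "$1")"
    elif [ -f "$1" ]; then
        echo "regular file"
    else
        echo "missing or not a regular file"
    fi
}

# Paths taken from the error messages in this report.
for p in /sbin/ldconfig /sbin/ldconfig.real; do
    echo "$p: $(describe_path "$p")"
done
```

A dangling symlink is still reported as a symlink here, so the printed target is worth checking by hand.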

Despite these efforts, the error persists, and GPU containers fail to start. I'm seeking advice on resolving this ldcache and container initialization error to run NVIDIA GPU containers.

dwalkes commented 3 months ago

Hi,

Which branch, MACHINE and image are you using?
Have you tried tegra-demo-distro?

You can see the tests we run on meta-tegra images in the test spreadsheet.

Nauman3S commented 3 months ago

Hi,

The image is image-full, the branch is mickledore, and the MACHINE is orion.
I do have some unrelated layers in my final build, such as neofetch, but they do not interfere with any other layers.

The issue is that I need to use containerd instead of Docker, so I removed the docker recipe(s) from the build. With containerd I get this error, although nothing related to the kernel or the NVIDIA drivers has changed.

ichergui commented 3 months ago

Hi @Nauman3S

Could you please use the nanbield branch instead of mickledore? mickledore is a deprecated branch.

Please share any findings once you are able to test with the nanbield branch.

ichergui commented 2 months ago

Hi @Nauman3S, any update on this issue?

ichergui commented 1 month ago

Closing this issue since no updates were provided. Feel free to open a new issue.

OE4T / meta-tegra