Nauman3S closed this issue 1 month ago.
Hi, which MACHINE and image are you using? You can see the tests we run on meta-tegra images in the test spreadsheet.
Hi, the image is image-full, the branch is mickledore, and the machine is orion. … neofetch, but they are not interfering with any other layers. The issue is that I need to use containerd instead of docker, so I removed the docker recipe(s) from the build, and with containerd I am getting this error, although nothing related to the kernel or nvidia-drivers has changed.
Hi @Nauman3S, could you please use the nanbield branch instead of mickledore? mickledore is a deprecated branch.
Please share any findings once you are able to test with the nanbield branch.
Hi @Nauman3S, any update on this issue?
Closing this issue since no updates provided. Feel free to open new issue.
I'm having trouble running NVIDIA GPU containers: every container that uses the GPU fails to start with an error.
Issue Reproduction Steps:
Configuring the container runtime:
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
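One way to confirm that the nvidia-ctk step actually changed the containerd configuration is to look for the NVIDIA runtime binary in the config file. This is a minimal sketch that assumes the default path /etc/containerd/config.toml and assumes the generated entry references nvidia-container-runtime; has_nvidia_runtime is a hypothetical helper, not part of any NVIDIA tool.

```shell
#!/usr/bin/env bash
# Sketch: check that `nvidia-ctk runtime configure --runtime=containerd`
# left an NVIDIA runtime entry in the containerd config.
# Assumptions: default config path /etc/containerd/config.toml, and the
# entry mentions the nvidia-container-runtime binary.

# has_nvidia_runtime (hypothetical helper):
#   returns 0 if the config mentions nvidia-container-runtime,
#   1 if it does not, 2 if the file cannot be read.
has_nvidia_runtime() {
    local config="${1:-/etc/containerd/config.toml}"
    if [ ! -r "$config" ]; then
        echo "cannot read $config" >&2
        return 2
    fi
    grep -q 'nvidia-container-runtime' "$config"
}
```

After sourcing the script, calling has_nvidia_runtime with no argument checks the default path; pass another path to inspect a different config file.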
Pulling images for testing:
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-runtime-ubuntu20.04
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-runtime-ubi8
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-base-ubuntu20.04
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-base-ubi8
Running a container with GPU:
sudo ctr run --rm --gpus 0 --runtime io.containerd.runc.v1 --privileged docker.io/nvidia/cuda:12.0.0-runtime-ubuntu20.04 test nvidia-smi
Error Message:
ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1: unknown
This error persists across all pulled NVIDIA images (non-Ubuntu-based images show the same error, but with /sbin/ldconfig instead of /sbin/ldconfig.real). However, non-GPU containers (e.g., docker.io/macabees/neofetch:latest) work without issues.
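To check that the failure really is independent of the base image, the same nvidia-smi test can be looped over all four pulled images. This is a minimal sketch: it reuses the exact ctr flags from the failing command, assumes it runs as root (the steps above use sudo), and uses hypothetical container IDs gpu-test-1 to gpu-test-4 instead of test.

```shell
#!/usr/bin/env bash
# Sketch: run the nvidia-smi smoke test against each pulled NVIDIA image
# and report PASS/FAIL per image.
# Assumptions: ctr is on PATH, the caller is root, and the container IDs
# gpu-test-N are hypothetical names.

IMAGES=(
    docker.io/nvidia/cuda:12.0.0-runtime-ubuntu20.04
    docker.io/nvidia/cuda:12.0.0-runtime-ubi8
    docker.io/nvidia/cuda:12.0.0-base-ubuntu20.04
    docker.io/nvidia/cuda:12.0.0-base-ubi8
)

# gpu_smoke_test_all: prints "PASS <image>" or "FAIL <image>" for each
# argument and returns the number of failing images.
gpu_smoke_test_all() {
    local failures=0 i=0 image
    for image in "$@"; do
        i=$((i + 1))
        # Same flags as the failing command in the report; --rm removes
        # the container once nvidia-smi exits.
        if ctr run --rm --gpus 0 --runtime io.containerd.runc.v1 --privileged \
            "$image" "gpu-test-$i" nvidia-smi >/dev/null 2>&1; then
            echo "PASS $image"
        else
            echo "FAIL $image"
            failures=$((failures + 1))
        fi
    done
    return "$failures"
}
```

Source the script in a root shell and call gpu_smoke_test_all "${IMAGES[@]}"; the return value is the number of images that failed.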
Further Details:
Running ldconfig -p reports 264 libs found, including various NVIDIA libraries, and running ldconfig itself outputs no error.
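The "264 libs found" text matches the header line that glibc's ldconfig -p prints. A small filter can report that total alongside how many entries look NVIDIA-related. This sketch matches library names by the illustrative prefixes libnvidia, libcuda, and libnvrm, which is an assumption rather than an authoritative list of Jetson driver libraries; summarize_ldcache is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: summarize `ldconfig -p` output.
# Assumption: NVIDIA libraries are recognized purely by name prefix
# (libnvidia*, libcuda*, libnvrm* are illustrative, not exhaustive).

# summarize_ldcache (hypothetical helper): reads `ldconfig -p` output on
# stdin and prints "total=<N> nvidia=<M>".
summarize_ldcache() {
    awk '
        NR == 1 && $2 == "libs" { total = $1 }   # header: "<N> libs found in cache ..."
        /=>/ && $1 ~ /^lib(nvidia|cuda|nvrm)/ { nv++ }
        END { printf "total=%d nvidia=%d\n", total, nv }
    '
}
```

For example: ldconfig -p | summarize_ldcache.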
The output of sudo nvidia-container-cli -k -d /dev/tty info includes warnings about missing libraries and missing compat32 libraries, although nvidia-smi shows that the GPU is recognized correctly.
Attempted Solutions:
Verified that all NVIDIA driver and toolkit components are correctly installed.
Ensured that the ldconfig cache is current and includes the paths to the NVIDIA libraries, and that /sbin/ldconfig.real is a symlink to /sbin/ldconfig.
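The symlink check from the attempted solutions can be scripted so that it confirms both paths resolve to the same executable. A minimal sketch, assuming readlink -f is available (as in GNU coreutils); check_ldconfig_link is a hypothetical helper, and its default paths are the ones named in the report.

```shell
#!/usr/bin/env bash
# Sketch: verify that LINK is a symlink resolving to the same executable
# as TARGET (defaults: /sbin/ldconfig.real -> /sbin/ldconfig).
# Assumption: readlink -f is available (GNU coreutils or compatible).

# check_ldconfig_link (hypothetical helper): prints the resolved target and
# returns 0 only if LINK is a symlink, resolves to the same path as TARGET,
# and that path is executable.
check_ldconfig_link() {
    local link="${1:-/sbin/ldconfig.real}" target="${2:-/sbin/ldconfig}"
    local resolved expected
    if [ ! -L "$link" ]; then
        echo "$link is not a symlink" >&2
        return 1
    fi
    resolved=$(readlink -f "$link") || return 1
    expected=$(readlink -f "$target") || return 1
    echo "$link -> $resolved"
    [ "$resolved" = "$expected" ] && [ -x "$resolved" ]
}
```

Called with no arguments, it checks /sbin/ldconfig.real against /sbin/ldconfig on the host.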
Despite these efforts, the error persists, and GPU containers fail to start. I'm seeking advice on resolving this ldcache and container initialization error to run NVIDIA GPU containers.
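To see exactly which libraries the nvidia-container-cli info warnings refer to, the debug output can be written to a file instead of /dev/tty and then filtered. This sketch assumes the warning lines contain "] missing library NAME" or "] missing compat32 library NAME"; that log format is an assumption based on typical libnvidia-container debug output, and missing_libs and the /tmp/nvc-debug.log path are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list the libraries nvidia-container-cli reports as missing.
# Assumed debug line shapes:
#   W0101 12:00:00.000000 1234 nvc_info.c:NNN] missing library libfoo.so
#   W0101 12:00:00.000000 1234 nvc_info.c:NNN] missing compat32 library libfoo.so

# missing_libs (hypothetical helper): reads a debug log on stdin and prints
# "native <lib>" or "compat32 <lib>" for each missing-library warning,
# in the order they appear.
missing_libs() {
    sed -n -E \
        -e 's/.*] missing compat32 library ([^ ]+).*/compat32 \1/p' \
        -e 's/.*] missing library ([^ ]+).*/native \1/p'
}
```

For example: sudo nvidia-container-cli -k -d /tmp/nvc-debug.log info >/dev/null, then missing_libs < /tmp/nvc-debug.log.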