Open Hurricane-eye opened 2 years ago
OS is Ubuntu 18.04, GPU is an NVIDIA GeForce RTX 3090, and the NVIDIA driver version is 470.82.01.
------------------ Original message ------------------ From: "NVIDIA/nvidia-docker" @.>; Sent: Monday, February 28, 2022, 1:07 PM @.>; @.**@.>; Subject: Re: [NVIDIA/nvidia-docker] nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1: unknown (Issue NVIDIA/nvidia-container-toolkit#147)
@Hurricane-eye what is your host configuration (i.e. distribution and version)?
Maybe there are too many driver versions on my server?
dpkg -l | grep nvidia
ii libnvidia-cfg1-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-470-server 470.103.01-0ubuntu0.18.04.1 all Shared files used by the NVIDIA libraries
rc libnvidia-compute-470:amd64 470.86-0ubuntu0.18.04.1 amd64 NVIDIA libcompute package
ii libnvidia-compute-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.8.1-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.8.1-1 amd64 NVIDIA container runtime library
ii libnvidia-decode-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 Extra libraries for the NVIDIA Server Driver
ii libnvidia-fbc1-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-ifr1-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
rc nvidia-compute-utils-470 470.86-0ubuntu0.18.04.1 amd64 NVIDIA compute utilities
ii nvidia-compute-utils-470-server 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA compute utilities
ii nvidia-container-toolkit 1.8.1-1 amd64 NVIDIA container runtime hook
rc nvidia-dkms-470 470.86-0ubuntu0.18.04.1 amd64 NVIDIA DKMS package
ii nvidia-dkms-470-server 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA DKMS package
ii nvidia-docker2 2.9.1-1 all nvidia-docker CLI wrapper
ii nvidia-driver-470-server 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA Server Driver metapackage
rc nvidia-kernel-common-470 470.86-0ubuntu0.18.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-common-470-server 470.103.01-0ubuntu0.18.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-470-server 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA kernel source package
ii nvidia-prime 0.8.16~0.18.04.1 all Tools to enable NVIDIA's Prime
ii nvidia-settings 470.57.01-0ubuntu0.18.04.1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-470-server 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA Server Driver support binaries
ii xserver-xorg-video-nvidia-470-server 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA binary Xorg driver
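As an aside, the rc entries in a listing like the one above mark packages that were removed but whose configuration files remain on disk. They are usually harmless, but if multiple driver versions are suspected, they can be identified (and, on the real system, purged) by filtering on the status field. A sketch, using an illustrative two-line sample rather than live dpkg output:

```shell
# Sample of `dpkg -l` output; the leading status field distinguishes
# installed packages ("ii") from removed-but-not-purged ones ("rc").
sample='ii  libnvidia-container1:amd64  1.8.1-1  amd64  NVIDIA container runtime library
rc  nvidia-dkms-470  470.86-0ubuntu0.18.04.1  amd64  NVIDIA DKMS package'

# Extract only the "rc" package names
printf '%s\n' "$sample" | awk '$1 == "rc" {print $2}'

# On the real system, the leftovers could be purged with (needs root):
#   dpkg -l | awk '$1 == "rc" {print $2}' | xargs -r sudo dpkg --purge
```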
Hi! Have you found a solution for this yet?
I created the symlink manually:
ln -s /sbin/ldconfig /sbin/ldconfig.real
I had to do this inside the kind node:
docker exec -ti gpu-control-plane bash
ln -s /sbin/ldconfig /sbin/ldconfig.real
worked for me. It makes me wonder if maybe something needs to be set in the validator.
What do you mean by validator? Note that on Ubuntu-based distributions where /sbin/ldconfig.real is present, it is not a symlink but the actual executable; /sbin/ldconfig is a wrapper script that injects dpkg update triggers before running ldconfig. There is also an option in /etc/nvidia-container-runtime/config.toml that allows this path to be specified, so that it aligns with the expectations of the platform where the package is installed.
The next release of the NVIDIA Container Toolkit should detect these options in a more stable manner, ensuring that ldconfig.real is only used if it is actually present.
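For reference, the fragment of /etc/nvidia-container-runtime/config.toml in question looks along these lines (the value shown is illustrative for a non-Ubuntu host, not a shipped default):

```toml
[nvidia-container-cli]
# Path used to refresh the ldcache inside the container.
# On Ubuntu hosts this can point at ldconfig.real; on most other
# distributions plain ldconfig is the correct target.
ldconfig = "@/sbin/ldconfig"
```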
@elezar Thank you for the information on this. This should help me get to the bottom of this.
What do you mean by validator?
I am installing the NVIDIA GPU Operator on Kind while evaluating options to get GPUs working with my cluster. The operator's validator pod, nvidia-operator-validator, goes into CrashLoopBackOff.
Logs show a failed symlink attempt:
driver-validation time="2023-06-09T23:20:43Z" level=info msg="Creating link /host-dev-char/234:271 => /dev/nvidia-caps/nvidia-cap271"
driver-validation time="2023-06-09T23:20:43Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap271 /host-dev-char/234:271: file exists"
Pod status shows the error with /sbin/ldconfig.real:
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory: unknown
Creating a symlink "fixed" the error, but there is obviously more to it than that. Maybe there is an option in the NVIDIA Container Toolkit that will resolve this.
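For anyone applying the symlink workaround from this thread, it should only be created when ldconfig.real is genuinely absent, since on Ubuntu that path is a real executable that must not be replaced. A minimal, idempotent sketch (the helper name and root parameter are made up for illustration):

```shell
# ensure_ldconfig_real: create the ldconfig.real -> ldconfig symlink under
# the given root, but only when ldconfig.real is genuinely absent.
ensure_ldconfig_real() {
    root="$1"
    if [ -e "$root/sbin/ldconfig" ] && [ ! -e "$root/sbin/ldconfig.real" ]; then
        ln -s "$root/sbin/ldconfig" "$root/sbin/ldconfig.real"
    fi
}

# On a real host this would be run as root with root="" so the paths
# resolve to /sbin/ldconfig and /sbin/ldconfig.real.
```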
I'm using version v1.14.3 and experiencing the same issue as reported by others.
Context: I'm using the GPU Operator, version v23.9.0
From https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.14.3/internal/config/config.go#L124-L129:
func getLdConfigPath() string {
if _, err := os.Stat("/sbin/ldconfig.real"); err == nil {
return "@/sbin/ldconfig.real"
}
return "@/sbin/ldconfig"
}
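The check this function performs can be mimicked from a shell to see which value would be written for a given filesystem. A sketch (the catch, discussed later in this thread, is that the toolkit runs the check inside its own Ubuntu-based container rather than on the host):

```shell
# Shell equivalent of getLdConfigPath: prefer ldconfig.real when present.
# The "@" prefix tells nvidia-container-cli to resolve the path on the host.
if [ -e /sbin/ldconfig.real ]; then
    echo '@/sbin/ldconfig.real'
else
    echo '@/sbin/ldconfig'
fi
```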
If I SSH into the node and check for the existence of /sbin/ldconfig.real:
stat /sbin/ldconfig.real
stat: cannot statx '/sbin/ldconfig.real': No such file or directory
But the file /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml generated by the nvidia-container-toolkit DaemonSet contains:
# ...
[nvidia-container-cli]
environment = []
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/"
# ...
It seems like the getLdConfigPath function is not working as expected.
The only solution to fix this problem is creating a symlink, as stated by others: sudo ln -s /sbin/ldconfig /sbin/ldconfig.real.
@elezar is there another way to configure the ldconfig element in config.toml, or are we talking about a known issue in the getLdConfigPath function?
@cmontemuino are you also trying to run the GPU Operator in Kind? If not, what is your host OS on the node where the NVIDIA Container Toolkit is being configured?
There may be an issue with how we're generating the config, especially in the context of the GPU Operator, where we detect ldconfig.real in the Ubuntu-based container instead of on the host.
Note that deleting (or commenting) that option from the config should cause the right value to be detected when running the NVIDIA Container Runtime from the host.
Hi @elezar, this is not Kind, but Oracle Linux.
uname -r
5.14.0-284.30.1.el9_2.x86_64
We install Kubernetes (rancher/rke2) plus the NVIDIA driver only, then the GPU Operator as an Argo CD application.
@cmontemuino other posters here have pointed out that they were using Kind. The symptom is the same, though: any host OS where /sbin/ldconfig.real does not exist will show this behavior when using the default Ubuntu-based base image.
We should definitely make this more resilient, but for now you could consider switching to the container-toolkit:{{VERSION}}-ubi8 image as a workaround.
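If the GPU Operator is deployed via Helm, that image switch might be expressed as a values override along these lines (the toolkit.version key is assumed from the gpu-operator chart; verify against your chart version before using):

```yaml
# values.yaml override (key name assumed; check your gpu-operator chart)
toolkit:
  version: v1.14.3-ubi8
```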
Just wanted to pop in to say that /sbin/ldconfig.real doesn't exist on Debian 12 either. I had to symlink it for the GPU stuff to work properly.
Yes, most (if not all) non-Ubuntu distributions don't have the ldconfig -> ldconfig.real wrapper. This includes Debian 12. Since Debian is not an officially supported distribution under the GPU Operator, this has not been a priority at present. Note that Kind uses Debian-based images for the nodes, which is why this is triggered there.
Just wanted to comment that I've been fighting all week to get a GPU working in my k3s cluster using containerd. The piece that made the entire thing come together was the missing symlink:
ln -s /sbin/ldconfig /sbin/ldconfig.real
Thank you!!!
My setup is as follows: 6 bare-metal nodes, 3 of which are control-plane nodes (tiny form factor, so no PCIe slots for GPU access), plus 1 VM node running on Unraid with an RTX 2060 passed through.
OS: Fedora Linux 38 (Thirty Eight) Kernel: 6.7.4-100.fc38.x86_64 K3s: v1.28.3+k3s2 containerd: containerd://1.7.7-k3s1
@llajas which version of the NVIDIA Container Toolkit are you using?
@elezar
[root@metal6 ~]# nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.15.0-rc.3
commit: 93e15bc641896a9dc51f297c856c824bf1f45d86
I installed this using sudo yum install -y nvidia-container-toolkit. I see in retrospect that this is a pre-release version, but it is working well for what I'm leveraging (audio/video transcoding across a StatefulSet).
Yes, most (if not all) non-Ubuntu distributions don't have the ldconfig -> ldconfig.real wrapper. This includes Debian 12. Since Debian is not an officially supported distribution under the GPU Operator this has not been a priority at present. Note that Kind uses debian-based images for the nodes which is why this is triggered there.
Good point! I'm on RHEL 8.9, which is supported, and I'm having the same issue. It was fixed by creating the symlink manually.
I have also been having this problem on Rocky 9.3 using MicroK8s. The symlink workaround fixes it.
1. Issue or feature description
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1: unknown.
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
uname -a
dmesg
nvidia-smi -a
docker version
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
nvidia-container-cli -V