NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1: unknown #147

Open · Hurricane-eye opened this issue 2 years ago

Hurricane-eye commented 2 years ago

1. Issue or feature description

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1: unknown.

2. Steps to reproduce the issue

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi 

3. Information to attach (optional if deemed irrelevant)

uname -a
Linux labpano-ThinkStation-P620 5.4.0-91-generic #102~18.04.1-Ubuntu SMP Thu Nov 11 14:46:36 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0226 13:18:46.555553 36161 nvc.c:376] initializing library context (version=1.8.1, build=abd4e14d8cb923e2a70b7dcfee55fbc16bffa353)
I0226 13:18:46.555594 36161 nvc.c:350] using root /
I0226 13:18:46.555599 36161 nvc.c:351] using ldcache /etc/ld.so.cache
I0226 13:18:46.555610 36161 nvc.c:352] using unprivileged user 1000:1000
I0226 13:18:46.555628 36161 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0226 13:18:46.555726 36161 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0226 13:18:46.557507 36162 nvc.c:273] failed to set inheritable capabilities
W0226 13:18:46.557561 36162 nvc.c:274] skipping kernel modules load due to failure
I0226 13:18:46.557869 36163 rpc.c:71] starting driver rpc service
I0226 13:18:46.560107 36164 rpc.c:71] starting nvcgo rpc service
I0226 13:18:46.560936 36161 nvc_info.c:759] requesting driver information with ''
I0226 13:18:46.562288 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.82.01
I0226 13:18:46.562352 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.82.01
I0226 13:18:46.562382 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.470.82.01
I0226 13:18:46.562413 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.82.01
I0226 13:18:46.562454 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.82.01
I0226 13:18:46.562494 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.82.01
I0226 13:18:46.562525 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.470.82.01
I0226 13:18:46.562554 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.82.01
I0226 13:18:46.562597 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.470.82.01
I0226 13:18:46.562637 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.470.82.01
I0226 13:18:46.562664 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.470.82.01
I0226 13:18:46.562692 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.470.82.01
I0226 13:18:46.562722 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.470.82.01
I0226 13:18:46.562762 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.82.01
I0226 13:18:46.562801 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.470.82.01
I0226 13:18:46.562832 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.82.01
I0226 13:18:46.562862 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.82.01
I0226 13:18:46.562902 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.470.82.01
I0226 13:18:46.562931 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.82.01
I0226 13:18:46.562971 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.82.01
I0226 13:18:46.563288 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.470.82.01
I0226 13:18:46.563432 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.470.82.01
I0226 13:18:46.563465 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.470.82.01
I0226 13:18:46.563495 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.470.82.01
I0226 13:18:46.563525 36161 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.470.82.01
I0226 13:18:46.563579 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.470.82.01
I0226 13:18:46.563609 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.470.82.01
I0226 13:18:46.563650 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.470.82.01
I0226 13:18:46.563690 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.470.82.01
I0226 13:18:46.563719 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.470.82.01
I0226 13:18:46.563761 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-ifr.so.470.82.01
I0226 13:18:46.563801 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.470.82.01
I0226 13:18:46.563829 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.470.82.01
I0226 13:18:46.563858 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.470.82.01
I0226 13:18:46.563888 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.470.82.01
I0226 13:18:46.563927 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.470.82.01
I0226 13:18:46.563991 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.470.82.01
I0226 13:18:46.564022 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.470.82.01
I0226 13:18:46.564057 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.470.82.01
I0226 13:18:46.564113 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libcuda.so.470.82.01
I0226 13:18:46.564165 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.470.82.01
I0226 13:18:46.564196 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.470.82.01
I0226 13:18:46.564224 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.470.82.01
I0226 13:18:46.564253 36161 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.470.82.01
W0226 13:18:46.564271 36161 nvc_info.c:398] missing library libnvidia-nscq.so
W0226 13:18:46.564278 36161 nvc_info.c:398] missing library libnvidia-fatbinaryloader.so
W0226 13:18:46.564284 36161 nvc_info.c:398] missing library libnvidia-pkcs11.so
W0226 13:18:46.564290 36161 nvc_info.c:398] missing library libvdpau_nvidia.so
W0226 13:18:46.564295 36161 nvc_info.c:402] missing compat32 library libnvidia-cfg.so
W0226 13:18:46.564301 36161 nvc_info.c:402] missing compat32 library libnvidia-nscq.so
W0226 13:18:46.564307 36161 nvc_info.c:402] missing compat32 library libnvidia-fatbinaryloader.so
W0226 13:18:46.564312 36161 nvc_info.c:402] missing compat32 library libnvidia-allocator.so
W0226 13:18:46.564318 36161 nvc_info.c:402] missing compat32 library libnvidia-pkcs11.so
W0226 13:18:46.564324 36161 nvc_info.c:402] missing compat32 library libnvidia-ngx.so
W0226 13:18:46.564330 36161 nvc_info.c:402] missing compat32 library libvdpau_nvidia.so
W0226 13:18:46.564336 36161 nvc_info.c:402] missing compat32 library libnvidia-rtcore.so
W0226 13:18:46.564341 36161 nvc_info.c:402] missing compat32 library libnvoptix.so
W0226 13:18:46.564347 36161 nvc_info.c:402] missing compat32 library libnvidia-cbl.so
I0226 13:18:46.566161 36161 nvc_info.c:298] selecting /usr/bin/nvidia-smi
I0226 13:18:46.566184 36161 nvc_info.c:298] selecting /usr/bin/nvidia-debugdump
I0226 13:18:46.566203 36161 nvc_info.c:298] selecting /usr/bin/nvidia-persistenced
I0226 13:18:46.566231 36161 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-control
I0226 13:18:46.566249 36161 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-server
W0226 13:18:46.566306 36161 nvc_info.c:424] missing binary nv-fabricmanager
I0226 13:18:46.566332 36161 nvc_info.c:342] listing firmware path /lib/firmware/nvidia/470.82.01/gsp.bin
I0226 13:18:46.566360 36161 nvc_info.c:522] listing device /dev/nvidiactl
I0226 13:18:46.566368 36161 nvc_info.c:522] listing device /dev/nvidia-uvm
I0226 13:18:46.566375 36161 nvc_info.c:522] listing device /dev/nvidia-uvm-tools
I0226 13:18:46.566381 36161 nvc_info.c:522] listing device /dev/nvidia-modeset
I0226 13:18:46.566407 36161 nvc_info.c:342] listing ipc path /run/nvidia-persistenced/socket
W0226 13:18:46.566428 36161 nvc_info.c:348] missing ipc path /var/run/nvidia-fabricmanager/socket
W0226 13:18:46.566444 36161 nvc_info.c:348] missing ipc path /tmp/nvidia-mps
I0226 13:18:46.566451 36161 nvc_info.c:815] requesting device information with ''
I0226 13:18:46.572480 36161 nvc_info.c:706] listing device /dev/nvidia0 (GPU-92867a25-5ca2-f89b-4bf8-61ba2049a538 at 00000000:61:00.0)
NVRM version:   470.82.01
CUDA version:   11.4

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce RTX 3090
Brand:          GeForce
GPU UUID:       GPU-92867a25-5ca2-f89b-4bf8-61ba2049a538
Bus Location:   00000000:61:00.0
Architecture:   8.6
I0226 13:18:46.572503 36161 nvc.c:430] shutting down library context
I0226 13:18:46.572528 36164 rpc.c:95] terminating nvcgo rpc service
I0226 13:18:46.573021 36161 rpc.c:135] nvcgo rpc service terminated successfully
I0226 13:18:46.573749 36163 rpc.c:95] terminating driver rpc service
I0226 13:18:46.573874 36161 rpc.c:135] driver rpc service terminated successfully

nvidia-smi -a
==============NVSMI LOG==============

Timestamp                                 : Sat Feb 26 21:20:02 2022
Driver Version                            : 470.82.01
CUDA Version                              : 11.4

Attached GPUs                             : 1
GPU 00000000:61:00.0
    Product Name                          : NVIDIA GeForce RTX 3090
    Product Brand                         : GeForce
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-92867a25-5ca2-f89b-4bf8-61ba2049a538
    Minor Number                          : 0
    VBIOS Version                         : 94.02.59.00.D6
    MultiGPU Board                        : No
    Board ID                              : 0x6100
    GPU Part Number                       : N/A
    Module ID                             : 0
    Inforom Version
        Image Version                     : G001.0000.03.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x61
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x220410DE
        Bus Id                            : 00000000:61:00.0
        Sub System Id                     : 0x38801028
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24259 MiB
        Used                              : 222 MiB
        Free                              : 24037 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 22 MiB
        Free                              : 234 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 41 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 28.25 W
        Power Limit                       : 350.00 W
        Default Power Limit               : 350.00 W
        Enforced Power Limit              : 350.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 350.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 9751 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 737.500 mV
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1472
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 18 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1632
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 74 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1925
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 99 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2042
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 27 MiB

sudo docker --version
Docker version 20.10.12, build e91ed57
Hurricane-eye commented 2 years ago

The OS is Ubuntu 18.04, the GPU is an NVIDIA GeForce RTX 3090, and the NVIDIA driver version is 470.82.01.

In reply to: "@Hurricane-eye what is your host configuration (i.e. distribution and version)?"

Hurricane-eye commented 2 years ago

Maybe there are too many driver versions installed on my server?

dpkg -l | grep nvidia
ii  libnvidia-cfg1-470-server:amd64            470.103.01-0ubuntu0.18.04.1                     amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-470-server                470.103.01-0ubuntu0.18.04.1                     all          Shared files used by the NVIDIA libraries
rc  libnvidia-compute-470:amd64                470.86-0ubuntu0.18.04.1                         amd64        NVIDIA libcompute package
ii  libnvidia-compute-470-server:amd64         470.103.01-0ubuntu0.18.04.1                     amd64        NVIDIA libcompute package
ii  libnvidia-container-tools                  1.8.1-1                                         amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                 1.8.1-1                                         amd64        NVIDIA container runtime library
ii  libnvidia-decode-470-server:amd64          470.103.01-0ubuntu0.18.04.1                     amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-encode-470-server:amd64          470.103.01-0ubuntu0.18.04.1                     amd64        NVENC Video Encoding runtime library
ii  libnvidia-extra-470-server:amd64           470.103.01-0ubuntu0.18.04.1                     amd64        Extra libraries for the NVIDIA Server Driver
ii  libnvidia-fbc1-470-server:amd64            470.103.01-0ubuntu0.18.04.1                     amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-470-server:amd64              470.103.01-0ubuntu0.18.04.1                     amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-ifr1-470-server:amd64            470.103.01-0ubuntu0.18.04.1                     amd64        NVIDIA OpenGL-based Inband Frame Readback runtime library
rc  nvidia-compute-utils-470                   470.86-0ubuntu0.18.04.1                         amd64        NVIDIA compute utilities
ii  nvidia-compute-utils-470-server            470.103.01-0ubuntu0.18.04.1                     amd64        NVIDIA compute utilities
ii  nvidia-container-toolkit                   1.8.1-1                                         amd64        NVIDIA container runtime hook
rc  nvidia-dkms-470                            470.86-0ubuntu0.18.04.1                         amd64        NVIDIA DKMS package
ii  nvidia-dkms-470-server                     470.103.01-0ubuntu0.18.04.1                     amd64        NVIDIA DKMS package
ii  nvidia-docker2                             2.9.1-1                                         all          nvidia-docker CLI wrapper
ii  nvidia-driver-470-server                   470.103.01-0ubuntu0.18.04.1                     amd64        NVIDIA Server Driver metapackage
rc  nvidia-kernel-common-470                   470.86-0ubuntu0.18.04.1                         amd64        Shared files used with the kernel module
ii  nvidia-kernel-common-470-server            470.103.01-0ubuntu0.18.04.1                     amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-470-server            470.103.01-0ubuntu0.18.04.1                     amd64        NVIDIA kernel source package
ii  nvidia-prime                               0.8.16~0.18.04.1                                all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                            470.57.01-0ubuntu0.18.04.1                      amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-470-server                    470.103.01-0ubuntu0.18.04.1                     amd64        NVIDIA Server Driver support binaries
ii  xserver-xorg-video-nvidia-470-server       470.103.01-0ubuntu0.18.04.1                     amd64        NVIDIA binary Xorg driver
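
For what it's worth, the rc entries in this listing are packages that were removed but whose configuration files are still on disk; dpkg can purge them individually, e.g. (using one of the package names from the listing above):

sudo dpkg --purge nvidia-dkms-470
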
vishnuu95 commented 1 year ago

Hi! Have you found a solution for this yet?

samos123 commented 1 year ago

I created the symlink manually:

ln -s /sbin/ldconfig /sbin/ldconfig.real

I had to do this inside the Kind node:

docker exec -ti gpu-control-plane bash
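
Put together, the workaround looks like this (the node name gpu-control-plane is specific to this cluster; kind get nodes lists yours):

docker exec -ti gpu-control-plane bash
# now inside the node container:
ln -s /sbin/ldconfig /sbin/ldconfig.real
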
lukeogg commented 1 year ago

ln -s /sbin/ldconfig /sbin/ldconfig.real worked for me. It makes me wonder if maybe something needs to be set in the validator.

elezar commented 1 year ago

ln -s /sbin/ldconfig /sbin/ldconfig.real worked for me. It makes me wonder if maybe something needs to be set in the validator.

What do you mean by validator? Note that on Ubuntu-based distributions where /sbin/ldconfig.real is present, it is not a symlink but the actual executable; /sbin/ldconfig is a wrapper script that injects DPKG update triggers before running ldconfig. There is also an option in /etc/nvidia-container-runtime/config.toml that allows this path to be specified explicitly, to align with the expectations of the platform where the package is installed.

The next release of the NVIDIA Container Toolkit should allow these options to be detected in a more stable manner, ensuring that ldconfig.real is only used if it is actually present.
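
For reference, a sketch of that option as it appears in /etc/nvidia-container-runtime/config.toml (the leading @ tells libnvidia-container to run the binary from the host filesystem; the right value depends on your platform):

[nvidia-container-cli]
  # "@/sbin/ldconfig.real" is typically correct on Ubuntu;
  # plain "@/sbin/ldconfig" on most other distributions
  ldconfig = "@/sbin/ldconfig"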

lukeogg commented 1 year ago

@elezar Thank you for the information; this should help me get to the bottom of it.

What do you mean by validator?

I am installing the NVIDIA GPU Operator on Kind. I was looking at some options to get GPUs working with my cluster. The operator's validator pod nvidia-operator-validator goes into CrashLoopBackOff.

Logs show a failed symlink attempt:

driver-validation time="2023-06-09T23:20:43Z" level=info msg="Creating link /host-dev-char/234:271 => /dev/nvidia-caps/nvidia-cap271"
driver-validation time="2023-06-09T23:20:43Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap271 /host-dev-char/234:271: file exists"

Pod status shows the error with /sbin/ldconfig.real:

Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory: unknown

Creating a symlink "fixed" the error, but there is obviously more to it than that. Maybe there is an option in the NVIDIA Container Toolkit that will resolve this.

cmontemuino commented 10 months ago

I'm using version v1.14.3 and experiencing the same issue as reported by others.

Context: I'm using the GPU Operator, version v23.9.0.

From https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.14.3/internal/config/config.go#L124-L129:

func getLdConfigPath() string {
    if _, err := os.Stat("/sbin/ldconfig.real"); err == nil {
        return "@/sbin/ldconfig.real"
    }
    return "@/sbin/ldconfig"
}

If I SSH into the node and check for the existence of /sbin/ldconfig.real:

stat /sbin/ldconfig.real
stat: cannot statx '/sbin/ldconfig.real': No such file or directory

But when looking at the file /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml generated by the nvidia-container-toolkit DaemonSet:

# ...
[nvidia-container-cli]
  environment = []
  ldconfig = "@/sbin/ldconfig.real"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/"
# ...

It seems that the getLdConfigPath function is not working as expected.

The only fix I have found is to create the symlink, as others have stated: sudo ln -s /sbin/ldconfig /sbin/ldconfig.real.

@elezar is there another way to configure the ldconfig element in config.toml, or are we talking about a known issue in the getLdConfigPath function?
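
For illustration, a host-root-aware variant of that check might look like the following sketch (the getLdConfigPathFrom name and hostRoot parameter are hypothetical, not the toolkit's actual API):

package config

import (
    "os"
    "path/filepath"
)

// getLdConfigPathFrom stats ldconfig.real under an explicit root (e.g. "/host"
// where the host filesystem is mounted into the toolkit container) instead of
// the root of whatever filesystem the binary happens to run in.
func getLdConfigPathFrom(hostRoot string) string {
    if _, err := os.Stat(filepath.Join(hostRoot, "sbin", "ldconfig.real")); err == nil {
        return "@/sbin/ldconfig.real"
    }
    return "@/sbin/ldconfig"
}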

elezar commented 10 months ago

@cmontemuino are you also trying to run the GPU Operator in Kind? If not, what is your host OS on the node where the NVIDIA Container Toolkit is being configured?

There may be an issue with how we're generating the config - especially in the context of the GPU Operator - where we are detecting ldconfig.real in the Ubuntu-based container instead of on the host.

Note that deleting (or commenting) that option from the config should cause the right value to be detected when running the NVIDIA Container Runtime from the host.
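
Concretely, in the generated config.toml shown in the excerpt above, that would look like this (a sketch; the neighbouring keys stay unchanged):

[nvidia-container-cli]
  environment = []
  # ldconfig = "@/sbin/ldconfig.real"  # commented out so the runtime detects the path itself
  load-kmods = true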

cmontemuino commented 10 months ago

Hi @elezar, this is not Kind, but Oracle Linux.

uname -r
5.14.0-284.30.1.el9_2.x86_64

We install Kubernetes (rancher/rke2) plus the NVIDIA driver only, and then the GPU Operator as an Argo CD application.

elezar commented 10 months ago

@cmontemuino other posters here have pointed out that they were using Kind. The symptom is the same, though: any host OS where /sbin/ldconfig.real does not exist will show this behavior when using the default Ubuntu-based base image.

We should definitely make this more resilient, but for now you could consider switching to the container-toolkit:{{VERSION}}-ubi8 image as a workaround.
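
With the GPU Operator, one way to do that is to override the toolkit image in the Helm values (field names follow the gpu-operator chart; the tag below is only an example, so check which tags exist for your release):

toolkit:
  repository: nvcr.io/nvidia/k8s
  image: container-toolkit
  version: v1.14.3-ubi8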

UntouchedWagons commented 10 months ago

Just wanted to pop in to say that /sbin/ldconfig.real doesn't exist on Debian 12 either. I have to symlink it for the GPU stuff to work properly.


elezar commented 10 months ago

Yes, most (if not all) non-Ubuntu distributions don't have the ldconfig -> ldconfig.real wrapper; this includes Debian 12. Since Debian is not an officially supported distribution under the GPU Operator, this has not been a priority at present. Note that Kind uses Debian-based images for its nodes, which is why this is triggered there.

llajas commented 7 months ago

Just wanted to comment that I've been fighting all week to get a GPU working in my k3s cluster using containerd.

The piece that made the entire thing come together was the missing symlink.

ln -s /sbin/ldconfig /sbin/ldconfig.real

Thank you!!!

My setup is as follows: 6 bare-metal nodes, 3 of which are control-plane nodes (tiny form factor, so no PCIe slots for GPU access), plus 1 VM node running on Unraid with an RTX 2060 passed through.

OS: Fedora Linux 38 (Thirty Eight)
Kernel: 6.7.4-100.fc38.x86_64
K3s: v1.28.3+k3s2
containerd: containerd://1.7.7-k3s1

elezar commented 7 months ago

@llajas which version of the NVIDIA Container Toolkit are you using?

llajas commented 7 months ago

@elezar

[root@metal6 ~]# nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.15.0-rc.3
commit: 93e15bc641896a9dc51f297c856c824bf1f45d86

I installed this using sudo yum install -y nvidia-container-toolkit. I see in retrospect that this is a pre-release version, but it is working well for what I'm using it for (audio/video transcoding across a StatefulSet).

lindhe commented 6 months ago

Yes, most (if not all) non-Ubuntu distributions don't have the ldconfig -> ldconfig.real wrapper; this includes Debian 12. Since Debian is not an officially supported distribution under the GPU Operator, this has not been a priority at present. Note that Kind uses Debian-based images for its nodes, which is why this is triggered there.

Good point! I'm on RHEL 8.9, which is supported, and I'm having the same issue. It was fixed by creating the symlink manually.

colinsauze commented 6 months ago

I have also been having this problem on Rocky Linux 9.3 using MicroK8s. The symlink workaround fixes it.