NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

driver rpc error: failed to process request: unknown. #236

Status: Open · alexanderek opened this issue 1 year ago

alexanderek commented 1 year ago

1. Issue or feature description

Hello!

I get an error when I start the container:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: driver rpc error: failed to process request: unknown.

OS: Oracle Linux 9 (RHCK 5.14.0-70)

2. Steps to reproduce the issue

docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

3. Information to attach (optional if deemed irrelevant)

I1027 11:13:07.177605 1596 nvc.c:376] initializing library context (version=1.11.0, build=c8f267be0bac1c654d59ad4ea5df907141149977)
I1027 11:13:07.177638 1596 nvc.c:350] using root /
I1027 11:13:07.177640 1596 nvc.c:351] using ldcache /etc/ld.so.cache
I1027 11:13:07.177643 1596 nvc.c:352] using unprivileged user 1000:1000
I1027 11:13:07.177657 1596 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1027 11:13:07.177712 1596 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W1027 11:13:07.179736 1597 nvc.c:273] failed to set inheritable capabilities
W1027 11:13:07.179858 1597 nvc.c:274] skipping kernel modules load due to failure
I1027 11:13:07.180240 1598 rpc.c:71] starting driver rpc service
I1027 11:13:07.402169 1602 rpc.c:71] starting nvcgo rpc service
I1027 11:13:07.402896 1596 nvc_info.c:766] requesting driver information with ''
I1027 11:13:07.403795 1596 nvc_info.c:173] selecting /usr/lib64/libnvoptix.so.520.61.05
I1027 11:13:07.403935 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-tls.so.520.61.05
I1027 11:13:07.404049 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-rtcore.so.520.61.05
I1027 11:13:07.404086 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.520.61.05
I1027 11:13:07.404134 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-opticalflow.so.520.61.05
I1027 11:13:07.404192 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-opencl.so.520.61.05
I1027 11:13:07.404315 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-ngx.so.520.61.05
I1027 11:13:07.404428 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-ml.so.520.61.05
I1027 11:13:07.404536 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-glvkspirv.so.520.61.05
I1027 11:13:07.404597 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-glsi.so.520.61.05
I1027 11:13:07.404660 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-glcore.so.520.61.05
I1027 11:13:07.404728 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-fbc.so.520.61.05
I1027 11:13:07.404800 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-encode.so.520.61.05
I1027 11:13:07.404869 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-eglcore.so.520.61.05
I1027 11:13:07.404935 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-compiler.so.520.61.05
I1027 11:13:07.405108 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-cfg.so.520.61.05
I1027 11:13:07.405217 1596 nvc_info.c:173] selecting /usr/lib64/libnvidia-allocator.so.520.61.05
I1027 11:13:07.405286 1596 nvc_info.c:173] selecting /usr/lib64/libnvcuvid.so.520.61.05
I1027 11:13:07.405396 1596 nvc_info.c:173] selecting /usr/lib64/libcudadebugger.so.520.61.05
I1027 11:13:07.405455 1596 nvc_info.c:173] selecting /usr/lib64/libcuda.so.520.61.05
I1027 11:13:07.405537 1596 nvc_info.c:173] selecting /usr/lib64/libGLX_nvidia.so.520.61.05
I1027 11:13:07.405593 1596 nvc_info.c:173] selecting /usr/lib64/libGLESv2_nvidia.so.520.61.05
I1027 11:13:07.405650 1596 nvc_info.c:173] selecting /usr/lib64/libGLESv1_CM_nvidia.so.520.61.05
I1027 11:13:07.405706 1596 nvc_info.c:173] selecting /usr/lib64/libEGL_nvidia.so.520.61.05
W1027 11:13:07.405751 1596 nvc_info.c:399] missing library libnvidia-nscq.so
W1027 11:13:07.405784 1596 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W1027 11:13:07.405816 1596 nvc_info.c:399] missing library libnvidia-pkcs11.so
W1027 11:13:07.405847 1596 nvc_info.c:399] missing library libvdpau_nvidia.so
W1027 11:13:07.405879 1596 nvc_info.c:399] missing library libnvidia-ifr.so
W1027 11:13:07.405910 1596 nvc_info.c:399] missing library libnvidia-cbl.so
W1027 11:13:07.405942 1596 nvc_info.c:403] missing compat32 library libnvidia-ml.so
W1027 11:13:07.405973 1596 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W1027 11:13:07.406005 1596 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W1027 11:13:07.406053 1596 nvc_info.c:403] missing compat32 library libcuda.so
W1027 11:13:07.406087 1596 nvc_info.c:403] missing compat32 library libcudadebugger.so
W1027 11:13:07.406124 1596 nvc_info.c:403] missing compat32 library libnvidia-opencl.so
W1027 11:13:07.406156 1596 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so
W1027 11:13:07.406188 1596 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W1027 11:13:07.406220 1596 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W1027 11:13:07.406252 1596 nvc_info.c:403] missing compat32 library libnvidia-compiler.so
W1027 11:13:07.406283 1596 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W1027 11:13:07.406315 1596 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W1027 11:13:07.406346 1596 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W1027 11:13:07.406371 1596 nvc_info.c:403] missing compat32 library libnvidia-encode.so
W1027 11:13:07.406387 1596 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W1027 11:13:07.406395 1596 nvc_info.c:403] missing compat32 library libnvcuvid.so
W1027 11:13:07.406412 1596 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W1027 11:13:07.406428 1596 nvc_info.c:403] missing compat32 library libnvidia-glcore.so
W1027 11:13:07.406436 1596 nvc_info.c:403] missing compat32 library libnvidia-tls.so
W1027 11:13:07.406452 1596 nvc_info.c:403] missing compat32 library libnvidia-glsi.so
W1027 11:13:07.406469 1596 nvc_info.c:403] missing compat32 library libnvidia-fbc.so
W1027 11:13:07.406484 1596 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W1027 11:13:07.406493 1596 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W1027 11:13:07.406509 1596 nvc_info.c:403] missing compat32 library libnvoptix.so
W1027 11:13:07.406524 1596 nvc_info.c:403] missing compat32 library libGLX_nvidia.so
W1027 11:13:07.406532 1596 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W1027 11:13:07.406548 1596 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W1027 11:13:07.406565 1596 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so
W1027 11:13:07.406572 1596 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W1027 11:13:07.406588 1596 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I1027 11:13:07.406765 1596 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I1027 11:13:07.406811 1596 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I1027 11:13:07.406853 1596 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I1027 11:13:07.406902 1596 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I1027 11:13:07.406944 1596 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W1027 11:13:07.407082 1596 nvc_info.c:425] missing binary nv-fabricmanager
I1027 11:13:07.407179 1596 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/520.61.05/gsp.bin
I1027 11:13:07.407264 1596 nvc_info.c:529] listing device /dev/nvidiactl
I1027 11:13:07.407337 1596 nvc_info.c:529] listing device /dev/nvidia-uvm
I1027 11:13:07.407400 1596 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I1027 11:13:07.407448 1596 nvc_info.c:529] listing device /dev/nvidia-modeset
W1027 11:13:07.407494 1596 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket
W1027 11:13:07.407556 1596 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W1027 11:13:07.407615 1596 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I1027 11:13:07.407645 1596 nvc_info.c:822] requesting device information with ''
I1027 11:13:07.413243 1596 nvc_info.c:713] listing device /dev/nvidia0 (GPU-787edf48-0558-0321-6279-83ee3c3929f8 at 00000000:01:00.0)
NVRM version: 520.61.05
CUDA version: 11.8

Device Index: 0
Device Minor: 0
Model: Quadro P600
Brand: Quadro
GPU UUID: GPU-787edf48-0558-0321-6279-83ee3c3929f8
Bus Location: 00000000:01:00.0
Architecture: 6.1
I1027 11:13:07.413421 1596 nvc.c:434] shutting down library context
I1027 11:13:07.413510 1602 rpc.c:95] terminating nvcgo rpc service
I1027 11:13:07.414005 1596 rpc.c:135] nvcgo rpc service terminated successfully
I1027 11:13:07.637618 1598 rpc.c:95] terminating driver rpc service
I1027 11:13:07.637869 1596 rpc.c:135] driver rpc service terminated successfully


 - [x] Kernel version from `uname -a`
`Linux linux-nv-test 5.14.0-70.26.1.0.1.el9_0.x86_64 #1 SMP PREEMPT Wed Sep 21 11:13:01 PDT 2022 x86_64 x86_64 x86_64 GNU/Linux`

 - [x] Any relevant kernel output lines from `dmesg`

[Thu Oct 27 14:29:30 2022] docker0: port 1(vethd809d07) entered blocking state
[Thu Oct 27 14:29:30 2022] docker0: port 1(vethd809d07) entered disabled state
[Thu Oct 27 14:29:30 2022] device vethd809d07 entered promiscuous mode
[Thu Oct 27 14:29:30 2022] nvc:[driver][1676]: segfault at 38 ip 00007f380b81a89f sp 00007fff2d68df80 error 4 in libtirpc.so.3.0.0[7f380b804000+1c000]
[Thu Oct 27 14:29:30 2022] Code: 44 00 00 4c 8d 35 a1 12 01 00 4c 89 f7 e8 29 ab fe ff e8 a4 ae fe ff 41 39 c4 7d 13 48 8b 05 80 12 01 00 49 63 fc 48 8d 04 f8 <48> 3b 18 74 14 5b 4c 89 f7 5d 41 5c 41 5d 41 5e e9 cc b4 fe ff 0f
[Thu Oct 27 14:29:30 2022] docker0: port 1(vethd809d07) entered disabled state
[Thu Oct 27 14:29:30 2022] device vethd809d07 left promiscuous mode
[Thu Oct 27 14:29:30 2022] docker0: port 1(vethd809d07) entered disabled state


 - [x] Driver information from `nvidia-smi -a`

==============NVSMI LOG==============

Timestamp : Thu Oct 27 14:30:48 2022
Driver Version : 520.61.05
CUDA Version : 11.8

Attached GPUs : 1
GPU 00000000:01:00.0
    Product Name : Quadro P600
    Product Brand : Quadro
    Product Architecture : Pascal
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0422118005422
    GPU UUID : GPU-787edf48-0558-0321-6279-83ee3c3929f8
    Minor Number : 0
    VBIOS Version : 86.07.3B.00.49
    MultiGPU Board : No
    Board ID : 0x100
    GPU Part Number : 900-5G212-1720-000
    Module ID : 0
    Inforom Version
        Image Version : G212.0501.00.01
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GSP Firmware Version : N/A
    GPU Virtualization Mode
        Virtualization Mode : Pass-Through
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x01
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1CB210DE
        Bus Id : 00000000:01:00.0
        Sub System Id : 0x11BD10DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 3
            Link Width
                Max : 16x
                Current : 8x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : 36 %
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        HW Thermal Slowdown : Not Active
        HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 2048 MiB
        Reserved : 47 MiB
        Used : 0 MiB
        Free : 2000 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 2 MiB
        Free : 254 MiB
    Compute Mode : Default
    Utilization
        Gpu : 2 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows : N/A
    Temperature
        GPU Current Temp : 44 C
        GPU Shutdown Temp : 103 C
        GPU Slowdown Temp : 100 C
        GPU Max Operating Temp : N/A
        GPU Target Temperature : 83 C
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : N/A
        Power Draw : N/A
        Power Limit : N/A
        Default Power Limit : N/A
        Enforced Power Limit : N/A
        Min Power Limit : N/A
        Max Power Limit : N/A
    Clocks
        Graphics : 1328 MHz
        SM : 1328 MHz
        Memory : 2004 MHz
        Video : 1189 MHz
    Applications Clocks
        Graphics : 1328 MHz
        Memory : 2005 MHz
    Default Applications Clocks
        Graphics : 1328 MHz
        Memory : 2005 MHz
    Max Clocks
        Graphics : 1620 MHz
        SM : 1620 MHz
        Memory : 2005 MHz
        Video : 1455 MHz
    Max Customer Boost Clocks
        Graphics : 1620 MHz
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Voltage
        Graphics : N/A
    Processes : None


 - [x] Docker version from `docker version`

Client: Docker Engine - Community
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        baeda1f
 Built:             Tue Oct 25 18:02:16 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       3056208
  Built:            Tue Oct 25 18:00:01 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.9
  GitCommit:        1c90a442489720eec95342e1789ee8a5e1b9536f
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0


 - [ ] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

nvidia-driver-NVML-520.61.05-1.el9.x86_64
nvidia-libXNVCtrl-520.61.05-1.el9.x86_64
nvidia-driver-libs-520.61.05-1.el9.x86_64
nvidia-driver-cuda-libs-520.61.05-1.el9.x86_64
nvidia-driver-NvFBCOpenGL-520.61.05-1.el9.x86_64
nvidia-driver-devel-520.61.05-1.el9.x86_64
nvidia-libXNVCtrl-devel-520.61.05-1.el9.x86_64
nvidia-persistenced-520.61.05-1.el9.x86_64
nvidia-driver-cuda-520.61.05-1.el9.x86_64
dnf-plugin-nvidia-2.0-1.el9.noarch
kmod-nvidia-latest-dkms-520.61.05-1.el9.x86_64
nvidia-kmod-common-520.61.05-1.el9.noarch
nvidia-driver-520.61.05-1.el9.x86_64
nvidia-modprobe-520.61.05-1.el9.x86_64
nvidia-settings-520.61.05-1.el9.x86_64
nvidia-xconfig-520.61.05-1.el9.x86_64
nvidia-container-toolkit-base-1.11.0-1.x86_64
libnvidia-container1-1.11.0-1.x86_64
libnvidia-container-tools-1.11.0-1.x86_64
nvidia-container-toolkit-1.11.0-1.x86_64
nvidia-docker2-2.11.0-1.noarch


 - [x] NVIDIA container library version from `nvidia-container-cli -V`

build date: 2022-09-06T09:25+00:00
build revision: c8f267be0bac1c654d59ad4ea5df907141149977
build compiler: gcc 8.5.0 20210514 (Red Hat 8.5.0-15)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -I/usr/include/tirpc -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections


- [x] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))

/var/log/nvidia-container-toolkit.log

I1027 11:10:05.400586 1467 nvc.c:376] initializing library context (version=1.11.0, build=c8f267be0bac1c654d59ad4ea5df907141149977)
I1027 11:10:05.400624 1467 nvc.c:350] using root /
I1027 11:10:05.400627 1467 nvc.c:351] using ldcache /etc/ld.so.cache
I1027 11:10:05.400630 1467 nvc.c:352] using unprivileged user 65534:65534
I1027 11:10:05.400643 1467 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1027 11:10:05.400700 1467 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I1027 11:10:05.402899 1472 nvc.c:278] loading kernel module nvidia
I1027 11:10:05.403017 1472 nvc.c:282] running mknod for /dev/nvidiactl
I1027 11:10:05.403047 1472 nvc.c:286] running mknod for /dev/nvidia0
I1027 11:10:05.403077 1472 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I1027 11:10:05.408697 1472 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I1027 11:10:05.408797 1472 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I1027 11:10:05.410562 1472 nvc.c:296] loading kernel module nvidia_uvm
I1027 11:10:05.410598 1472 nvc.c:300] running mknod for /dev/nvidia-uvm
I1027 11:10:05.410653 1472 nvc.c:305] loading kernel module nvidia_modeset
I1027 11:10:05.410681 1472 nvc.c:309] running mknod for /dev/nvidia-modeset
I1027 11:10:05.410980 1473 rpc.c:71] starting driver rpc service
I1027 11:10:05.411776 1473 rpc.c:95] terminating driver rpc service
I1027 11:10:05.479960 1467 rpc.c:135] driver rpc service terminated with signal 11
I1027 11:10:05.480022 1467 nvc.c:434] shutting down library context



- [x] Docker command, image and tag used
`docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi`

elezar commented 1 year ago

@alexanderek since Oracle Linux is not an officially supported distribution, may I ask which packages you installed?

alexanderek commented 1 year ago

@elezar I used the packages for RHEL 9.

NVIDIA Drivers:

sudo dnf install -y epel-release
sudo dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)

distribution=rhel9
ARCH=$( /bin/arch )

sudo dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/$distribution/${ARCH}/cuda-$distribution.repo 
sudo dnf module install nvidia-driver:latest-dkms/default

NVIDIA Container Toolkit:

distribution=rhel9.0

curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

sudo dnf clean expire-cache --refresh
sudo dnf install -y nvidia-docker2
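
(Note: after installing nvidia-docker2 the Docker daemon needs a restart so it picks up the nvidia runtime, and the library can be sanity-checked on the host before trying a container. The commands below are the usual ones and may need adjusting for a specific setup.)

sudo systemctl restart docker
nvidia-container-cli info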

Installed:

nvidia-driver-NVML-520.61.05-1.el9.x86_64
nvidia-libXNVCtrl-520.61.05-1.el9.x86_64
nvidia-driver-libs-520.61.05-1.el9.x86_64
nvidia-driver-cuda-libs-520.61.05-1.el9.x86_64
nvidia-driver-NvFBCOpenGL-520.61.05-1.el9.x86_64
nvidia-driver-devel-520.61.05-1.el9.x86_64
nvidia-libXNVCtrl-devel-520.61.05-1.el9.x86_64
nvidia-persistenced-520.61.05-1.el9.x86_64
nvidia-driver-cuda-520.61.05-1.el9.x86_64
dnf-plugin-nvidia-2.0-1.el9.noarch
kmod-nvidia-latest-dkms-520.61.05-1.el9.x86_64
nvidia-kmod-common-520.61.05-1.el9.noarch
nvidia-driver-520.61.05-1.el9.x86_64
nvidia-modprobe-520.61.05-1.el9.x86_64
nvidia-settings-520.61.05-1.el9.x86_64
nvidia-xconfig-520.61.05-1.el9.x86_64
nvidia-container-toolkit-base-1.11.0-1.x86_64
libnvidia-container1-1.11.0-1.x86_64
libnvidia-container-tools-1.11.0-1.x86_64
nvidia-container-toolkit-1.11.0-1.x86_64
nvidia-docker2-2.11.0-1.noarch

I know that RHEL officially supports only Podman, but I was hoping to get Docker working with NVIDIA anyway. :)

JBrenesS commented 1 year ago

@elezar I'm struggling with the same error. Do you have any suggestions to fix it? In my case, when I installed k3s on my machine it worked, but once I removed k3s the error came back.

elezar commented 1 year ago

The initial nvidia-smi output shows that persistence mode is disabled. Could you enable it and try again?
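
For reference, persistence mode can be enabled either directly via nvidia-smi or by keeping the persistence daemon running (the exact service name may vary by distribution):

sudo nvidia-smi -pm 1
# or, preferably, run the persistence daemon:
sudo systemctl enable --now nvidia-persistenced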

michikite commented 1 year ago

I had the same error after installing Citrix. Removing icaclient resolved it for me: `sudo apt remove icaclient && sudo apt purge icaclient`. I guess it had something to do with nvidia-persistenced running as "citrix-nvidia-presistenced".
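
Before removing anything, it may be worth checking which binary is actually behind the persistence daemon; something along these lines (Debian/Ubuntu commands, adjust as needed):

systemctl status nvidia-persistenced
ps -eo pid,cmd | grep -i persistenced
dpkg -l icaclient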

eliminyro commented 1 year ago

I have a very similar setup, only the GPU is an A4000. The error is the same, down to the letter, even the messages in dmesg. The issue occurs with the latest driver, SELinux disabled, all the packages from above installed, and persistence mode enabled. The host is essentially a clean machine; nothing besides Docker and the NVIDIA packages was installed. Looks like a problem with the driver or the toolkit to me...

ngseer commented 1 year ago

I have the same symptoms. A month ago (or so) it worked just fine. I'm using a GTX 750 Ti on an Arch Linux server with the 6.1.12 kernel, the 525.89.02 driver, and the latest container toolkit. No Citrix, and persistence mode is on.

eliminyro commented 1 year ago

@ngseer I actually came from Arch (with the same versions you named) because I thought changing the distro might help. I thought wrong.

quoing commented 1 year ago

Same issue on Arch Linux with a Quadro P400: nvc:[driver][78003]: segfault in libtirpc.so.3.0.0

Update: I downgraded the NVIDIA driver (525.89.02 -> 525.78.01) and Docker (23.0.1 -> 20.10.22), which lets me run nvidia-smi in a container properly again.

# docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Sat Mar  4 13:17:09 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P400         Off  | 00000000:00:09.0 Off |                  N/A |
| 29%   43C    P0    N/A /  30W |      0MiB /  2048MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Update 2: I upgraded Docker again and nvidia-smi still runs in the container, so it's probably just an NVIDIA driver version issue.

# docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Sat Mar  4 13:35:44 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
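
For anyone on Arch who needs the same workaround: downgrading from the local pacman cache is usually enough, roughly as below (the exact package file names are illustrative and depend on what is still in /var/cache/pacman/pkg):

sudo pacman -U /var/cache/pacman/pkg/nvidia-525.78.01-*.pkg.tar.zst /var/cache/pacman/pkg/nvidia-utils-525.78.01-*.pkg.tar.zst
sudo pacman -U /var/cache/pacman/pkg/docker-1:20.10.22-*.pkg.tar.zst
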
ngseer commented 1 year ago

I can confirm that the 525.78.01 Nvidia driver can be used as a temporary workaround.

ngseer commented 1 year ago

It stopped working after another system upgrade (I had put the NVIDIA and kernel packages on an ignore list, so they were not affected). I've already tried downgrading Docker and the container toolkit and upgrading the NVIDIA drivers to the latest version, but still no luck. That almost brought me back to 2012, when Arch was fun rather than stable :)

So the issue still persists.

@quoing how's it going on your side? Have you upgraded any packages ever since?
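
(For anyone wanting to hold packages back the same way, pacman's IgnorePkg setting in /etc/pacman.conf is the usual mechanism; the package names below are examples and depend on the driver and kernel flavour installed:)

# /etc/pacman.conf
IgnorePkg = nvidia nvidia-utils linux linux-headers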

quoing commented 1 year ago

@ngseer

My NVIDIA drivers are up to date, but I'm still running the older Docker. I haven't tried upgrading recently, so I can't tell whether anything newer than 23.0.1 works or not.

$ yay -Q docker nvidia-lts linux-lts
docker 1:20.10.22-1
nvidia-lts 1:530.41.03-3
linux-lts 6.1.23-1
$ yay -Ss docker | grep '/docker '
community/docker 1:23.0.3-1 (26.2 MiB 104.4 MiB) (Installed: 1:20.10.22-1)
ngseer commented 1 year ago

@quoing thanks a lot! Works like a charm with Docker 1:20.10.23-1 and the latest versions of everything else.

chrmat commented 11 months ago

> I had the same error after installing Citrix. Removing icaclient resolved it for me: `sudo apt remove icaclient && sudo apt purge icaclient`. I guess it had something to do with nvidia-persistenced running as "citrix-nvidia-presistenced".

Perfect, after lots of searching this solved the issue for me!

chaddupuis commented 11 months ago

> I had the same error after installing Citrix. Removing icaclient resolved it for me: `sudo apt remove icaclient && sudo apt purge icaclient`. I guess it had something to do with nvidia-persistenced running as "citrix-nvidia-presistenced".

Many thanks for this - saved me from even more hours getting this going! I was seeing the issue on Ubuntu 22.04 with newer drivers ("nvidia-container-cli: initialization error: driver rpc error: failed to process request: unknown."). I did `apt remove icaclient`, ran the same command again (`docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.0-devel-ubuntu22.04 nvidia-smi`), and all was well.

I then re-installed icaclient and it still worked (I never did the purge either). So far it looks like just moving it out of the way for a minute was enough...

kosimas commented 6 months ago

I have the same issue. icaclient is not installed and nvidia-persistenced.service is enabled and running.

$ neofetch
OS: EndeavourOS Linux x86_64
Kernel: 6.6.22-1-lts
Uptime: 17 mins
Packages: 1098 (pacman)
Shell: bash 5.2.26
Resolution: 1920x1080
Terminal: /dev/pts/0
CPU: Intel i7-7700 (8) @ 4.200GHz
GPU: NVIDIA GeForce RTX 3060 Ti
GPU: NVIDIA GeForce RTX 3060 Ti Lite Hash Rate
GPU: Intel HD Graphics 630
GPU: NVIDIA GeForce RTX 3060 Ti Lite Hash Rate
Memory: 1086MiB / 3613MiB
$ pacman -Qs docker
local/docker 1:26.0.0-1
    Pack, ship and run any application as a lightweight container
$ pacman -Qs nvidia
local/cuda 12.4.0-2
    NVIDIA's GPU programming toolkit
local/cuda-tools 12.4.0-2
    NVIDIA's GPU programming toolkit (extra tools: nvvp, nsight)
local/egl-wayland 2:1.1.13-1
    EGLStream-based Wayland external platform
local/libnvidia-container 1.14.6-3
    NVIDIA container runtime library
local/libvdpau 1.5-2
    Nvidia VDPAU library
local/libxnvctrl 550.67-1
    NVIDIA NV-CONTROL X extension
local/nvidia-container-toolkit 1.14.6-3
    NVIDIA container runtime toolkit
local/nvidia-dkms 550.67-1
    NVIDIA drivers - module sources
local/nvidia-hook 1.5-1
    pacman hook for nvidia
local/nvidia-inst 24-1
    Script to setup nvidia drivers (dkms version) in EndeavourOS
local/nvidia-settings 550.67-1
    Tool for configuring the NVIDIA graphics driver
local/nvidia-utils 550.67-1
    NVIDIA drivers utilities
local/opencl-nvidia 550.67-1
    OpenCL implemention for NVIDIA
$ nvidia-smi
Tue Mar 26 16:17:39 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 Ti     On  |   00000000:02:00.0 Off |                  N/A |
|  0%   44C    P8             13W /  200W |       2MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060 Ti     On  |   00000000:06:00.0 Off |                  N/A |
|  0%   49C    P8             18W /  200W |       2MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3060 Ti     On  |   00000000:08:00.0 Off |                  N/A |
|  0%   41C    P8             15W /  200W |       2MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
cargaona commented 4 months ago

Got the same error.

I checked dmesg and found the following messages


[ 1000.677101] __vm_enough_memory: pid: 10152, comm: nvc:[driver], not enough memory for the allocation
[ 1000.677112] __vm_enough_memory: pid: 10152, comm: nvc:[driver], not enough memory for the allocation
[ 1000.677115] __vm_enough_memory: pid: 10152, comm: nvc:[driver], not enough memory for the allocation
[ 1000.677218] nvc:[driver][10152]: segfault at 30 ip 0000742b9c28ee21 sp 00007ffe006a4450 error 4 in libtirpc.so.3.0.0[742b9c27a000+1b000] likely on CPU 0 (core 0, socket 0)
[ 1000.677232] Code: 00 4c 8d 35 41 0d 01 00 4c 89 f7 ff 15 58 f9 00 00 ff 15 4a fb 00 00 41 39 c4 7d 17 48 8b 05 1e 0d 01 00 49 63 fc 48 8d 04 f8 <48> 3b 18 0f 84 06 01

It claims there is not enough memory, but I had plenty. I started freeing up memory, but even when my box was using only 300 MB of RAM I got the same message.

sudo sysctl -w vm.overcommit_memory=1

Did the trick for me.

https://github.com/vmware/photon/issues/1461#issuecomment-1853495295
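
To make that setting survive a reboot, it can also go into a sysctl drop-in (the file name here is just an example):

echo 'vm.overcommit_memory = 1' | sudo tee /etc/sysctl.d/99-vm-overcommit.conf
sudo sysctl --system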