NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Unable to use CUDA Versions 11 when host has CUDA 10.2 Installed #214

Open zyh3826 opened 1 year ago

zyh3826 commented 1 year ago

1. Issue or feature description

system env:

centos7
NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2
nvidia-ctk 1.13.3 commit: c5a93b8d7063a8b1a04872a4e46d449e788ca4de

2. Steps to reproduce the issue

Just run `docker run --rm --gpus all nvidia/cuda:cuda11.8.0-cudnn8-devel-ubuntu20.04`

3. Information to attach (optional if deemed irrelevant)

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.8, please update your driver to a newer version, or use an earlier cuda container: unknown.

My understanding is that nvidia-docker injects all of the NVIDIA driver libraries from the host into the container, so why does this happen? Please help me, thanks a lot.
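The `cuda>=11.8` condition in the error above reads as a version constraint that is checked against the CUDA version reported by the host driver (10.2 for driver 440.33.01). The injected libraries, such as `libcuda.so.440.33.01`, come from that same driver, so they cannot satisfy a newer requirement. The sketch below is not the actual `nvidia-container-cli` logic; it only illustrates the comparison for a single `cuda>=X.Y` term. The function names `version_ge` and `check_cuda_requirement` are hypothetical.

```shell
#!/bin/sh
# Illustrative only: NOT the real nvidia-container-cli implementation.

# Hypothetical helper: succeeds when MAJOR.MINOR version $1 is >= version $2.
# The body runs in a subshell so its variables stay local.
version_ge() (
    have_major=${1%%.*}
    need_major=${2%%.*}
    # Treat a missing minor component as 0.
    case $1 in
        *.*) have_rest=${1#*.}; have_minor=${have_rest%%.*} ;;
        *)   have_minor=0 ;;
    esac
    case $2 in
        *.*) need_rest=${2#*.}; need_minor=${need_rest%%.*} ;;
        *)   need_minor=0 ;;
    esac
    if [ "$have_major" -gt "$need_major" ]; then exit 0; fi
    if [ "$have_major" -lt "$need_major" ]; then exit 1; fi
    [ "$have_minor" -ge "$need_minor" ]
)

# Hypothetical helper: $1 is the CUDA version the driver reports,
# $2 is a single constraint of the form "cuda>=X.Y".
check_cuda_requirement() (
    driver_cuda=$1
    constraint=$2
    required=${constraint#"cuda>="}
    if version_ge "$driver_cuda" "$required"; then
        echo "satisfied: $constraint (driver reports CUDA $driver_cuda)"
    else
        echo "unsatisfied condition: $constraint (driver reports CUDA $driver_cuda)"
        exit 1
    fi
)
```

With the values from this report, `check_cuda_requirement 10.2 'cuda>=11.8'` prints the "unsatisfied condition" line and returns a non-zero status, while a driver reporting CUDA 12.0 would pass the same constraint.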

I0706 07:02:40.035096 2579104 nvc.c:376] initializing library context (version=1.13.3, build=f21fbe1a5f831936aab2796ebd08f5fb6d6c2df3)
I0706 07:02:40.035733 2579104 nvc.c:350] using root /
I0706 07:02:40.035779 2579104 nvc.c:351] using ldcache /etc/ld.so.cache
I0706 07:02:40.035841 2579104 nvc.c:352] using unprivileged user 3021:3021
I0706 07:02:40.036214 2579104 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0706 07:02:40.036681 2579104 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0706 07:02:40.053457 2579105 nvc.c:273] failed to set inheritable capabilities
W0706 07:02:40.053780 2579105 nvc.c:274] skipping kernel modules load due to failure
I0706 07:02:40.056131 2579106 rpc.c:71] starting driver rpc service
I0706 07:02:43.476112 2579120 rpc.c:71] starting nvcgo rpc service
I0706 07:02:43.480061 2579104 nvc_info.c:797] requesting driver information with ''
I0706 07:02:43.485304 2579104 nvc_info.c:175] selecting /usr/lib64/vdpau/libvdpau_nvidia.so.440.33.01
I0706 07:02:43.485960 2579104 nvc_info.c:175] selecting /usr/lib64/libnvoptix.so.440.33.01
I0706 07:02:43.486246 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-tls.so.440.33.01
I0706 07:02:43.486447 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-rtcore.so.440.33.01
I0706 07:02:43.486638 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.440.33.01
I0706 07:02:43.486889 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-opticalflow.so.440.33.01
I0706 07:02:43.487146 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-opencl.so.440.33.01
I0706 07:02:43.487313 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-ml.so.440.33.01
I0706 07:02:43.487554 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-ifr.so.440.33.01
I0706 07:02:43.487801 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-glvkspirv.so.440.33.01
I0706 07:02:43.487984 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-glsi.so.440.33.01
I0706 07:02:43.488155 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-glcore.so.440.33.01
I0706 07:02:43.488333 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-fbc.so.440.33.01
I0706 07:02:43.488567 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-fatbinaryloader.so.440.33.01
I0706 07:02:43.488734 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-encode.so.440.33.01
I0706 07:02:43.488976 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-eglcore.so.440.33.01
I0706 07:02:43.489154 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-compiler.so.440.33.01
I0706 07:02:43.489331 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-cfg.so.440.33.01
I0706 07:02:43.489566 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-cbl.so.440.33.01
I0706 07:02:43.489735 2579104 nvc_info.c:175] selecting /usr/lib64/libnvidia-allocator.so.440.33.01
I0706 07:02:43.489984 2579104 nvc_info.c:175] selecting /usr/lib64/libnvcuvid.so.440.33.01
I0706 07:02:43.490789 2579104 nvc_info.c:175] selecting /usr/lib64/libcuda.so.440.33.01
I0706 07:02:43.491246 2579104 nvc_info.c:175] selecting /usr/lib64/libGLX_nvidia.so.440.33.01
I0706 07:02:43.491422 2579104 nvc_info.c:175] selecting /usr/lib64/libGLESv2_nvidia.so.440.33.01
I0706 07:02:43.491591 2579104 nvc_info.c:175] selecting /usr/lib64/libGLESv1_CM_nvidia.so.440.33.01
I0706 07:02:43.491764 2579104 nvc_info.c:175] selecting /usr/lib64/libEGL_nvidia.so.440.33.01
I0706 07:02:43.491977 2579104 nvc_info.c:175] selecting /usr/lib/vdpau/libvdpau_nvidia.so.440.33.01
I0706 07:02:43.492205 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-tls.so.440.33.01
I0706 07:02:43.492388 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-ptxjitcompiler.so.440.33.01
I0706 07:02:43.492636 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-opticalflow.so.440.33.01
I0706 07:02:43.492877 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-opencl.so.440.33.01
I0706 07:02:43.493060 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-ml.so.440.33.01
I0706 07:02:43.493297 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-ifr.so.440.33.01
I0706 07:02:43.493532 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-glvkspirv.so.440.33.01
I0706 07:02:43.493704 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-glsi.so.440.33.01
I0706 07:02:43.493868 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-glcore.so.440.33.01
I0706 07:02:43.494052 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-fbc.so.440.33.01
I0706 07:02:43.494281 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-fatbinaryloader.so.440.33.01
I0706 07:02:43.494444 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-encode.so.440.33.01
I0706 07:02:43.494674 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-eglcore.so.440.33.01
I0706 07:02:43.494846 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-compiler.so.440.33.01
I0706 07:02:43.495033 2579104 nvc_info.c:175] selecting /usr/lib/libnvidia-allocator.so.440.33.01
I0706 07:02:43.495275 2579104 nvc_info.c:175] selecting /usr/lib/libnvcuvid.so.440.33.01
I0706 07:02:43.495581 2579104 nvc_info.c:175] selecting /usr/lib/libcuda.so.440.33.01
I0706 07:02:43.495867 2579104 nvc_info.c:175] selecting /usr/lib/libGLX_nvidia.so.440.33.01
I0706 07:02:43.496049 2579104 nvc_info.c:175] selecting /usr/lib/libGLESv2_nvidia.so.440.33.01
I0706 07:02:43.496220 2579104 nvc_info.c:175] selecting /usr/lib/libGLESv1_CM_nvidia.so.440.33.01
I0706 07:02:43.496393 2579104 nvc_info.c:175] selecting /usr/lib/libEGL_nvidia.so.440.33.01
W0706 07:02:43.496483 2579104 nvc_info.c:401] missing library libnvidia-nscq.so
W0706 07:02:43.496529 2579104 nvc_info.c:401] missing library libcudadebugger.so
W0706 07:02:43.496571 2579104 nvc_info.c:401] missing library libnvidia-pkcs11.so
W0706 07:02:43.496615 2579104 nvc_info.c:401] missing library libnvidia-pkcs11-openssl3.so
W0706 07:02:43.496662 2579104 nvc_info.c:401] missing library libnvidia-nvvm.so
W0706 07:02:43.496704 2579104 nvc_info.c:401] missing library libnvidia-ngx.so
W0706 07:02:43.496746 2579104 nvc_info.c:405] missing compat32 library libnvidia-cfg.so
W0706 07:02:43.496796 2579104 nvc_info.c:405] missing compat32 library libnvidia-nscq.so
W0706 07:02:43.496837 2579104 nvc_info.c:405] missing compat32 library libcudadebugger.so
W0706 07:02:43.496884 2579104 nvc_info.c:405] missing compat32 library libnvidia-pkcs11.so
W0706 07:02:43.496926 2579104 nvc_info.c:405] missing compat32 library libnvidia-pkcs11-openssl3.so
W0706 07:02:43.496982 2579104 nvc_info.c:405] missing compat32 library libnvidia-nvvm.so
W0706 07:02:43.497024 2579104 nvc_info.c:405] missing compat32 library libnvidia-ngx.so
W0706 07:02:43.497066 2579104 nvc_info.c:405] missing compat32 library libnvidia-rtcore.so
W0706 07:02:43.497114 2579104 nvc_info.c:405] missing compat32 library libnvoptix.so
W0706 07:02:43.497156 2579104 nvc_info.c:405] missing compat32 library libnvidia-cbl.so
I0706 07:02:43.497696 2579104 nvc_info.c:301] selecting /usr/bin/nvidia-smi
I0706 07:02:43.497818 2579104 nvc_info.c:301] selecting /usr/bin/nvidia-debugdump
I0706 07:02:43.497929 2579104 nvc_info.c:301] selecting /usr/bin/nvidia-persistenced
I0706 07:02:43.498115 2579104 nvc_info.c:301] selecting /usr/bin/nvidia-cuda-mps-control
I0706 07:02:43.498234 2579104 nvc_info.c:301] selecting /usr/bin/nvidia-cuda-mps-server
W0706 07:02:43.499486 2579104 nvc_info.c:427] missing binary nv-fabricmanager
W0706 07:02:43.499586 2579104 nvc_info.c:470] missing firmware path /usr/lib/firmware/nvidia/440.33.01/gsp*.bin
I0706 07:02:43.499671 2579104 nvc_info.c:560] listing device /dev/nvidiactl
I0706 07:02:43.499698 2579104 nvc_info.c:560] listing device /dev/nvidia-uvm
I0706 07:02:43.499725 2579104 nvc_info.c:560] listing device /dev/nvidia-uvm-tools
I0706 07:02:43.499750 2579104 nvc_info.c:560] listing device /dev/nvidia-modeset
W0706 07:02:43.500462 2579104 nvc_info.c:351] missing ipc path /var/run/nvidia-persistenced/socket
W0706 07:02:43.500546 2579104 nvc_info.c:351] missing ipc path /var/run/nvidia-fabricmanager/socket
W0706 07:02:43.500845 2579104 nvc_info.c:351] missing ipc path /tmp/nvidia-mps
I0706 07:02:43.500882 2579104 nvc_info.c:853] requesting device information with ''
I0706 07:02:43.513312 2579104 nvc_info.c:744] listing device /dev/nvidia0 (GPU-8caa2b82-cd80-2418-0aad-0e584b4ed5f7 at 00000000:3d:00.0)
I0706 07:02:43.521320 2579104 nvc_info.c:744] listing device /dev/nvidia1 (GPU-6101ed84-8464-24f7-20bc-3bad1159f75b at 00000000:41:00.0)
I0706 07:02:43.529586 2579104 nvc_info.c:744] listing device /dev/nvidia2 (GPU-c783acba-5bc9-c17b-e708-b6a713621b0b at 00000000:b1:00.0)
I0706 07:02:43.538091 2579104 nvc_info.c:744] listing device /dev/nvidia3 (GPU-8b201437-7709-72f1-868a-a5fd9249d9b9 at 00000000:b5:00.0)
NVRM version: 440.33.01
CUDA version: 10.2

Device Index: 0
Device Minor: 0
Model: GeForce RTX 2080 Ti
Brand: GeForce
GPU UUID: GPU-8caa2b82-cd80-2418-0aad-0e584b4ed5f7
Bus Location: 00000000:3d:00.0
Architecture: 7.5

Device Index: 1
Device Minor: 1
Model: GeForce RTX 2080 Ti
Brand: GeForce
GPU UUID: GPU-6101ed84-8464-24f7-20bc-3bad1159f75b
Bus Location: 00000000:41:00.0
Architecture: 7.5

Device Index: 2
Device Minor: 2
Model: GeForce RTX 2080 Ti
Brand: GeForce
GPU UUID: GPU-c783acba-5bc9-c17b-e708-b6a713621b0b
Bus Location: 00000000:b1:00.0
Architecture: 7.5

Device Index: 3
Device Minor: 3
Model: GeForce RTX 2080 Ti
Brand: GeForce
GPU UUID: GPU-8b201437-7709-72f1-868a-a5fd9249d9b9
Bus Location: 00000000:b5:00.0
Architecture: 7.5
I0706 07:02:43.539337 2579104 nvc.c:434] shutting down library context
I0706 07:02:43.539465 2579120 rpc.c:95] terminating nvcgo rpc service
I0706 07:02:43.540686 2579104 rpc.c:135] nvcgo rpc service terminated successfully
I0706 07:02:43.962700 2579106 rpc.c:95] terminating driver rpc service
I0706 07:02:43.962890 2579104 rpc.c:135] driver rpc service terminated successfully

 - [ ] Kernel version from `uname -a`

Linux ai184 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

 - [ ] Any relevant kernel output lines from `dmesg`

[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-1160.el7.x86_64 root=UUID=2b559db5-ea31-4f09-b78b-c01284fee685 ro crashkernel=auto rhgb quiet
[ 0.000000] Reserving 176MB of memory at 608MB for crashkernel (System RAM: 261789MB)
[ 0.000000] Booting paravirtualized kernel on bare hardware
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-1160.el7.x86_64 root=UUID=2b559db5-ea31-4f09-b78b-c01284fee685 ro crashkernel=auto rhgb quiet
[ 0.000000] Memory: 5533732k/270532608k available (7788k kernel code, 2460412k absent, 4621948k reserved, 5954k data, 1984k init)
[ 0.000000] x86/pti: Unmapping kernel while in userspace
[ 0.000000] Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[ 0.094921] Spectre V2 : Mitigation: IBRS (kernel)
[ 0.439208] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 0.439211] TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
[ 3.696098] Loaded X.509 cert 'CentOS Linux kernel signing key: e1fdb0e2a7e861a1d1ca80a23dcf0dba3aa4adf5'
[ 3.705844] BERT: Boot Error Record Table support is disabled. Enable it by using bert_enable as kernel parameter.
[ 3.707394] Freeing unused kernel memory: 1984k freed
[ 3.708190] Write protecting the kernel read-only data: 12288k
[ 3.710256] Freeing unused kernel memory: 392k freed
[ 3.712692] Freeing unused kernel memory: 536k freed
[ 3.885703] systemd[1]: Starting Create list of required static device nodes for the current kernel...
[ 3.891116] systemd[1]: Started Create list of required static device nodes for the current kernel.
[ 4.473966] [TTM] Zone kernel: Available graphics memory: 131799676 kiB
[ 16.458749] nvidia: loading out-of-tree module taints kernel.
[ 16.458760] nvidia: module license 'NVIDIA' taints kernel.
[ 16.458762] Disabling lock debugging due to kernel taint
[ 16.576377] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[22429.972701] perf: interrupt took too long (2530 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[51878.620788] perf: interrupt took too long (3186 > 3162), lowering kernel.perf_event_max_sample_rate to 62000
[70787.553264] perf: interrupt took too long (3998 > 3982), lowering kernel.perf_event_max_sample_rate to 50000
[114608.867467] perf: interrupt took too long (4998 > 4997), lowering kernel.perf_event_max_sample_rate to 40000
[187099.384473] perf: interrupt took too long (6251 > 6247), lowering kernel.perf_event_max_sample_rate to 31000
[278468.371783] perf: interrupt took too long (7821 > 7813), lowering kernel.perf_event_max_sample_rate to 25000

 - [ ] Driver information from `nvidia-smi -a`
The output is really long; I will provide it if asked.
 - [ ] Docker version from `docker version`
 `20.10.7`
 - [ ] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

libnvidia-container1-1.13.3-1.x86_64
nvidia-container-toolkit-1.13.3-1.x86_64
nvidia-container-toolkit-base-1.13.3-1.x86_64
libnvidia-container-tools-1.13.3-1.x86_64
nvidia-docker2-2.13.0-1.noarch

 - [ ] NVIDIA container library version from `nvidia-container-cli -V`

cli-version: 1.13.3
lib-version: 1.13.3
build date: 2023-06-27T18:49+0000
build revision: f21fbe1a5f831936aab2796ebd08f5fb6d6c2df3
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections


 - [ ] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))
 - [ ] Docker command, image and tag used

asoans commented 1 year ago

Did you try updating the NVIDIA driver and toolkit? You could also check that the Docker image is built with a CUDA version that your driver supports.

Reconfigure the toolkit:
`sudo systemctl restart nvidia-container-runtime`
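
One way to act on the suggestion to check the image against the driver is to read the CUDA version from the `nvidia-smi` header and compare it with the version encoded in the image tag. The sketch below assumes the header contains a `CUDA Version: X.Y` field, as in the report above, and that the tag carries the CUDA version as its first `X.Y` number. The helper names `driver_cuda_version` and `image_cuda_version` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read nvidia-smi output on stdin and print the
# MAJOR.MINOR value of the first "CUDA Version:" field found.
driver_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: print the first MAJOR.MINOR number in an image tag,
# e.g. "11.8.0-cudnn8-devel-ubuntu20.04" gives "11.8".
# Assumes the tag starts with the CUDA version, optionally after a
# non-numeric prefix such as "cuda".
image_cuda_version() {
    printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}
```

On the host, `nvidia-smi | driver_cuda_version` would print `10.2` for the driver in this report, while `image_cuda_version 11.8.0-cudnn8-devel-ubuntu20.04` prints `11.8`, which makes the mismatch visible before the container is started.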