NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

Nvidia-docker issues with jupyter-notebooks #1042

Closed by kapara-jpg 5 years ago

kapara-jpg commented 5 years ago

1. Issue or feature description

When trying to deploy jupyter-notebook with jupyterhub I get this error:

2019-08-08 10:02:17+00:00 [Warning] Error: failed to start container "notebook": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=no-gpu-has-1MiB-to-run --compute --utility --require=cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 --pid=13507 /var/lib/docker/overlay2/7e1caede1d313bdd0a23dbc1c841b130067c512115c9d2a15af64263d8c12c1e/merged]\\\\nnvidia-container-cli: device error: unknown device id: no-gpu-has-1MiB-to-run\\\\n\\\"\"": unknown

I'm using the gpushare-device-plugin by Aliyun (Alibaba Cloud) Container Service (link).

This issue happens only when trying to deploy the notebook through k8s.

2. Steps to reproduce the issue

It happens every time I try to create a new notebook image.

3. Information to attach (optional if deemed irrelevant)

I0808 10:14:55.825632 21204 nvc.c:281] initializing library context (version=1.0.2, build=ff40da533db929bf515aca59ba4c701a65a35e6b)
I0808 10:14:55.825762 21204 nvc.c:255] using root /
I0808 10:14:55.825781 21204 nvc.c:256] using ldcache /etc/ld.so.cache
I0808 10:14:55.825810 21204 nvc.c:257] using unprivileged user 65534:65534
I0808 10:14:55.828359 21205 nvc.c:191] loading kernel module nvidia
I0808 10:14:55.828999 21205 nvc.c:203] loading kernel module nvidia_uvm
I0808 10:14:55.829430 21205 nvc.c:211] loading kernel module nvidia_modeset
I0808 10:14:55.830193 21206 driver.c:133] starting driver service
I0808 10:14:55.854227 21204 nvc_info.c:434] requesting driver information with ''
I0808 10:14:55.854658 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/tls/libnvidia-tls.so.418.67
I0808 10:14:55.854777 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.418.67 over /usr/lib/x86_64-linux-gnu/tls/libnvidia-tls.so.418.67
I0808 10:14:55.854873 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.418.67
I0808 10:14:55.855004 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.418.67
I0808 10:14:55.855135 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.418.67
I0808 10:14:55.855224 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.418.67
I0808 10:14:55.855360 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.418.67
I0808 10:14:55.855487 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.418.67
I0808 10:14:55.855577 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.418.67
I0808 10:14:55.855667 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.418.67
I0808 10:14:55.855791 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.418.67
I0808 10:14:55.855881 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.418.67
I0808 10:14:55.856004 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.418.67
I0808 10:14:55.856189 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.418.67
I0808 10:14:55.856287 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.418.67
I0808 10:14:55.856420 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.418.67
I0808 10:14:55.856717 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.418.67
I0808 10:14:55.856902 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.418.67
I0808 10:14:55.857001 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.418.67
I0808 10:14:55.857093 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.418.67
I0808 10:14:55.857180 21204 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.418.67
W0808 10:14:55.857241 21204 nvc_info.c:299] missing library libvdpau_nvidia.so
W0808 10:14:55.857260 21204 nvc_info.c:303] missing compat32 library libnvidia-ml.so
W0808 10:14:55.857282 21204 nvc_info.c:303] missing compat32 library libnvidia-cfg.so
W0808 10:14:55.857301 21204 nvc_info.c:303] missing compat32 library libcuda.so
W0808 10:14:55.857323 21204 nvc_info.c:303] missing compat32 library libnvidia-opencl.so
W0808 10:14:55.857338 21204 nvc_info.c:303] missing compat32 library libnvidia-ptxjitcompiler.so
W0808 10:14:55.857357 21204 nvc_info.c:303] missing compat32 library libnvidia-fatbinaryloader.so
W0808 10:14:55.857378 21204 nvc_info.c:303] missing compat32 library libnvidia-compiler.so
W0808 10:14:55.857394 21204 nvc_info.c:303] missing compat32 library libvdpau_nvidia.so
W0808 10:14:55.857416 21204 nvc_info.c:303] missing compat32 library libnvidia-encode.so
W0808 10:14:55.857435 21204 nvc_info.c:303] missing compat32 library libnvidia-opticalflow.so
W0808 10:14:55.857457 21204 nvc_info.c:303] missing compat32 library libnvcuvid.so
W0808 10:14:55.857471 21204 nvc_info.c:303] missing compat32 library libnvidia-eglcore.so
W0808 10:14:55.857488 21204 nvc_info.c:303] missing compat32 library libnvidia-glcore.so
W0808 10:14:55.857510 21204 nvc_info.c:303] missing compat32 library libnvidia-tls.so
W0808 10:14:55.857526 21204 nvc_info.c:303] missing compat32 library libnvidia-glsi.so
W0808 10:14:55.857547 21204 nvc_info.c:303] missing compat32 library libnvidia-fbc.so
W0808 10:14:55.857566 21204 nvc_info.c:303] missing compat32 library libnvidia-ifr.so
W0808 10:14:55.857588 21204 nvc_info.c:303] missing compat32 library libGLX_nvidia.so
W0808 10:14:55.857602 21204 nvc_info.c:303] missing compat32 library libEGL_nvidia.so
W0808 10:14:55.857619 21204 nvc_info.c:303] missing compat32 library libGLESv2_nvidia.so
W0808 10:14:55.857641 21204 nvc_info.c:303] missing compat32 library libGLESv1_CM_nvidia.so
I0808 10:14:55.858158 21204 nvc_info.c:229] selecting /usr/bin/nvidia-smi
I0808 10:14:55.858213 21204 nvc_info.c:229] selecting /usr/bin/nvidia-debugdump
I0808 10:14:55.858270 21204 nvc_info.c:229] selecting /usr/bin/nvidia-persistenced
I0808 10:14:55.858323 21204 nvc_info.c:229] selecting /usr/bin/nvidia-cuda-mps-control
I0808 10:14:55.858376 21204 nvc_info.c:229] selecting /usr/bin/nvidia-cuda-mps-server
I0808 10:14:55.858436 21204 nvc_info.c:366] listing device /dev/nvidiactl
I0808 10:14:55.858454 21204 nvc_info.c:366] listing device /dev/nvidia-uvm
I0808 10:14:55.858475 21204 nvc_info.c:366] listing device /dev/nvidia-uvm-tools
I0808 10:14:55.858495 21204 nvc_info.c:366] listing device /dev/nvidia-modeset
I0808 10:14:55.858569 21204 nvc_info.c:270] listing ipc /run/nvidia-persistenced/socket
W0808 10:14:55.858612 21204 nvc_info.c:274] missing ipc /tmp/nvidia-mps
I0808 10:14:55.858628 21204 nvc_info.c:490] requesting device information with ''
I0808 10:14:55.864650 21204 nvc_info.c:520] listing device /dev/nvidia0 (GPU-d5951a2f-baab-0503-82b1-920531e013bc at 00000000:02:00.0)
NVRM version: 418.67
CUDA version: 10.1

Device Index: 0
Device Minor: 0
Model: Quadro M4000
Brand: Quadro
GPU UUID: GPU-d5951a2f-baab-0503-82b1-920531e013bc
Bus Location: 00000000:02:00.0
Architecture: 5.2
I0808 10:14:55.864729 21204 nvc.c:318] shutting down library context
I0808 10:14:55.865073 21206 driver.c:192] terminating driver service
I0808 10:14:55.873420 21204 driver.c:233] driver service terminated successfully


 - [ ] Kernel version from `uname -a`
`Linux gpu-server 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux`
 - [ ] Driver information from `nvidia-smi -a`
`==============NVSMI LOG==============

Timestamp                           : Thu Aug  8 10:16:40 2019
Driver Version                      : 418.67
CUDA Version                        : 10.1

Attached GPUs                       : 1
GPU 00000000:02:00.0
    Product Name                    : Quadro M4000
    Product Brand                   : Quadro
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0322316028845
    GPU UUID                        : GPU-d5951a2f-baab-0503-82b1-920531e013bc
    Minor Number                    : 0
    VBIOS Version                   : 84.04.88.00.61
    MultiGPU Board                  : No
    Board ID                        : 0x200
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G400.0501.01.03
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x13F110DE
        Bus Id                      : 00000000:02:00.0
        Sub System Id               : 0x1153103C
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 46 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : N/A
            HW Power Brake Slowdown : N/A
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 8124 MiB
        Used                        : 1 MiB
        Free                        : 8123 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 4 MiB
        Free                        : 252 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 3 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 37 C
        GPU Shutdown Temp           : 104 C
        GPU Slowdown Temp           : 99 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 20.11 W
        Power Limit                 : 120.00 W
        Default Power Limit         : 120.00 W
        Enforced Power Limit        : 120.00 W
        Min Power Limit             : 10.00 W
        Max Power Limit             : 120.00 W
    Clocks
        Graphics                    : 135 MHz
        SM                          : 135 MHz
        Memory                      : 324 MHz
        Video                       : 405 MHz
    Applications Clocks
        Graphics                    : 772 MHz
        Memory                      : 3005 MHz
    Default Applications Clocks
        Graphics                    : 772 MHz
        Memory                      : 3005 MHz
    Max Clocks
        Graphics                    : 772 MHz
        SM                          : 772 MHz
        Memory                      : 3005 MHz
        Video                       : 710 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes                       : None`
 - [ ] Docker version from `docker version`
`Client:
 Version:           18.09.8
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        0dd43dd87f
 Built:             Wed Jul 17 17:40:56 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.8
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       0dd43dd
  Built:            Wed Jul 17 17:07:25 2019
  OS/Arch:          linux/amd64
  Experimental:     false
`
 - [ ] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`
`Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                  Version         Architecture    Description
+++-=====================-===============-===============-================================================
un  libgldispatch0-nvidia <none>          <none>          (no description available)
ii  libnvidia-cfg1-418:am 418.67-0ubuntu1 amd64           NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any    <none>          <none>          (no description available)
un  libnvidia-common      <none>          <none>          (no description available)
ii  libnvidia-common-418  418.67-0ubuntu1 all             Shared files used by the NVIDIA libraries
ii  libnvidia-compute-418 418.67-0ubuntu1 amd64           NVIDIA libcompute package
ii  libnvidia-container-t 1.0.2-1         amd64           NVIDIA container runtime library (command-line t
ii  libnvidia-container1: 1.0.2-1         amd64           NVIDIA container runtime library
un  libnvidia-decode      <none>          <none>          (no description available)
ii  libnvidia-decode-418: 418.67-0ubuntu1 amd64           NVIDIA Video Decoding runtime libraries
un  libnvidia-encode      <none>          <none>          (no description available)
ii  libnvidia-encode-418: 418.67-0ubuntu1 amd64           NVENC Video Encoding runtime library
un  libnvidia-fbc1        <none>          <none>          (no description available)
ii  libnvidia-fbc1-418:am 418.67-0ubuntu1 amd64           NVIDIA OpenGL-based Framebuffer Capture runtime 
un  libnvidia-gl          <none>          <none>          (no description available)
ii  libnvidia-gl-418:amd6 418.67-0ubuntu1 amd64           NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and V
un  libnvidia-ifr1        <none>          <none>          (no description available)
ii  libnvidia-ifr1-418:am 418.67-0ubuntu1 amd64           NVIDIA OpenGL-based Inband Frame Readback runtim
un  nvidia-304            <none>          <none>          (no description available)
un  nvidia-340            <none>          <none>          (no description available)
un  nvidia-384            <none>          <none>          (no description available)
un  nvidia-390            <none>          <none>          (no description available)
ii  nvidia-compute-utils- 418.67-0ubuntu1 amd64           NVIDIA compute utilities
ii  nvidia-container-runt 3.0.0-1         amd64           NVIDIA container runtime
ii  nvidia-container-runt 1.4.0-1         amd64           NVIDIA container runtime hook
ii  nvidia-dkms-418       418.67-0ubuntu1 amd64           NVIDIA DKMS package
un  nvidia-dkms-kernel    <none>          <none>          (no description available)
un  nvidia-docker         <none>          <none>          (no description available)
ii  nvidia-docker2        2.1.0-1         all             nvidia-docker CLI wrapper
ii  nvidia-driver-418     418.67-0ubuntu1 amd64           NVIDIA driver metapackage
un  nvidia-driver-binary  <none>          <none>          (no description available)
un  nvidia-kernel-common  <none>          <none>          (no description available)
ii  nvidia-kernel-common- 418.67-0ubuntu1 amd64           Shared files used with the kernel module
un  nvidia-kernel-source  <none>          <none>          (no description available)
ii  nvidia-kernel-source- 418.67-0ubuntu1 amd64           NVIDIA kernel source package
un  nvidia-legacy-340xx-v <none>          <none>          (no description available)
ii  nvidia-modprobe       418.67-0ubuntu1 amd64           Load the NVIDIA kernel driver and create device 
un  nvidia-opencl-icd     <none>          <none>          (no description available)
un  nvidia-persistenced   <none>          <none>          (no description available)
ii  nvidia-prime          0.8.8.2         all             Tools to enable NVIDIA's Prime
ii  nvidia-settings       418.67-0ubuntu1 amd64           Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binar <none>          <none>          (no description available)
un  nvidia-smi            <none>          <none>          (no description available)
un  nvidia-utils          <none>          <none>          (no description available)
ii  nvidia-utils-418      418.67-0ubuntu1 amd64           NVIDIA driver support binaries
un  nvidia-vdpau-driver   <none>          <none>          (no description available)
ii  xserver-xorg-video-nv 418.67-0ubuntu1 amd64           NVIDIA binary Xorg driver
dpkg-query: no packages found matching *nvidia*rpm
dpkg-query: no packages found matching -qa
`
 - [ ] NVIDIA container library version from `nvidia-container-cli -V`
`version: 1.0.2
build date: 2019-03-26T03:58+00:00
build revision: ff40da533db929bf515aca59ba4c701a65a35e6b
build compiler: x86_64-linux-gnu-gcc-7 7.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
`
 - [ ] Docker command, image and tag used
     `FROM nvidia/cuda
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
RUN  apt update
RUN apt install -y software-properties-common
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt install -y  python3.7
RUN apt-get -y install python3-pip
RUN apt-get update
RUN pip3 install jupyter cupy-cuda101

CMD jupyter notebook --ip=0.0.0.0 --allow-root

`
RenaudWasTaken commented 5 years ago

Hello!

This is an issue with the plugin, feel free to open an issue with them :) It seems to be trying to isolate a nonexistent GPU: the prestart hook is called with `--device=no-gpu-has-1MiB-to-run`, which is not a device id.

My guess is that you have an error when trying to use that plugin :)
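
For anyone hitting a similar hook failure, here is a rough Python sketch of how to pull the `--device` value out of the error and check whether it even looks like a GPU id. It assumes device ids take one of the forms documented for `NVIDIA_VISIBLE_DEVICES` (`all`, a numeric index, or a `GPU-...` UUID). The helper names `requested_devices` and `looks_like_device_id` are made up for this example, and the check is only a heuristic, not the validation that libnvidia-container itself performs.

```python
import re
from typing import List

# Matches the value passed to the hook's --device flag.
DEVICE_FLAG = re.compile(r"--device=(\S+)")
# Assumed UUID shape, modelled on the UUID printed in the log above.
GPU_UUID = re.compile(r"GPU-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def requested_devices(hook_error: str) -> List[str]:
    """Return the comma-separated ids passed via --device in the hook error."""
    match = DEVICE_FLAG.search(hook_error)
    return match.group(1).split(",") if match else []


def looks_like_device_id(value: str) -> bool:
    """Heuristic: 'all', a numeric index, or a GPU UUID."""
    return (
        value == "all"
        or value.isdigit()
        or GPU_UUID.fullmatch(value) is not None
    )


# Shortened copy of the command line from the error in this issue.
error = (
    "exec command: [/usr/bin/nvidia-container-cli --load-kmods configure "
    "--ldconfig=@/sbin/ldconfig.real --device=no-gpu-has-1MiB-to-run "
    "--compute --utility"
)

for dev in requested_devices(error):
    print(dev, "looks valid" if looks_like_device_id(dev) else "is not a GPU id")
```

If the value fails this check, the string was most likely set by whatever allocated the device for the pod (here, presumably the device plugin) rather than by nvidia-docker.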