NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

Couldn't find libnvidia-ml.so library in your system #859

Closed · clythersHackers closed this issue 5 years ago

clythersHackers commented 5 years ago

1. Issue or feature description

Sorry if this is a duplicate, but I have been following circular links marking the same issue as a duplicate without actually finding a solution. A warning: I'm new to this, so the more I patch, the more worried I get that I have made a mess and will have to start again from scratch. Right now I have the nvidia-docker and container RPMs installed corresponding to my docker version 1.13, so the setup is pretty clean and pristine from the RPM installation.

There appears to be a conflict between the daemon configuration installed by the nvidia packages and the default RHEL/CentOS setup: apparently the docker daemon refuses to take the same directive both as a command-line flag and from daemon.json. It looks like it could be very simple to fix.

In short:

systemctl status docker .....

/etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration ... runtimes: (from flag: [oci], from file: map[nvidia:map[path:/usr/bin/nvidia-container-runtime runtimeArgs:[]]])
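If I read the message right, dockerd rejects any directive that is set both as a startup flag and in daemon.json, so presumably the runtimes have to be defined in only one place. A sketch of the consolidation I had in mind (the /etc/sysconfig/docker location for the flags is my assumption about Red Hat's packaging, not something I've verified):

# Define all runtimes in daemon.json only, then remove the
# --add-runtime/--default-runtime flags from the service options
# (assumed to live in /etc/sysconfig/docker on this packaging).
sudo tee /etc/docker/daemon.json <<'EOF'
{
    "default-runtime": "oci",
    "runtimes": {
        "oci": { "path": "/usr/libexec/docker/docker-runc-current" },
        "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] }
    }
}
EOF
sudo systemctl restart docker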

2. Steps to reproduce the issue

sudo systemctl restart docker

3. Information to attach (optional if deemed irrelevant)

I tried clearing daemon.json to contain only {}. Docker then runs fine, but with the default runtime oci, and it cannot start NVIDIA GPU images:

systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2018-11-14 14:06:24 GMT; 22min ago
     Docs: http://docs.docker.com
 Main PID: 2822 (dockerd-current)
    Tasks: 18 (limit: 8192)
   Memory: 78.2M
   CGroup: /system.slice/docker.service
           └─2822 /usr/bin/dockerd-current --add-runtime oci=/usr/libexec/docker/docker-runc-current --default-runtime=oci --authorization-plugin=rhel-push-plugin --containerd /run/containerd.>

flx42 commented 5 years ago

Looks like you are running Red Hat's fork of docker, so you should follow these instructions instead: https://github.com/NVIDIA/nvidia-docker#centos-7-docker-rhel-7475-docker
With that procedure you won't install the nvidia-docker2 package, so there is no daemon.json conflict.
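Roughly, the hook-based route looks like this (a sketch; follow the linked instructions for the actual repository setup, which I'm omitting here):

# With Red Hat's docker fork, the prestart hook is picked up automatically,
# so no runtime registration and no --runtime flag are needed.
sudo yum install -y nvidia-container-runtime-hook
sudo systemctl restart docker
docker run --rm nvidia/cuda:9.0-base nvidia-smi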

clythersHackers commented 5 years ago

Thanks. I carefully removed what I believe is associated with nvidia-docker2:

nvidia-docker2-2.0.3-1.docker1.13.1.noarch
nvidia-container-runtime-hook-1.3.0-1.x86_64
nvidia-container-runtime-2.0.0-1.docker1.13.1.x86_64

Then I ran the second procedure in your link... up to:

yum install -y nvidia-container-runtime-hook

This went OK and the dockerd service starts fine, but running the test produces errors (was I wrong to remove the container RPMs?):

docker run --rm nvidia/cuda:9.0-base nvidia-smi

container_linux.go:247: starting container process caused "process_linux.go:339: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=4517 /var/lib/docker/overlay2/6196a356e3cb283ade76732913a84ce62583e3b0dd4020bcf60042cfbc5b249e/merged]\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 1\n\""

/usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "process_linux.go:339: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=4517 /var/lib/docker/overlay2/6196a356e3cb283ade76732913a84ce62583e3b0dd4020bcf60042cfbc5b249e/merged]\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 1\n\"".
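If it is relevant: the --ldconfig=@/sbin/ldconfig argument in the failing command appears to come from the container CLI's configuration file. Below is what I understand the relevant setting to look like (path and excerpt as I understand the packaged defaults; the @ prefix, as far as I can tell, means ldconfig is executed from the host rather than from inside the container):

# /etc/nvidia-container-runtime/config.toml (excerpt)
[nvidia-container-cli]
ldconfig = "@/sbin/ldconfig"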

RenaudWasTaken commented 5 years ago

Hello!

Do you mind giving a bit more information so we can help you debug this?

clythersHackers commented 5 years ago

Hi, thanks for your response. Note that this is the situation now, after following the previous instructions.

uname -a:

Linux ccs1 4.18.17-300.fc29.x86_64 #1 SMP Mon Nov 5 17:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

This machine runs Fedora 29; where needed, I set environment variables to the corresponding CentOS values.

Possibly relevant output from dmesg:

[ 15.783041] nvidia: loading out-of-tree module taints kernel.
[ 15.783053] nvidia: module license 'NVIDIA' taints kernel.
[ 15.783054] Disabling lock debugging due to kernel taint
[ 15.803925] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 15.816497] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[ 15.817219] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 16.020085] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 410.72 Wed Oct 17 20:08:45 CDT 2018 (using threaded interrupts)
[ 16.074132] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 237
[ 16.112723] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 410.72 Wed Oct 17 20:07:15 CDT 2018
[ 16.125118] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 16.125121] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0

nvidia-smi -a

==============NVSMI LOG==============

Timestamp : Sun Nov 25 16:16:52 2018
Driver Version : 410.72
CUDA Version : 10.0

Attached GPUs : 1
GPU 00000000:01:00.0
    Product Name : GeForce GTX 1050 Ti
    Product Brand : GeForce
    Display Mode : Enabled
    Display Active : Enabled
    Persistence Mode : Disabled
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : GPU-f281ece7-f156-c934-7dd5-2d33d9339d43
    Minor Number : 0
    VBIOS Version : 86.07.39.00.30
    MultiGPU Board : No
    Board ID : 0x100
    GPU Part Number : N/A
    Inforom Version
        Image Version : G001.0000.01.04
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU Virtualization Mode
        Virtualization mode : None
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x01
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1C8210DE
        Bus Id : 00000000:01:00.0
        Sub System Id : 0xA45419DA
        GPU Link Info
            PCIe Generation
                Max : 1
                Current : 1
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays since reset : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 265000 KB/s
    Fan Speed : 45 %
    Performance State : P8
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        HW Thermal Slowdown : Not Active
        HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 4039 MiB
        Used : 152 MiB
        Free : 3887 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 6 MiB
        Free : 250 MiB
    Compute Mode : Default
    Utilization
        Gpu : 7 %
        Memory : 5 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending : N/A
    Temperature
        GPU Current Temp : 29 C
        GPU Shutdown Temp : 102 C
        GPU Slowdown Temp : 99 C
        GPU Max Operating Temp : N/A
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : N/A
        Power Limit : 75.00 W
        Default Power Limit : 75.00 W
        Enforced Power Limit : 75.00 W
        Min Power Limit : 52.50 W
        Max Power Limit : 75.00 W
    Clocks
        Graphics : 139 MHz
        SM : 139 MHz
        Memory : 405 MHz
        Video : 544 MHz
    Applications Clocks
        Graphics : N/A
        Memory : N/A
    Default Applications Clocks
        Graphics : N/A
        Memory : N/A
    Max Clocks
        Graphics : 1923 MHz
        SM : 1923 MHz
        Memory : 3504 MHz
        Video : 1708 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Processes
        Process ID : 1064
            Type : G
            Name : /usr/libexec/Xorg
            Used GPU Memory : 80 MiB
        Process ID : 1657
            Type : G
            Name : /usr/bin/kwin_x11
            Used GPU Memory : 22 MiB
        Process ID : 1664
            Type : G
            Name : /usr/bin/krunner
            Used GPU Memory : 1 MiB
        Process ID : 1667
            Type : G
            Name : /usr/bin/plasmashell
            Used GPU Memory : 44 MiB

docker version

Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-62.git9cb56fd.fc29.x86_64
 Go version:      go1.11beta2
 Git commit:      accfe55-unsupported
 Built:           Wed Jul 25 18:54:07 2018
 OS/Arch:         linux/amd64

Server:
 Version:         1.13.1
 API version:     1.26 (minimum version 1.12)
 Package version: docker-1.13.1-62.git9cb56fd.fc29.x86_64
 Go version:      go1.11beta2
 Git commit:      accfe55-unsupported
 Built:           Wed Jul 25 18:54:07 2018
 OS/Arch:         linux/amd64
 Experimental:    false

rpm -qa 'nvidia'

nvidia-xconfig-410.72-1.fc27.x86_64
nvidia-driver-NvFBCOpenGL-410.72-1.fc27.x86_64
nvidia-libXNVCtrl-devel-410.72-1.fc27.x86_64
nvidia-container-runtime-hook-1.4.0-2.x86_64
kmod-nvidia-4.19.3-300.fc29.x86_64-410.72-1.fc29.x86_64
nvidia-driver-cuda-libs-410.72-1.fc27.x86_64
nvidia-driver-libs-410.72-1.fc27.x86_64
akmod-nvidia-410.72-1.fc27.x86_64
nvidia-settings-410.72-1.fc27.x86_64
nvidia-libXNVCtrl-410.72-1.fc27.x86_64
nvidia-driver-NVML-410.72-1.fc27.x86_64
libnvidia-container-tools-1.0.0-1.x86_64
nvidia-driver-devel-410.72-1.fc27.x86_64
nvidia-driver-cuda-410.72-1.fc27.x86_64
kmod-nvidia-4.18.17-300.fc29.x86_64-410.72-1.fc29.x86_64
nvidia-driver-410.72-1.fc27.x86_64
nvidia-persistenced-410.72-1.fc27.x86_64
nvidia-modprobe-410.72-1.fc27.x86_64
libnvidia-container1-1.0.0-1.x86_64

nvidia-container-cli -V

version: 1.0.0
build date: 2018-09-20T20:25+0000
build revision: 881c88e2e5bb682c9bb14e68bd165cfb64563bb1
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-28)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

docker run --rm nvidia/cuda:9.0-base nvidia-smi

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system. Please also try adding directory that contains libnvidia-ml.so to your system PATH.

I tried looking for libnvidia-ml.so:

sudo find / -name libnvidia-ml.so
/usr/local/cuda-10.0/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/var/lib/docker/overlay2/53747a6cff62ecaa574033dc954eaf3b2877ad372dafa63687808ba82b4d259a/diff/usr/local/cuda-10.0/targets/x86_64-linux/lib/stubs/libnvidia-ml.so

Adding the stubs dir to the end of my PATH/LD_LIBRARY_PATH doesn't help, but in any case I'm not sure a setting made outside the container would carry over into it.
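In case it helps, here are two more checks I could run (the commands are standard; the expectations in the comments are my own guesses):

# Is the real driver library registered in the host's ldcache?
ldconfig -p | grep libnvidia-ml
# What does the container CLI itself detect on this machine?
nvidia-container-cli -k -d /dev/tty info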

clythersHackers commented 5 years ago

I'm guessing the problem relates to CUDA 10 being installed on the host while the docker command uses a 9.0 image. That was ultimately my reason for using a container in the first place: I didn't want the trouble of downgrading CUDA for TensorFlow, so I wanted to use the ready-made NGC container.

RenaudWasTaken commented 5 years ago

Hello!

Sorry for the late reply. CUDA is backwards compatible. Here it seems like we aren't finding libnvidia-ml.so on your system.

Given the symptoms, you are probably stumbling in the issue that was fixed by this commit: NVIDIA/libnvidia-container@deccb28

You would need to wait for the next release of the library (around mid-January) or build the library by hand.
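If you do want to try the build route, the rough shape is something like this (a sketch only; the repo's README has the actual prerequisites and install targets, and the checkout below just needs to include the fix referenced above):

# Build and install libnvidia-container with the ldconfig fix included.
git clone https://github.com/NVIDIA/libnvidia-container.git
cd libnvidia-container
git checkout deccb28   # the fix above, or any later commit
make
sudo make install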

ljh2057 commented 5 years ago

1. docker run -it --rm --runtime=nvidia tensorflow/tensorflow:latest-gpu-py3 bash
2. ldconfig
3. nvidia-smi

Then open a new console and find your container id:

4. docker ps

Then save it to disk:

5. docker commit 4f0d5870605f tensorflow/tensorflow:gpu_fixed

Then you can use the new image which you saved; in this container you can execute 'nvidia-smi' and use tensorflow-gpu.
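For example (spelling the last step out), a container from the committed image can then be started directly:

docker run -it --rm --runtime=nvidia tensorflow/tensorflow:gpu_fixed bash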

RenaudWasTaken commented 5 years ago

This should be fixed with the latest version of the libnvidia-container packages. Closing, feel free to reopen if the bug persists.