clythersHackers closed this issue 5 years ago
Looks like you are running Red Hat's fork of docker, so you should follow these instructions instead:
https://github.com/NVIDIA/nvidia-docker#centos-7-docker-rhel-7475-docker
You won't install the nvidia-docker2 package, so no daemon.json conflict.
Thanks, so I carefully removed what I believe is associated with nvidia-docker2:
- nvidia-docker2-2.0.3-1.docker1.13.1.noarch
- nvidia-container-runtime-hook-1.3.0-1.x86_64
- nvidia-container-runtime-2.0.0-1.docker1.13.1.x86_64
Then ran the second procedure in your link... up to:
yum install -y nvidia-container-runtime-hook
This went OK and the dockerd service starts OK, but running the test gives errors. (Was I wrong to remove the container RPMs?)
docker run --rm nvidia/cuda:9.0-base nvidia-smi
/usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "process_linux.go:339: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=4517 /var/lib/docker/overlay2/6196a356e3cb283ade76732913a84ce62583e3b0dd4020bcf60042cfbc5b249e/merged]\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 1\n\"".
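One thing worth checking (an editor's debugging sketch, not part of the linked instructions): the hook is failing because `/sbin/ldconfig` exited with status 1, so it helps to see whether ldconfig itself runs cleanly on the host.

```shell
# Run ldconfig outside the hook to see whether it fails on the host too.
# -n limits processing to the given directory, so this is non-destructive.
/sbin/ldconfig -n /tmp 2>&1
echo "ldconfig exit status: $?"
```

Note that the hook runs `/sbin/ldconfig` against the container rootfs, so a clean exit on the host doesn't fully rule out the in-container failure, but a nonzero status here would point at a host-side problem.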
Hello!
Do you mind giving a bit more information so we can help you debug this:
uname -a
dmesg
nvidia-smi -a
docker version
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
nvidia-container-cli -V
docker run --rm nvidia/cuda:9.0-base nvidia-smi
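For convenience, the requested commands can be gathered into a single report with a small wrapper script (a sketch; the `run` helper and the report filename are my own, and commands that aren't installed are noted rather than aborting the script):

```shell
#!/bin/sh
# Collect the diagnostics listed above into one report file.
report=debug-report.txt
: > "$report"

run() {
  echo "===== $* =====" >> "$report"
  # Append the command's output; note failures instead of stopping.
  "$@" >> "$report" 2>&1 || echo "(command failed or not installed)" >> "$report"
}

run uname -a
run dmesg
run nvidia-smi -a
run docker version
run rpm -qa '*nvidia*'
run nvidia-container-cli -V
```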
Hi, thanks for your response. Note this is the situation now, after following the previous instructions.
Linux ccs1 4.18.17-300.fc29.x86_64 #1 SMP Mon Nov 5 17:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Running Fedora 29; where needed, I set environment variables to the corresponding CentOS values.
[ 15.783041] nvidia: loading out-of-tree module taints kernel.
[ 15.783053] nvidia: module license 'NVIDIA' taints kernel.
[ 15.783054] Disabling lock debugging due to kernel taint
[ 15.803925] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 15.816497] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[ 15.817219] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 16.020085] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 410.72 Wed Oct 17 20:08:45 CDT 2018 (using threaded interrupts)
[ 16.074132] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 237
[ 16.112723] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 410.72 Wed Oct 17 20:07:15 CDT 2018
[ 16.125118] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 16.125121] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
==============NVSMI LOG==============
Timestamp : Sun Nov 25 16:16:52 2018
Driver Version : 410.72
CUDA Version : 10.0
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : GeForce GTX 1050 Ti
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-f281ece7-f156-c934-7dd5-2d33d9339d43
Minor Number : 0
VBIOS Version : 86.07.39.00.30
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1C8210DE
Bus Id : 00000000:01:00.0
Sub System Id : 0xA45419DA
GPU Link Info
PCIe Generation
Max : 1
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 265000 KB/s
Fan Speed : 45 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 4039 MiB
Used : 152 MiB
Free : 3887 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 6 MiB
Free : 250 MiB
Compute Mode : Default
Utilization
Gpu : 7 %
Memory : 5 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 29 C
GPU Shutdown Temp : 102 C
GPU Slowdown Temp : 99 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : N/A
Power Limit : 75.00 W
Default Power Limit : 75.00 W
Enforced Power Limit : 75.00 W
Min Power Limit : 52.50 W
Max Power Limit : 75.00 W
Clocks
Graphics : 139 MHz
SM : 139 MHz
Memory : 405 MHz
Video : 544 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 1923 MHz
SM : 1923 MHz
Memory : 3504 MHz
Video : 1708 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 1064
Type : G
Name : /usr/libexec/Xorg
Used GPU Memory : 80 MiB
Process ID : 1657
Type : G
Name : /usr/bin/kwin_x11
Used GPU Memory : 22 MiB
Process ID : 1664
Type : G
Name : /usr/bin/krunner
Used GPU Memory : 1 MiB
Process ID : 1667
Type : G
Name : /usr/bin/plasmashell
Used GPU Memory : 44 MiB
Client:
 Version: 1.13.1
 API version: 1.26
 Package version: docker-1.13.1-62.git9cb56fd.fc29.x86_64
 Go version: go1.11beta2
 Git commit: accfe55-unsupported
 Built: Wed Jul 25 18:54:07 2018
 OS/Arch: linux/amd64

Server:
 Version: 1.13.1
 API version: 1.26 (minimum version 1.12)
 Package version: docker-1.13.1-62.git9cb56fd.fc29.x86_64
 Go version: go1.11beta2
 Git commit: accfe55-unsupported
 Built: Wed Jul 25 18:54:07 2018
 OS/Arch: linux/amd64
 Experimental: false
nvidia-xconfig-410.72-1.fc27.x86_64
nvidia-driver-NvFBCOpenGL-410.72-1.fc27.x86_64
nvidia-libXNVCtrl-devel-410.72-1.fc27.x86_64
nvidia-container-runtime-hook-1.4.0-2.x86_64
kmod-nvidia-4.19.3-300.fc29.x86_64-410.72-1.fc29.x86_64
nvidia-driver-cuda-libs-410.72-1.fc27.x86_64
nvidia-driver-libs-410.72-1.fc27.x86_64
akmod-nvidia-410.72-1.fc27.x86_64
nvidia-settings-410.72-1.fc27.x86_64
nvidia-libXNVCtrl-410.72-1.fc27.x86_64
nvidia-driver-NVML-410.72-1.fc27.x86_64
libnvidia-container-tools-1.0.0-1.x86_64
nvidia-driver-devel-410.72-1.fc27.x86_64
nvidia-driver-cuda-410.72-1.fc27.x86_64
kmod-nvidia-4.18.17-300.fc29.x86_64-410.72-1.fc29.x86_64
nvidia-driver-410.72-1.fc27.x86_64
nvidia-persistenced-410.72-1.fc27.x86_64
nvidia-modprobe-410.72-1.fc27.x86_64
libnvidia-container1-1.0.0-1.x86_64
nvidia-container-cli -V
version: 1.0.0
build date: 2018-09-20T20:25+0000
build revision: 881c88e2e5bb682c9bb14e68bd165cfb64563bb1
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-28)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system. Please also try adding directory that contains libnvidia-ml.so to your system PATH.
I tried looking for libnvidia-ml.so:
sudo find / -name libnvidia-ml.so
/usr/local/cuda-10.0/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/var/lib/docker/overlay2/53747a6cff62ecaa574033dc954eaf3b2877ad372dafa63687808ba82b4d259a/diff/usr/local/cuda-10.0/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
Adding the stubs dir to the end of my PATH/LD_LIBRARY_PATH doesn't help, but in any case I'm not sure it would help in the environment outside the container.
I'm guessing the problem relates to CUDA 10 being installed while the docker image uses 9.0, which ultimately was my reason for using a container in the first place. I didn't want the trouble of downgrading CUDA for TensorFlow, so I wanted to use the ready-made NGC container.
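Worth noting here: the dynamic loader resolves libnvidia-ml.so through the ldconfig cache and LD_LIBRARY_PATH, never PATH, and the files found above are CUDA's no-op stub libraries rather than the real driver library. A quick check of what the cache actually knows about (an editor's sketch):

```shell
# List cached entries for libnvidia-ml; on a working driver install this
# should point at the real driver library, not the CUDA stubs directory.
ldconfig -p | grep libnvidia-ml || echo "libnvidia-ml not in ld cache"
```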
Hello!
Sorry for the late reply.
CUDA is backwards compatible. Here it seems like we aren't finding libnvidia-ml.so on your system.
Given the symptoms, you are probably stumbling on the issue that was fixed by this commit: NVIDIA/libnvidia-container@deccb28
You would need to wait for the next release of the library (around mid-January) or build the library by hand.
1. docker run -it --rm --runtime=nvidia tensorflow/tensorflow:latest-gpu-py3 bash
2. ldconfig
3. nvidia-smi
Then open a new console and run:
4. docker ps (find your container id), then save it to disk:
5. docker commit 4f0d5870605f tensorflow/tensorflow:gpu_fixed
Then you can use the new image you saved; in this container you can execute 'nvidia-smi' and use tensorflow-gpu.
This should be fixed with the latest version of the libnvidia-container packages. Closing, feel free to reopen if the bug persists.
1. Issue or feature description
Sorry if this is a duplicate, but I have been following circular links marking the same issue as a duplicate without actually finding a solution. Warning: I'm new to this, so the more I patch, the more worried I get that I have made a mess and will have to start again from scratch. Right now I have the nvidia docker and container RPMs installed corresponding to my docker version 1.13, so it's pretty clean and pristine from the RPM installation.
There appears to be a conflict between the nvidia-installed docker daemon configuration and the default RHEL/CentOS daemon.json: apparently dockerd doesn't accept the same directive both on the command line and in daemon.json. It looks like it could be very simple to fix.
In short:
systemctl status docker .....
/etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration ... runtimes: (from flag: [oci], from file: map[nvidia:map[path:/usr/bin/nvidia-container-runtime runtimeArgs:[]]])
2. Steps to reproduce the issue
sudo systemctl restart docker
3. Information to attach (optional if deemed irrelevant)
I tried clearing daemon.json to contain only {}. Then docker runs fine, but with the default runtime oci, and it cannot start nvidia GPU images.
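For reference, an alternative to emptying the file is to define the runtimes in daemon.json only and strip the corresponding flags from the service's command line. A sketch only: the paths below are copied from the output pasted in this thread, and I haven't verified that Red Hat's docker fork honours these directives in daemon.json.

```json
{
    "runtimes": {
        "oci": {
            "path": "/usr/libexec/docker/docker-runc-current"
        },
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "oci"
}
```

The conflict message is dockerd refusing to merge a "runtimes" directive that arrives both as a flag and from the file, so the fix is to keep it in exactly one place.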
systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2018-11-14 14:06:24 GMT; 22min ago
     Docs: http://docs.docker.com
 Main PID: 2822 (dockerd-current)
    Tasks: 18 (limit: 8192)
   Memory: 78.2M
   CGroup: /system.slice/docker.service
           └─2822 /usr/bin/dockerd-current --add-runtime oci=/usr/libexec/docker/docker-runc-current --default-runtime=oci --authorization-plugin=rhel-push-plugin --containerd /run/containerd.>
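On Red Hat's fork, the flag side of the conflict (the `--add-runtime oci=...` and `--default-runtime=oci` visible in the dockerd command line above) usually comes from `OPTIONS` in `/etc/sysconfig/docker`; that path is an assumption for this distribution. A quick way to see which runtime flags are set there:

```shell
# Show any runtime-related flags configured outside daemon.json.
grep -E -- '--add-runtime|--default-runtime' /etc/sysconfig/docker 2>/dev/null \
  || echo "no runtime flags found in /etc/sysconfig/docker"
```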