In the past we had issues with the GTX 690: #206. We are not using the same function calls in 2.0, but maybe it's related.
Can you copy the output of nvidia-smi -q?
Hi, sure! It appears to be the same!
Just to let you know, I am running Ubuntu 16.04 LTS. On this machine I only have CUDA 9.0 and cuDNN 7, but in the container I have CUDA 8.0 and cuDNN 6 installed.
$ uname -r
4.11.0-14-lowlatency
$ dmesg | grep -i nvidia
[ 1.175191] nvidia: loading out-of-tree module taints kernel.
[ 1.175298] nvidia: module license 'NVIDIA' taints kernel.
[ 1.187647] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1.192626] nvidia-nvlink: Nvlink Core is being initialized, major device number 245
[ 1.193015] nvidia 0000:04:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 1.193350] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.90 Tue Sep 19 19:17:35 PDT 2017 (using threaded interrupts)
[ 1.195440] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 384.90 Tue Sep 19 17:05:19 PDT 2017
[ 1.196231] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[ 1.196330] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 0
[ 1.196544] [drm] [nvidia-drm] [GPU ID 0x00000500] Loading driver
[ 1.196648] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:05:00.0 on minor 1
[ 4.774623] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 243
[ 5.082426] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.0/0000:02:00.0/0000:03:08.0/0000:04:00.1/sound/card1/input23
[ 5.082523] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.0/0000:02:00.0/0000:03:08.0/0000:04:00.1/sound/card1/input24
[ 5.082614] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/0000:02:00.0/0000:03:08.0/0000:04:00.1/sound/card1/input25
[ 5.082666] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.0/0000:02:00.0/0000:03:10.0/0000:05:00.1/sound/card2/input22
[ 6.893867] nvidia-modeset: Allocated GPU:0 (GPU-8d95715f-6f13-b43f-cf13-a49a24d5a88b) @ PCI:0000:04:00.0
[ 6.894342] nvidia-modeset: Allocated GPU:1 (GPU-36f5b8ae-6b82-7edd-67f0-e5b88c16adc5) @ PCI:0000:05:00.0
[UPDATED]: output posted on pastebin: Output: nvidia-smi -q
Hi again,
Also, I couldn't find 'nvidia-docker-plugin' on the system. I installed nvidia-docker2 from the binary package, but it appears the plugin wasn't installed.
There is no more nvidia-docker-plugin with v2.
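With v2 the GPUs are wired in through Docker's nvidia runtime and the NVIDIA_VISIBLE_DEVICES variable instead, e.g. (image tag and device index here are just examples):
docker run -it --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  --rm nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 nvidia-smi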
I compiled the code snippet you suggested in #206:
$ ./nvml_crash
terminate called after throwing an instance of 'std::runtime_error'
  what():  nvmlDeviceGetTopologyCommonAncestor(dev1, dev2, &topo) error: 999
[1] 20176 abort (core dumped) ./nvml_crash
Indeed, it looks like the NVML bug. Can you try to launch a CUDA sample (e.g. deviceQuery) to see if that works?
docker run -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 \
--rm nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
# apt-get update && apt-get install --no-install-recommends cuda-samples-8-0
# cd /usr/local/cuda/samples/1_Utilities/deviceQuery && make
# ./deviceQuery
Hi again,
I did the following tests:
I got strange behaviour running nvidia-smi with the environment variable NVIDIA_VISIBLE_DEVICES set to 0, 1 or all, in two different cases:
I have noticed that setting NVIDIA_VISIBLE_DEVICES=1 from the "terminal only" forces nvidia-smi to respond as if 0 had been set, but doing the same thing from a terminal in the desktop environment (lightdm active) causes the "unknown error". Weird.
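Just to be explicit about the two cases: "terminal only" means lightdm stopped, "desktop environment" means lightdm running, roughly:
sudo service lightdm stop    # "terminal only": no X server using the GPUs
docker run -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 \
  --rm nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 nvidia-smi
sudo service lightdm start   # back to the "desktop environment" case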
So I decided to try newer drivers and updated from nvidia-384.90 (bundled with CUDA 9.0) to nvidia-384.98, released 11/02/2017, but nothing changed.
Then I executed the scripts you suggested, also switching the environments; here are the results:
From the terminal only and from a terminal in the desktop environment, both gave me the same result:
$ ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 690"
CUDA Driver Version / Runtime Version 9.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 1997 MBytes (2094202880 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 1020 MHz (1.02 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 690
Result = PASS
See that from a terminal in the desktop environment, device 1 fails:
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
See that from the TERMINAL ONLY, device 0 is returned:
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 690"
CUDA Driver Version / Runtime Version 9.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 1999 MBytes (2096300032 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 1020 MHz (1.02 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 5 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 690
Result = PASS
See that from the terminal ONLY and from a terminal in the desktop environment, both devices are returned:
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "GeForce GTX 690"
CUDA Driver Version / Runtime Version 9.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2046 MBytes (2145189888 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 1020 MHz (1.02 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GTX 690"
CUDA Driver Version / Runtime Version 9.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2046 MBytes (2145189888 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 1020 MHz (1.02 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 5 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

> Peer access from GeForce GTX 690 (GPU0) -> GeForce GTX 690 (GPU1) : No
> Peer access from GeForce GTX 690 (GPU1) -> GeForce GTX 690 (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 2, Device0 = GeForce GTX 690, Device1 = GeForce GTX 690
Result = PASS
I haven't tested with an earlier driver (nvidia-375) yet, but I tested CUDA 8.0 with the recent driver nvidia-384.90 and got the same behaviour.
Can you provide:
- The log output of nvidia-container-runtime: edit /etc/nvidia-container-runtime/config.toml, uncomment debug=..., and run the container (drag and drop the file in argument here); example commands follow below.
- The output of findmnt and cat /sys/fs/cgroup/devices/devices.list inside the container that fails.
- The output of nvidia-smi -q and nvidia-smi outside the container after reproducing the failure.
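For example, something along these lines should collect all of it (assuming the debug line in config.toml is only commented out with a leading '#'; adjust paths if your packages differ):
# on the host: enable the hook log, then reproduce the failure
sudo sed -i 's/^#debug/debug/' /etc/nvidia-container-runtime/config.toml
docker run -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 \
  --rm nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 nvidia-smi

# inside the failing container
findmnt
cat /sys/fs/cgroup/devices/devices.list

# on the host, after reproducing the failure
nvidia-smi -q
nvidia-smi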
Hi again, sorry about the delay getting back to you. So, using the current driver (384.90):
1.1 - trying to access only device 0 of GTX 690:
sudo rm /var/log/nvidia-container-runtime-hook.log && \
docker run -it \
--runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=0 \
--rm nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 \
nvidia-smi && \
cat /var/log/nvidia-container-runtime-hook.log > ~/dump-log-device0.txt

Tue Nov 14 15:02:32 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 00000000:04:00.0 N/A | N/A |
| 33% 45C P8 N/A / N/A | 884MiB / 2045MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+

Generated content file: dump-log-device0.txt
1.2 - trying to access only device 1 of GTX 690:
sudo rm /var/log/nvidia-container-runtime-hook.log && \
docker run -it \
--runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=1 \
--rm nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 \
nvidia-smi && \
cat /var/log/nvidia-container-runtime-hook.log > ~/dump-log-device1.txt

Unable to determine the device handle for GPU 0000:05:00.0: Unknown Error

Generated content file: dump-log-device1.txt
One strange behaviour here: if I disable lightdm and run everything on the terminal only, trying to access device 1 automatically substitutes device 0! The same behaviour exists with the new driver 384.98, but I couldn't successfully install the earlier 375.xx driver; NVIDIA doesn't allow me to (I tried to block installation of the newer ones without success).
2.1 - findmnt
sh -c "docker run -it \
--runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=1 \
--rm nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 \
findmnt;" > ~/dump-findmnt-device1.txt

Generated content file: dump-findmnt-device1.txt
2.2 - cat /sys/fs/cgroup/devices/devices.list
sh -c "docker run -it \
--runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=1 \
--rm nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 \
cat /sys/fs/cgroup/devices/devices.list;" > ~/dump-device-list-device1.txt

Generated content file: dump-findmnt-device1.txt
sh -c "nvidia-smi && nvidia-smi -q;" > ~/dump-nvidia-smi-host.txt

Generated content file: dump-nvidia-smi-host.txt
Can you double check that the log is indeed not there? This shouldn't happen.
Also, it seems like you ran nvidia-smi from within the container, not on the host.
And device 1 is not substituted; devices are renumbered inside the container, that's expected.
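For example, with NVIDIA_VISIBLE_DEVICES=1 the only visible GPU shows up as index 0 inside the container, but it keeps the host bus ID (05:00.0 in your case); when the device is healthy you can confirm it with something like:
docker run -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 \
  --rm nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 \
  nvidia-smi --query-gpu=index,pci.bus_id --format=csv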
Hi again. You're right, sorry about that! I captured the log file manually this time.
Thank you
sudo rm /var/log/nvidia-container-runtime-hook.log
docker run -it --runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=1 \
--rm nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 nvidia-smi
cat /var/log/nvidia-container-runtime-hook.log > ~/dump-log-device1.txt
Generated content file : dump-log-device1.txt
Can you double check that the log is indeed not there? This shouldn't happen. Also, it seems like you ran nvidia-smi from within the container, not on the host.
Yeah, I remember this file being populated in the past, with another CUDA version. My CUDA Toolkit installation came from NVIDIA's CUDA 9 local repository deb file option.
sh -c "nvidia-smi && nvidia-smi -q;" > ~/new-smi-from-host.txt
content file : new-dump-smi-from-host.txt
Do you know any way I can block the new driver from being installed? Then I could jump back to CUDA 8.0 with the earlier drivers and do the tests again...
I'm afraid this is the same driver issue as #206. Can you launch the container with NVIDIA_VISIBLE_DEVICES=all, run the following program and give me back the output?
docker run -it --runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=all \
--rm nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
cat > sample.cu <<EOF
#include <cuda.h>
#include <stdio.h>
#include <assert.h>
int main()
{
CUdevice dev;
int n;
assert(cuInit(0) == CUDA_SUCCESS);
assert(cuDeviceGet(&dev, 0) == CUDA_SUCCESS);
assert(cuDeviceGetAttribute(&n, CU_DEVICE_ATTRIBUTE_MULTI_GPU_BOARD, dev) == CUDA_SUCCESS);
printf("%d\n", n);
}
EOF
nvcc sample.cu -lcuda
./a.out
Hi... Yes, I also believe that it is a driver problem. The output of the script was:
root@5a2d07ac7658:/# ./a.out
1
Interesting, can you try all the other configurations to see if the result is similar (0, 1, with and without desktop)?
Sure,
I generated a container image based on nvidia/cuda:
FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
ADD sample.cu /
RUN nvcc sample.cu -lcuda
CMD nvidia-smi && /a.out
Then I executed the six cases with nvidia-smi && /a.out :
Note that in the Terminal-1 case it does not fail but switches to GPU 0 (see the last output below).
Desktop-all:
Wed Nov 15 02:22:48 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 00000000:04:00.0 N/A | N/A |
| 36% 50C P8 N/A / N/A | 551MiB / 2045MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 690 Off | 00000000:05:00.0 N/A | N/A |
| 34% 46C P8 N/A / N/A | 551MiB / 2045MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
+-----------------------------------------------------------------------------+
1
Desktop-0:
Wed Nov 15 02:22:57 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 00000000:04:00.0 N/A | N/A |
| 36% 50C P8 N/A / N/A | 551MiB / 2045MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
1
Desktop-1:
Unable to determine the device handle for GPU 0000:05:00.0: Unknown Error
Terminal-all:
Wed Nov 15 02:20:14 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 00000000:04:00.0 N/A | N/A |
| 37% 53C P0 N/A / N/A | 0MiB / 1997MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 690 Off | 00000000:05:00.0 N/A | N/A |
| 34% 48C P0 N/A / N/A | 0MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
+-----------------------------------------------------------------------------+
1
Terminal-0:
Wed Nov 15 02:20:02 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 00000000:04:00.0 N/A | N/A |
| 37% 53C P0 N/A / N/A | 0MiB / 1997MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
1
Terminal-1:
Wed Nov 15 02:19:50 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 00000000:05:00.0 N/A | N/A |
| 33% 48C P0 N/A / N/A | 0MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
1
[UPDATED]: Desktop-1 was incorrect
Does the sample work for Desktop-1? It's not being executed because of your &&
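A quick shell illustration of the short-circuit:
nvidia-smi && /a.out   # /a.out runs only if nvidia-smi exits with status 0
nvidia-smi ;  /a.out   # /a.out runs regardless of the nvidia-smi result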
Hi again, Sorry about the confusion
FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
ADD sample.cu /
RUN nvcc sample.cu -lcuda
CMD /a.out
$ docker run -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 --rm cyberwillis/debug:latest ./a.out

Result:
a.out: sample.cu:10: int main(): Assertion `cuInit(0) == CUDA_SUCCESS' failed.
I believe I found a way to roll back to an older driver; I will try it now!
[UPDATE]: Drivers I tested. I installed these drivers and used the generated container to try to access exactly GPU 1:
NVIDIA-Linux-x86_64-304.137.run (nvidia-smi forces the GPU fan and its process never stops)
NVIDIA-Linux-x86_64-340.104.run (the container claims it needs CUDA >= 8.0)
NVIDIA-Linux-x86_64-370.28.run (same behaviour)
NVIDIA-Linux-x86_64-375.82.run (same behaviour)
NVIDIA-Linux-x86_64-384.98.run (same behaviour)
NVIDIA-Linux-x86_64-387.12.run (same behaviour)
I conclude that this is either a bug that was never solved for the GTX 690, or a BIOS-related problem.
Yes, I will report the bug internally and will update this issue once we know more about it.
Hello!
Thanks for opening this issue. Looking at this after some time, it looks like the internal bug was closed soon after. This should have been fixed in recent (or even slightly older) releases.
Sorry to disappoint. I still have this card in another machine and it still has the same problem; I am already using driver 415.27.
This is unfortunate, I'll re-open the bug internally.
Hello @cyberwillis !
We are trying to get a repro of this bug internally but are having a hard time getting our hands on a GTX 690.
Do you think you could hand us a log of the nvml crash? To do this, you need to run the NVML application with these environment variables set:
__NVML_DBG_LVL=DEBUG
__NVML_DBG_FILE=/tmp/nvml.log
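For example, to grab a log from nvidia-smi (any NVML client works the same way):
__NVML_DBG_LVL=DEBUG __NVML_DBG_FILE=/tmp/nvml.log nvidia-smi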
Thanks!
Hi @RenaudWasTaken, sorry I am late answering your question, but here it is. Thank you.
docker run -it --runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=all \
-e __NVML_DBG_LVL=DEBUG \
-e __NVML_DBG_FILE=/tmp/nvml.log \
--name cuda10 \
--rm \
nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
# nvidia-smi
Accessing all the cards (0, 1):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 00000000:04:00.0 N/A | N/A |
| 35% 49C P8 N/A / N/A | 690MiB / 1998MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 690 Off | 00000000:05:00.0 N/A | N/A |
| 33% 46C P8 N/A / N/A | 690MiB / 1998MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
+-----------------------------------------------------------------------------+
docker run -it --runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=0 \
-e __NVML_DBG_LVL=DEBUG \
-e __NVML_DBG_FILE=/tmp/nvml.log \
--name cuda10 \
--rm \
nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
# nvidia-smi
Answer: card0
Tue Feb 26 13:19:51 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 00000000:04:00.0 N/A | N/A |
| 35% 48C P8 N/A / N/A | 690MiB / 1998MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
docker run -it --runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=1 \
-e __NVML_DBG_LVL=DEBUG \
-e __NVML_DBG_FILE=/tmp/nvml.log \
--name cuda10 \
--rm \
nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
# nvidia-smi
Answer: card1
Unable to determine the device handle for GPU 0000:05:00.0: Unknown Error
@RenaudWasTaken Now running the following code on each container: Code
cat <<EOF | tee nvml.cc
#include <stdexcept>
#include <string>
#include <nvml.h>
// throw a runtime_error when an NVML call does not return NVML_SUCCESS
#define NVML_CALL(call) \
do { \
nvmlReturn_t ret = call; \
if (ret != NVML_SUCCESS) throw std::runtime_error(std::string(#call) + " error: " + std::to_string(ret)); \
} while (0)
int main()
{
NVML_CALL(nvmlInit());
nvmlDevice_t dev1, dev2;
NVML_CALL(nvmlDeviceGetHandleByIndex(0, &dev1));
NVML_CALL(nvmlDeviceGetHandleByIndex(1, &dev2));
nvmlGpuTopologyLevel_t topo;
NVML_CALL(nvmlDeviceGetTopologyCommonAncestor(dev1, dev2, &topo));
}
EOF
g++ -std=c++11 -I /usr/local/cuda/include nvml.cc -lnvidia-ml -o nvml
./nvml
RESULT
terminate called after throwing an instance of 'std::runtime_error'
what(): nvmlDeviceGetTopologyCommonAncestor(dev1, dev2, &topo) error: 3
Aborted (core dumped)
nvml.cc-cardall-nvml.log.tar.gz
RESULT
terminate called after throwing an instance of 'std::runtime_error'
what(): nvmlDeviceGetHandleByIndex(1, &dev2) error: 2
Aborted (core dumped)
RESULT
terminate called after throwing an instance of 'std::runtime_error'
what(): nvmlDeviceGetHandleByIndex(0, &dev1) error: 999
Aborted (core dumped)
@cyberwillis it seems like your card is an SLI slave. The easy workaround for this would be to unlink the SLI device if you need to pass the devices to different containers.
Let me know if this works for you :)
Hi @RenaudWasTaken, thank you for your fast reply.
I believe there is no way to un-SLI the GTX 690; it's a card that has two 680s in SLI mode by default. Before even doing those experiments I executed this line and restarted the computer, just to make sure:
sudo nvidia-xconfig --multigpu=off --sli=off
Although my configuration before and after the restart, when the X server is active, is the following:
Using X configuration file: "/etc/X11/xorg.conf".
ServerLayout "Layout0"
|
|--> Screen "Screen0"
| |
| |--> Monitor "Monitor0"
| | |
| | |--> VendorName "Unknown"
| | |--> ModelName "DELL U2312HM"
| | |--> HorizSync 30.0-83.0
| | |--> VertRefresh 56.0-76.0
| | |--> Option "DPMS"
| |
| |--> Device "Device0"
| | |--> Driver "nvidia"
| | |--> VendorName "NVIDIA Corporation"
| | |--> BoardName "GeForce GTX 690"
| |
| |--> Option "Coolbits" "4"
| |--> Option "Stereo" "0"
| |--> Option "nvidiaXineramaInfoOrder" "DFP-0"
| |--> Option "metamodes" "GPU-8d95715f-6f13-b43f-cf13-a49a24d5a88b.GPU-0.DVI-I-1: nvidia-auto-select +0+0, GPU-8d95715f-6f13-b43f-cf13-a49a24d5a88b.GPU-0.DVI-D-0: nvidia-auto-select +1920+0, GPU-36f5b8ae-6b82-7edd-67f0-e5b88c16adc5.GPU-1.DVI-I-1: nvidia-auto-select +3840+0"
| |--> Option "BaseMosaic" "on"
| |--> Option "Clone" "off"
| |--> Option "MultiGPU" "off"
| |--> Option "SLI" "off"
| |--> DefaultColorDepth 24
|
|--> InputDevice "Keyboard0"
| |
| |--> Driver "kbd"
| |--> Option "CoreKeyboard"
|
|--> InputDevice "Mouse0"
| |
| |--> Driver "mouse"
| |--> Option "Protocol" "auto"
| |--> Option "Device" "/dev/psaux"
| |--> Option "Emulate3Buttons" "no"
| |--> Option "ZAxisMapping" "4 5"
| |--> Option "CorePointer"
|
|--> Option "Xinerama" "0"
As you can see, only Mosaic is on, because I drive 3 monitors with this board. But that should not matter if I do the experiment with the display manager service (lightdm) stopped. So I did the same experiment with the X display manager turned off (I lose two monitors). Can you take a look?
Result of nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 00000000:04:00.0 N/A | N/A |
| 38% 54C P0 N/A / N/A | 0MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 690 Off | 00000000:05:00.0 N/A | N/A |
| 35% 50C P0 N/A / N/A | 0MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
+-----------------------------------------------------------------------------+
terminalonly-cardall-nvml.log.tar.gz
Result of nvml
terminate called after throwing an instance of 'std::runtime_error'
what(): nvmlDeviceGetTopologyCommonAncestor(dev1, dev2, &topo) error: 3
terminalonly-nvml-cc-cardall-nvml.log.tar.gz
Result of nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 00000000:04:00.0 N/A | N/A |
| 37% 54C P0 N/A / N/A | 0MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
Note: here the correct address of card 0 (04:00.0) is shown.
terminalonly-card0-nvml.log.tar.gz
Result of nvml
terminate called after throwing an instance of 'std::runtime_error'
what(): nvmlDeviceGetHandleByIndex(1, &dev2) error: 2
terminalonly-nvml-cc-card0-nvml.log.tar.gz
Result of nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 00000000:05:00.0 N/A | N/A |
| 34% 48C P0 N/A / N/A | 0MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
Note: here the correct address of card 1 (05:00.0) is shown.
terminalonly-card1-nvml.log.tar.gz
Result of nvml
terminate called after throwing an instance of 'std::runtime_error'
what(): nvmlDeviceGetHandleByIndex(1, &dev2) error: 2
terminalonly-nvml-cc-card1-nvml.log.tar.gz
it seems like your card is an SLI slave. The easy workaround for this would be to unlink the SLI device if you need to pass the devices to different containers.
Is there another way to unlink SLI? If you know some harder way, let me know too!
@RenaudWasTaken, @flx42 , @3XX0
I believe it's solved! :) I could replicate the previous Terminal-only scenario, but with the X display running this time.
Using the NVIDIA X Server Settings, I turned off Mosaic mode (Surround), then created an X screen for each individual display, enabled Xinerama and restarted the machine.
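For reference, the same change can probably be done from the command line with nvidia-xconfig; I used the GUI, so treat these flag names as an untested sketch (check nvidia-xconfig --advanced-help for the exact options):
sudo nvidia-xconfig --no-base-mosaic --separate-x-screens --xinerama
sudo service lightdm restart   # or reboot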
As Renaud said earlier:
This should have been fixed in recent (or even slightly older) releases.
Fun fact: in the past I had made this configuration and it did not work at that time.
Now I can execute this on Unity:
Note: look how each GPU here has its own memory consumption; earlier it was the same value for both.
docker run -it --runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=all \
-e __NVML_DBG_LVL=DEBUG \
-e __NVML_DBG_FILE=/tmp/nvml.log \
--name cuda10 \
--rm \
nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 00000000:04:00.0 N/A | N/A |
| 37% 51C P8 N/A / N/A | 868MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 690 Off | 00000000:05:00.0 N/A | N/A |
| 33% 46C P8 N/A / N/A | 441MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
+-----------------------------------------------------------------------------+
docker run -it --runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=0 \
-e __NVML_DBG_LVL=DEBUG \
-e __NVML_DBG_FILE=/tmp/nvml.log \
--name cuda10 \
--rm \
nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 00000000:04:00.0 N/A | N/A |
| 38% 52C P8 N/A / N/A | 868MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
docker run -it --runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=1 \
-e __NVML_DBG_LVL=DEBUG \
-e __NVML_DBG_FILE=/tmp/nvml.log \
--name cuda10 \
--rm \
nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 Off | 00000000:05:00.0 N/A | N/A |
| 33% 46C P8 N/A / N/A | 441MiB / 1999MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
Thanks to your advice to disable SLI mode, I remembered to retry enabling Xinerama mode and test it. I leave it to you to add any comments or close the issue.
Woot!
Hi guys,
I was building my nvidia/cuda image from source, and after it completed successfully I got one strange error when selecting device 1; see the results below.
BTW: I am switching from nvidia-docker 1.0 to nvidia-docker 2.0.
OK
OK
Unknown Error