NERSC / shifter

Shifter - Linux Containers for HPC

caffe with cuda problems #257

Closed woonghu closed 5 years ago

woonghu commented 5 years ago

I have a Docker image (based on nvidia/cuda:10.0-cudnn-7-devel-centos7), and I converted it to a Shifter image.

This image also includes the Caffe library.

I tried to run a computation with Caffe in this image, but it fails with a CUDA error:

    E0730 02:56:02.434150 29176 common.cpp:114] Cannot create Cublas handle. Cublas won't be available.
    E0730 02:56:02.434254 29176 common.cpp:121] Cannot create Curand generator. Curand won't be available.
    F0730 02:56:02.434366 29176 common.cpp:152] Check failed: error == cudaSuccess (35 vs. 0) CUDA driver version is insufficient for CUDA runtime version

This is my udiRoot.conf for the CUDA settings:

    ...
    # https://github.com/NERSC/shifter/issues/223
    module_nvidia_siteEnvAppend=LD_LIBRARY_PATH=/opt/udiImage/modules/nvidia PATH=/nvidia-bin PATH=/cuda/bin
    module_nvidia_siteFs=/usr/bin:/nvidia-bin;/usr/local/cuda:/cuda
    module_nvidia_copyPath=/usr/lib64/nvidia
    ...

By the way, when I run this image through Docker, it finishes successfully.

I ran this image through Slurm with GRES and gave the job only 1 GPU device.

I found a difference between the two: in the Shifter instance, the CUDA Version is N/A.

  1. Docker Instance

    $ nvidia-smi
    Tue Jul 30 02:46:06 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla P100-PCIE...  Off  | 00000000:14:00.0 Off |                    0 |
    | N/A   32C    P0    33W / 250W |      0MiB / 16280MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla P100-PCIE...  Off  | 00000000:15:00.0 Off |                    0 |
    | N/A   33C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla P100-PCIE...  Off  | 00000000:39:00.0 Off |                    0 |
    | N/A   37C    P0    31W / 250W |      0MiB / 16280MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla P100-PCIE...  Off  | 00000000:3A:00.0 Off |                    0 |
    | N/A   35C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

  2. Shifter Instance

    $ nvidia-smi
    Tue Jul 30 04:00:27 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla P100-PCIE...  Off  | 00000000:14:00.0 Off |                    0 |
    | N/A   32C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

Is there any hint for this?
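
A "CUDA Version: N/A" line from nvidia-smi often suggests that the driver library libcuda.so could not be loaded in that environment, which would also fit the "driver version is insufficient" error above. A minimal sketch for checking this from inside the container, assuming a POSIX shell; `find_libcuda` is a hypothetical helper written for illustration, not a Shifter or NVIDIA tool:

```shell
# Hypothetical helper: print every libcuda.so* file found in a
# colon-separated list of directories, one path per line.
# Exit status is 0 if at least one file was found, non-zero otherwise.
# The body runs in a subshell so the IFS change does not leak out.
find_libcuda() (
    IFS=:
    found=1
    for dir in $1; do
        # Skip empty fields such as the one produced by "a::b".
        [ -n "$dir" ] || continue
        for lib in "$dir"/libcuda.so*; do
            # An unmatched glob stays literal, so test that the entry exists.
            if [ -e "$lib" ]; then
                printf '%s\n' "$lib"
                found=0
            fi
        done
    done
    [ "$found" -eq 0 ]
)
```

With the udiRoot.conf shown above, running `find_libcuda "/opt/udiImage/modules/nvidia"` in the Shifter instance would show whether any libcuda.so file is present in the directory that LD_LIBRARY_PATH points at. An empty result would point toward the module's source directory on the host rather than toward Caffe.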

woonghu commented 5 years ago

This issue was caused by missing CUDA library files. I copied the /usr/lib64/libcuda* files to /usr/lib64/nvidia, and now it works. Thank you.
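
The fix described here can be sketched as a small shell function. This is an illustration under assumptions, not the exact command that was run: `stage_libcuda` is a hypothetical name, and `cp -P` is assumed to behave as in GNU cp (the host is CentOS 7), where it copies symlinks as symlinks.

```shell
# Hypothetical helper: copy the driver's libcuda* files from src_dir into
# dst_dir, the directory that module_nvidia_copyPath points at.
# Returns non-zero if dst_dir is missing or no libcuda* file was copied.
stage_libcuda() {
    src_dir=$1
    dst_dir=$2
    [ -d "$dst_dir" ] || return 1
    copied=0
    for lib in "$src_dir"/libcuda*; do
        # Skip the literal pattern when the glob matches nothing.
        [ -e "$lib" ] || [ -L "$lib" ] || continue
        # Assumption: GNU cp, where -P keeps symlinks such as
        # libcuda.so -> libcuda.so.1 as symlinks instead of duplicating files.
        cp -P "$lib" "$dst_dir"/ || return 1
        copied=$((copied + 1))
    done
    [ "$copied" -gt 0 ]
}
```

On the compute node this would be called as root with `stage_libcuda /usr/lib64 /usr/lib64/nvidia`. Since the configuration above sets LD_LIBRARY_PATH to /opt/udiImage/modules/nvidia, the copied files should then be visible to the CUDA runtime in a newly started Shifter instance.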