NVIDIA / DIGITS

Deep Learning GPU Training System
https://developer.nvidia.com/digits
BSD 3-Clause "New" or "Revised" License

Getting error: Check failed: error == cudaSuccess (8 vs. 0) invalid device function #1863

Open ManojGuha opened 7 years ago

ManojGuha commented 7 years ago

I have successfully installed NVIDIA DIGITS, NVCaffe, CUDA 9, cuDNN 7, and NVIDIA driver 384 on Ubuntu 16.04. I am using an NVIDIA GeForce 940MX GPU. When I try to train a Caffe model on my dataset, I get the error in the title. Running nvcc --version reports a successful CUDA installation, and I can import caffe in Python from the terminal, but the error persists. I first set up the DIGITS Docker image and then installed Caffe, CUDA, and cuDNN. I am new to deep learning and cannot tell whether the problem lies with DIGITS, the driver, or the CUDA installation. Please help.

apolo74 commented 6 years ago

Hi ManojGuha, I'm having the same problem and was wondering whether you figured out how to solve it. I also installed DIGITS from the nvidia-docker image, so I expected the bundled libraries such as CUDA and cuDNN to work without compatibility problems, but Caffe gives this error every time inside DIGITS. I hope you can share anything that helped.

ManojGuha commented 6 years ago

Hi apolo74, sorry for the late reply. I tried many ways to solve the problem inside Docker, but nothing worked. I suspected that Docker was having trouble accessing the NVIDIA GPU, perhaps because of an incorrect driver or CUDA version, so I removed nvidia-docker completely and installed all dependencies on the local system. The problem went away once I installed NVIDIA DIGITS from source, without Docker, following the instructions at https://github.com/NVIDIA/DIGITS/blob/master/docs/BuildDigits.md . I hope this helps. Have a good day :)

ividal commented 6 years ago

A dreaded "me too", since, unlike for @ManojGuha, running DIGITS outside a Docker container is not an option I want to take.

I get ERROR: Check failed: error == cudaSuccess (8 vs. 0) invalid device function for any Caffe-defined network (including the three default classification networks) when running DIGITS inside a Docker container. Environment:

In case it helps, I am able to train TensorFlow-defined networks with DIGITS. Training code in other (non-DIGITS) Docker containers also works well on the same machine, so this is not a Docker or NVIDIA driver issue. It seems to be related to the Caffe build inside the latest three NVIDIA DIGITS Docker images.

Any help? Thanks!

ividal commented 6 years ago

For those who want to keep using Docker: what solved it for me was manually editing the Dockerfile used to build the image so that my GPU's CUDA architecture (50 for my Quadro GPU) is included when Caffe is built.

For instance, for DIGITS v6.0, this is the original Dockerfile.

I changed:

-DCUDA_ARCH_BIN="35 52 60 61"

to:

-DCUDA_ARCH_BIN="35 50 52 60 61"

If you have a different GPU, check which architecture it corresponds to (e.g. here).
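
A rough way to see why adding 50 matters: a binary compiled for compute capability X.y runs only on devices of capability X.z with z >= y, so neither 35 nor 52 covers a 5.0 GPU. The Python sketch below checks that rule against a CUDA_ARCH_BIN string. The helper names parse_arch_list and is_covered are hypothetical, not part of DIGITS or Caffe, and the sketch assumes the build ships no PTX that could be JIT-compiled for the missing architecture.

```python
def parse_arch_list(arch_bin):
    """Turn a CUDA_ARCH_BIN string such as "35 52 60 61" into (major, minor) pairs."""
    pairs = []
    for token in arch_bin.replace(",", " ").replace(";", " ").split():
        digits = token.replace(".", "")  # accept both "50" and "5.0"
        pairs.append((int(digits[:-1]), int(digits[-1])))
    return pairs


def is_covered(device_cc, arch_bin):
    """Return True if some listed binary can run on a device of capability device_cc.

    Rule applied (hypothetical helper, binary code only): same major version,
    and the binary's minor version must not exceed the device's.
    """
    major, minor = device_cc
    return any(m == major and n <= minor for m, n in parse_arch_list(arch_bin))


if __name__ == "__main__":
    original = "35 52 60 61"
    patched = "35 50 52 60 61"
    device = (5, 0)  # compute capability 5.0, i.e. architecture 50
    print("original list covers 5.0:", is_covered(device, original))
    print("patched list covers 5.0:", is_covered(device, patched))
```

If the function returns False for your GPU's compute capability, the Caffe build in the image probably has no kernels your device can run, which is consistent with the invalid device function error.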

Hope that helps other people that were stuck!