Recommended CUDA/cuDNN versions for running darknet

AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)

http://pjreddie.com/darknet/

Recommended CUDA/cuDNN versions for running darknet #1338

Open fabiannagel opened 6 years ago

fabiannagel commented 6 years ago

I've been trying to run darknet with various combinations of CUDA and cuDNN, but something always goes wrong. Once I fix the compilation errors, I get "Floating point error" or "Floating point error (core dumped)" right after I start training.

Since I want to run darknet on a minimal K80 Google Cloud machine, I don't really care about the software requirements. Can anybody give me some insights? Which Linux distribution and which CUDA/cuDNN/gcc versions are you using so that everything works for you?

Thanks a lot!

Background information:

I tried Ubuntu 17.10 with the CUDA 9.0 Debian package for Ubuntu 17.04. Weirdly enough, this led to an installed CUDA v8 (using nvcc --version). Installing cuDNN 9.1 gave me a CUDNN_MAJOR=7... no idea what's happening here. I had to build without OPEN_MP here since this would give me additional errors. After starting training, I got a "Floating point error (core dumped)".
Using the pre-built "Deep Learning VM" from Google Cloud Marketplace, I tried running darknet with CUDA 9.2.148 and cuDNN 7.1 (patch level 4). Building worked fine but I also ended up with a "Floating point error".

AlexeyAB commented 6 years ago

I had to build without OPEN_MP

If you use the GPU, it doesn't matter whether OPENMP is set to 0 or 1.

an installed CUDA v8 (using nvcc --version). Installing cuDNN 9.1 gave me a CUDNN_MAJOR=7... no idea what's happening here.

There is no cuDNN 9.1 (see https://developer.nvidia.com/rdp/cudnn-archive). There are:

cuDNN v7.0.5 (Dec 11, 2017), for CUDA 9.1
cuDNN v7.1.2 (Mar 21, 2018), for CUDA 9.1 & 9.2

But if you use CUDA v8, you should use the cuDNN build for CUDA 8.0, i.e. cuDNN v7.0.5 (Dec 5, 2017), for CUDA 8.0:

cuDNN v7.0.5 Developer Library for Ubuntu16.04 (Deb): https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v7.0.5/prod/8.0_20171129/Ubuntu16_04-x64/libcudnn7-dev_7.0.5.15-1+cuda8.0_amd64
cuDNN v7.0.5 Developer Library for Ubuntu14.04 (Deb): https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v7.0.5/prod/8.0_20171129/Ubuntu14_04-x64/libcudnn7-dev_7.0.5.15-1+cuda8.0_amd64
cuDNN v7.0.5 Library for Linux (source code): https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v7.0.5/prod/8.0_20171129/cudnn-8.0-linux-x64-v7
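
To check which CUDA and cuDNN versions are actually installed before building, something like the following may help. It is only a minimal sketch: the function names are hypothetical, and it assumes the cuDNN v7 header layout, where cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, and the usual "release X.Y" line in the output of nvcc --version.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) read from the text of a cudnn.h-style header."""
    version = []
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        # Matches lines such as: #define CUDNN_MAJOR 7
        match = re.search(
            r"^\s*#\s*define\s+CUDNN_" + name + r"\s+(\d+)",
            header_text,
            re.MULTILINE,
        )
        if match is None:
            raise ValueError("CUDNN_" + name + " not found in header text")
        version.append(int(match.group(1)))
    return tuple(version)


def parse_nvcc_release(nvcc_output):
    """Return the CUDA release string (e.g. '8.0') from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' entry found in nvcc output")
    return match.group(1)


# Sample inputs shaped like the real files (values chosen to match this thread).
sample_header = """
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
"""
sample_nvcc = "Cuda compilation tools, release 8.0, V8.0.61"

print("cuDNN:", parse_cudnn_version(sample_header))
print("CUDA:", parse_nvcc_release(sample_nvcc))
```

In practice you would pass in the contents of the installed header (often /usr/include/cudnn.h for .deb installs, but the location varies) and the captured output of nvcc --version. A CUDNN_MAJOR of 7 together with CUDA 8.0 is consistent with the cuDNN v7.0.5 for CUDA 8.0 packages listed above.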

fabiannagel commented 6 years ago

Thank you for the fast reply! I was able to install CUDA 8.0 with cuDNN 7.0.5 on Ubuntu 16.04, in both cases using the .deb packages. I even ran the mnistCUDNN verification tool that is mentioned here and everything seems to work fine.

Building and running darknet with GPU=1 and CUDNN=0 works fine. If I also set CUDNN=1, I can build it but again, I immediately get this output:

Loading weights from darknet53.conv.74...Done!
Floating point exception (core dumped)

Interestingly, I can't get rid of this error anymore. Even if I run make clean and build for CPU only, I still get the same floating point exception. Do you think another GPU like a V100 would cause less trouble? Or should I simply build without CUDNN? Thanks again in advance!

AlexeyAB commented 6 years ago

Loading weights from darknet53.conv.74...Done! Floating point exception (core dumped)

What command do you use to get this error?
Can you show a screenshot?

Building and running darknet with GPU=1 and CUDNN=0 works fine. Even if I run make clean and build for CPU only, I still get the same floating point exception.

Do you mean that you get this error only with GPU=1 CUDNN=1 or with GPU=0 CUDNN=0? But it works fine with GPU=1 CUDNN=0, doesn't it?
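
One way to answer this systematically is to do a clean rebuild for each configuration and test them one at a time. The sketch below only generates the commands; it assumes the Makefile switches GPU and CUDNN can be overridden on the make command line (standard make behaviour for plain variable assignments), and the helper name is hypothetical.

```python
# The three build configurations discussed in this thread.
CONFIGS = [
    {"GPU": 0, "CUDNN": 0},  # CPU only
    {"GPU": 1, "CUDNN": 0},  # GPU without cuDNN
    {"GPU": 1, "CUDNN": 1},  # GPU with cuDNN
]


def build_command(config):
    """Return a shell command that cleans and rebuilds with the given switches."""
    settings = " ".join(f"{name}={value}" for name, value in config.items())
    return f"make clean && make {settings}"


for config in CONFIGS:
    print(build_command(config))
```

Running the same darknet command after each rebuild shows which switch introduces the floating point exception. The clean step matters because object files left over from a previous configuration could otherwise be reused.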

Also, for the Tesla K80 you can try setting -gencode arch=compute_37,code=sm_37 \ here: https://github.com/AlexeyAB/darknet/blob/b3b78afb8f313231ab771367af3a60cfedd98c11/Makefile#L16

and then run make.

As mentioned for K80 here: https://en.wikipedia.org/wiki/CUDA#GPUs_supported and here: http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
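
The -gencode flag above follows directly from the GPU's compute capability (3.7 for the K80). A minimal sketch of that mapping, with a hypothetical helper name and a small lookup table limited to the two GPUs mentioned in this thread:

```python
# Compute capability (major, minor) for the GPUs mentioned in this thread.
COMPUTE_CAPABILITY = {
    "Tesla K80": (3, 7),
    "Tesla V100": (7, 0),
}


def gencode_flag(major, minor):
    """Build the nvcc -gencode flag for one compute capability."""
    cc = f"{major}{minor}"
    return f"-gencode arch=compute_{cc},code=sm_{cc}"


for gpu, (major, minor) in COMPUTE_CAPABILITY.items():
    print(gpu, "->", gencode_flag(major, minor))
```

For the K80 this produces -gencode arch=compute_37,code=sm_37, the value suggested above. Whether a given CUDA toolkit accepts a particular architecture depends on the toolkit version, so check the nvcc documentation for your release before adding newer architectures.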