AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Recommended CUDA/cuDNN versions for running darknet #1338

Open fabiannagel opened 6 years ago

fabiannagel commented 6 years ago

I've been trying to run darknet with various combinations of CUDA and cuDNN, but something always goes wrong. Whenever I manage to fix the compilation errors, I get "Floating point exception" or "Floating point exception (core dumped)" right after I start training.

Since I want to run darknet on a minimal K80 Google Cloud machine, I'm not tied to any particular software versions. Can anybody give me some pointers? Which Linux distribution and which CUDA/cuDNN/gcc versions are you using so that everything works for you?

Thanks a lot!

Background information:

AlexeyAB commented 6 years ago

> I had to build without OPEN_MP

If you use the GPU, it doesn't matter whether OPENMP=0 or OPENMP=1.

> an installed CUDA v8 (using nvcc --version). Installing cuDNN 9.1 gave me a CUDNN_MAJOR=7... no idea what's happening here.

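There is no cuDNN 9.1; see the list of releases at https://developer.nvidia.com/rdp/cudnn-archive. cuDNN has its own version numbers (7.x at this point), and each release is offered in builds for specific CUDA versions, so a build "for CUDA 9.1" is still cuDNN 7 and reports CUDNN_MAJOR=7.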


But if you use CUDA v8, you should use a cuDNN build for CUDA 8.0, e.g. the archive entry "cuDNN v7.0.5 (Dec 5, 2017), for CUDA 8.0".
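To see which versions are actually installed (the nvcc --version and CUDNN_MAJOR checks mentioned above), both can be read programmatically. A minimal Python sketch, assuming the cuDNN header sits in one of the listed paths (the cuDNN 7 .deb packages typically expose /usr/include/cudnn.h; cuDNN 8 and later keep the version macros in cudnn_version.h). The function names are illustrative and not part of darknet or CUDA. Note that the header does not record which CUDA version a cuDNN build targets, so that still has to be matched against the archive entry you downloaded.

```python
import re
import shutil
import subprocess
from pathlib import Path

# Assumed header locations (not exhaustive): cuDNN 7 .deb packages typically expose
# /usr/include/cudnn.h; cuDNN 8+ keeps the version macros in cudnn_version.h.
CUDNN_HEADERS = [
    "/usr/include/cudnn.h",
    "/usr/include/cudnn_version.h",
    "/usr/local/cuda/include/cudnn.h",
    "/usr/local/cuda/include/cudnn_version.h",
]

_CUDNN_MACRO = re.compile(r"^#define\s+CUDNN_(MAJOR|MINOR|PATCHLEVEL)\s+(\d+)", re.MULTILINE)
_NVCC_RELEASE = re.compile(r"release (\d+)\.(\d+)")


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from the CUDNN_* macros, or None if any is missing."""
    found = {name: int(value) for name, value in _CUDNN_MACRO.findall(header_text)}
    if not all(key in found for key in ("MAJOR", "MINOR", "PATCHLEVEL")):
        return None
    return found["MAJOR"], found["MINOR"], found["PATCHLEVEL"]


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) from `nvcc --version` output, e.g. 'release 8.0' -> (8, 0)."""
    match = _NVCC_RELEASE.search(nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def report():
    """Print the CUDA toolkit and cuDNN versions found on this machine."""
    if shutil.which("nvcc"):
        out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
        print("CUDA toolkit (nvcc):", parse_nvcc_release(out))
    else:
        print("nvcc not found on PATH")
    for path in CUDNN_HEADERS:
        header = Path(path)
        if header.is_file():
            version = parse_cudnn_version(header.read_text(errors="replace"))
            if version:
                print(f"cuDNN from {path}:", version)
                return
    print("no cuDNN header found in the assumed locations")


if __name__ == "__main__":
    report()
```

For the CUDA 8.0 / cuDNN 7.0.5 combination recommended above, this should report a CUDA release of (8, 0) and a cuDNN version of (7, 0, 5).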

fabiannagel commented 6 years ago

Thank you for the fast reply! I was able to install CUDA 8.0 with cuDNN 7.0.5 on Ubuntu 16.04, in both cases using the .deb packages. I even ran the mnistCUDNN verification tool that is mentioned here and everything seems to work fine.

Building and running darknet with GPU=1 and CUDNN=0 works fine. If I also set CUDNN=1, it builds, but again I immediately get this output:

Loading weights from darknet53.conv.74...Done!
Floating point exception (core dumped)

Interestingly, I can't get rid of this error anymore: even if I run make clean and build for CPU only, I still get the same floating point exception. Do you think another GPU, like a V100, would cause fewer problems? Or should I simply build without CUDNN? Thanks again in advance!
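Switching between the GPU/CUDNN builds described above by hand makes it easy to lose track of which configuration is being tested. A minimal Python sketch, assuming the switches appear as plain NAME=value lines at the start of a line in darknet's Makefile (as GPU=, CUDNN= and OPENMP= do); set_make_flags is a hypothetical helper, not part of darknet.

```python
import re


def set_make_flags(makefile_text, **flags):
    """Return makefile_text with lines such as 'GPU=0' rewritten to the given values.

    Only lines that begin exactly with NAME= are touched, so 'CUDNN=' does not
    match 'CUDNN_HALF=', and ifeq blocks that merely mention $(GPU) are left alone.
    Raises KeyError if a requested flag is not present.
    """
    out = makefile_text
    for name, value in flags.items():
        pattern = re.compile(rf"^{re.escape(name)}=.*$", re.MULTILINE)
        out, count = pattern.subn(f"{name}={value}", out)
        if count == 0:
            raise KeyError(f"flag {name} not found in Makefile")
    return out
```

For example, writing set_make_flags(text, GPU=1, CUDNN=0) back to the Makefile and then running make clean && make gives a GPU build without cuDNN. With GNU make the same override can also be given on the command line (make GPU=1 CUDNN=0), because command-line variable assignments take precedence over ordinary assignments in the makefile.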

AlexeyAB commented 6 years ago

> Loading weights from darknet53.conv.74...Done!
> Floating point exception (core dumped)

  1. What command do you use to get this error?
  2. Can you show a screenshot?

> Building and running darknet with GPU=1 and CUDNN=0 works fine. Even if I run make clean and build for CPU only, I still get the same floating point exception.

  1. Do you mean that you get this error only with GPU=1 CUDNN=1 or GPU=0 CUDNN=0, but it works fine with GPU=1 CUDNN=0?
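To answer these questions systematically, each meaningful GPU/CUDNN combination can be rebuilt from scratch and run once, recording whether the process died from SIGFPE, the signal behind the shell's "Floating point exception (core dumped)" message. A minimal Python sketch, assuming it is run from the darknet directory with GNU make; on POSIX, Python's subprocess reports a process killed by signal N as return code -N. The helper names are hypothetical.

```python
import signal
import subprocess


def build_configurations():
    """GPU/CUDNN combinations worth testing; CUDNN=1 is only meaningful with GPU=1."""
    return [(gpu, cudnn) for gpu in (0, 1) for cudnn in (0, 1) if gpu or not cudnn]


def describe_exit(returncode):
    """Turn a subprocess return code into a short status string."""
    if returncode == 0:
        return "ok"
    if returncode < 0:  # on POSIX, -N means the process was killed by signal N
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit code {returncode}"


def build_and_run(gpu, cudnn, run_cmd):
    """Clean-rebuild with the given flags, run run_cmd once, and describe how it ended."""
    subprocess.run(["make", "clean"], check=True)
    # GNU make: variables given on the command line override the Makefile's GPU=/CUDNN= lines.
    subprocess.run(["make", f"GPU={gpu}", f"CUDNN={cudnn}"], check=True)
    return describe_exit(subprocess.run(run_cmd).returncode)
```

A run loop would call build_and_run(gpu, cudnn, cmd) for each pair from build_configurations(), with cmd set to the exact training command that crashes (for example a list starting with "./darknet"); a result of "SIGFPE" for one configuration but "ok" for another narrows the problem down.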

Also, for the Tesla K80 you can try adding the line -gencode arch=compute_37,code=sm_37 \ to the -gencode list in the Makefile here: https://github.com/AlexeyAB/darknet/blob/b3b78afb8f313231ab771367af3a60cfedd98c11/Makefile#L16

Then rebuild with make.

The K80 has compute capability 3.7, as listed here: https://en.wikipedia.org/wiki/CUDA#GPUs_supported and here: http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
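The compute_37/sm_37 values come straight from the K80's compute capability 3.7. A small Python sketch that formats such flags; the table only holds the two GPUs discussed in this thread (K80 = 3.7, V100 = 7.0), the ARCH= variable name is an assumption about how the Makefile collects these flags, and the helper names are hypothetical. Note that sm_70 (Volta) requires CUDA 9 or later, so the V100 flag will not compile with the CUDA 8 toolkit discussed above.

```python
# Compute capabilities of the two GPUs discussed in this thread (from NVIDIA's tables).
COMPUTE_CAPABILITY = {
    "Tesla K80": (3, 7),
    "Tesla V100": (7, 0),
}


def gencode_flag(major, minor):
    """nvcc option that compiles native code (SASS) for one compute capability."""
    cc = f"{major}{minor}"
    return f"-gencode arch=compute_{cc},code=sm_{cc}"


def arch_assignment(gpu_names):
    """Build an ARCH= value with one -gencode flag per distinct architecture,
    joined with Makefile line continuations."""
    flags = sorted({gencode_flag(*COMPUTE_CAPABILITY[name]) for name in gpu_names})
    return "ARCH= " + " \\\n      ".join(flags)


if __name__ == "__main__":
    print(arch_assignment(["Tesla K80"]))
```

Listing several architectures makes the binary larger and the build slower, which is why a single K80 machine only needs the one compute_37 entry.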