AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

CMake build abandoned? #4794

Closed andreyhristov closed 4 years ago

andreyhristov commented 4 years ago

Hi, I see that CMakeLists.txt was last changed 6 months ago, while the Makefile is much fresher (last edited 2 months ago). I built a .so with both CMake and the Makefile and then ran a few tests. Training is slower when building with CMake (I can't copy the defines right now). On one machine with a 2080s, training with the Makefile build (GPU=1, CUDNN=1, CUDNN_HALF=0, AVX=0 (oldish CPU without AVX2), OPENCV=0) takes 820-830 minutes. The CMake build took 1054 minutes (almost 30% more time). On another machine with a 2080s and an AVX2-capable CPU I trained with the settings above but AVX=1 and CUDNN_HALF=1, and commented out the generation of compute_70 code for CUDNN_HALF:

@@ -119,7 +119,7 @@
 ifeq ($(CUDNN_HALF), 1)
 COMMON+= -DCUDNN_HALF
 CFLAGS+= -DCUDNN_HALF
-ARCH+= -gencode arch=compute_70,code=[sm_70,compute_70]
+#ARCH+= -gencode arch=compute_70,code=[sm_70,compute_70]
 endif

as I had already uncommented the ARCH line for the 2080s:

 # GeForce RTX 2080 Ti, RTX 2080, RTX 2070, Quadro RTX 8000, Quadro RTX 6000, Quadro RTX 5000, Tesla T4, XNOR Tensor Cores
-# ARCH= -gencode arch=compute_75,code=[sm_75,compute_75]
+ARCH= -gencode arch=compute_75,code=[sm_75,compute_75]

The result was 1591 min, whereas with CUDNN_HALF=0, AVX=0 and sm_70 I was getting 800 min on the very same machine. I am puzzled.
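For reference, the `-gencode` value in the Makefile follows directly from the card's compute capability (NVIDIA's published values: Turing cards like the 2080 family are 7.5, Volta is 7.0, Pascal's GTX 1080 is 6.1). A small sketch of that mapping; the dictionary and helper name here are just for illustration, not part of darknet:

```python
# Map a GPU model to the Makefile's ARCH flag for it.
# Compute-capability values are NVIDIA's published ones.
COMPUTE_CAPABILITY = {
    "RTX 2080 Super": "75",  # Turing
    "RTX 2080 Ti": "75",     # Turing
    "Tesla V100": "70",      # Volta
    "GTX 1080": "61",        # Pascal
}

def arch_flag(gpu: str) -> str:
    """Build the -gencode flag in the same form the Makefile uses."""
    cc = COMPUTE_CAPABILITY[gpu]
    return f"-gencode arch=compute_{cc},code=[sm_{cc},compute_{cc}]"

print(arch_flag("RTX 2080 Super"))
# -gencode arch=compute_75,code=[sm_75,compute_75]
```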

AlexeyAB commented 4 years ago

I see that CMakeLists.txt was last changed 6 months ago while Makefile is much more fresh (last edited 2 months ago).

CMake is simply made well by @cenit and does not require changes.

The training time should be the same if in both cases you use: GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 DEBUG=0 (AVX, OPENMP, LIBSO, ... don't matter).

cenit commented 4 years ago

Of course I can make a mistake, and maybe some compiler options are wrong in a corner case. The goal of CMake anyway, as @AlexeyAB says, is to provide a stable and "self-adjusting" platform for any OS, any computing platform and any available dependency. So it is a good sign that it does not require constant updates :)

andreyhristov commented 4 years ago

@cenit You are right; still, the psychology is that if something hasn't been touched it might look abandoned. In this regard, a note about this in the documentation might be a hint to those who want to use the CMake build. I like CMake and use it wherever possible. Cheers, Andrey

cenit commented 4 years ago

Please let me know the precise results of your investigation: if you find any case in which a CMake-built darknet is slower than a make-built darknet (with an identical setup), please provide full logs so we can better understand what went wrong. Underneath, the compiler must be the same, so the results must also be equal!

andreyhristov commented 4 years ago

Sure, I am trying to help. It just takes half a day or more to test a configuration :( but I will keep you posted.

andreyhristov commented 4 years ago

Hi again. This is not about the CMake build, but I found what really killed the training performance: setting OPENCV to 0. 1000 iterations take about 100% more time when OpenCV is not used. Another observation is that CUDNN_HALF somehow doesn't speed up the training. Reading the source code, in most places the #ifdef is commented out. The parts that use cudnn_half (the Tensor Cores) kick in after 3000 iterations, but even after that I don't see much speedup in the generation of weights. Here are the weights created for a run with CUDNN_HALF=1 (the patch for the Makefile I use: https://pastebin.com/d513KiHr):

Test started at 21:52
feb  4 22:22 yolov3-obj_1000.weights
feb  4 22:51 yolov3-obj_2000.weights
feb  4 23:21 yolov3-obj_3000.weights
feb  4 23:50 yolov3-obj_4000.weights
feb  5 00:20 yolov3-obj_5000.weights
feb  5 00:49 yolov3-obj_6000.weights
feb  5 01:18 yolov3-obj_7000.weights
feb  5 01:46 yolov3-obj_8000.weights
feb  5 02:14 yolov3-obj_9000.weights

Here is the same for a build with CUDNN_HALF=0 (the patch for the Makefile I use: https://pastebin.com/qgkMcRCr):

Test started at 11:52
feb  5 12:22 yolov3-obj_1000.weights
feb  5 12:50 yolov3-obj_2000.weights
feb  5 13:17 yolov3-obj_3000.weights
feb  5 13:47 yolov3-obj_4000.weights
feb  5 14:16 yolov3-obj_5000.weights
feb  5 14:46 yolov3-obj_6000.weights
feb  5 15:14 yolov3-obj_7000.weights
feb  5 15:44 yolov3-obj_8000.weights
feb  5 16:13 yolov3-obj_9000.weights

Here is a comparison table:

iter     HALF=1   HALF=0
0k-1k:   30m      30m
1k-2k:   29m      28m
2k-3k:   30m      27m
3k-4k:   29m      30m
4k-5k:   30m      29m
5k-6k:   29m      30m
6k-7k:   29m      28m
7k-8k:   28m      30m
8k-9k:   28m      29m
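The per-interval numbers in the table above can be checked mechanically from the weight-file timestamps. A small sketch in plain Python, with the HH:MM values copied from the CUDNN_HALF=1 listing:

```python
from datetime import datetime

# Start time plus the mtimes of yolov3-obj_1000..9000.weights (HALF=1 run).
half1 = ["21:52", "22:22", "22:51", "23:21", "23:50",
         "00:20", "00:49", "01:18", "01:46", "02:14"]

def intervals(times):
    """Minutes between consecutive HH:MM stamps, assuming at most one
    rollover past midnight between neighbours."""
    out = []
    prev = None
    for t in times:
        cur = datetime.strptime(t, "%H:%M")
        if prev is not None:
            delta = (cur - prev).total_seconds() / 60
            if delta < 0:          # crossed midnight
                delta += 24 * 60
            out.append(int(delta))
        prev = cur
    return out

print(intervals(half1))
# [30, 29, 30, 29, 30, 29, 29, 28, 28]
```

These match the HALF=1 column of the table: the half-precision build is within a minute or two per 1000 iterations of the fp32 build.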

Hardware: HP Z440, E5-1650 v4, 128 GB RAM, Gigabyte RTX 2080s.

Thanks a lot for your time and attention!

cenit commented 4 years ago

OpenCV is important indeed, and that's why @AlexeyAB told you to verify that CMake found OpenCV. If not, it would still build darknet, just without it, hurting performance.

AlexeyAB commented 4 years ago

https://github.com/AlexeyAB/darknet#improvements-in-this-repository

improved performance 3.5 X times of data augmentation for training (using OpenCV SSE/AVX functions instead of hand-written functions) - removes bottleneck for training on multi-GPU or GPU Volta
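The 3.5x figure above refers to darknet replacing hand-written C augmentation loops with OpenCV's SSE/AVX routines. The principle is easy to see in miniature (a toy sketch, not darknet code): a vectorized brightness adjustment and an explicit per-pixel loop produce identical output, with the vectorized path far faster.

```python
import numpy as np

img = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)

def brighten_loop(a, delta):
    """Hand-written per-pixel loop: the slow path darknet moved away from."""
    out = a.astype(np.int16)
    h, w, c = out.shape
    for y in range(h):
        for x in range(w):
            for ch in range(c):
                out[y, x, ch] = min(255, max(0, int(out[y, x, ch]) + delta))
    return out.astype(np.uint8)

def brighten_vec(a, delta):
    """Vectorized version: one SIMD-friendly expression over the whole image."""
    return np.clip(a.astype(np.int16) + delta, 0, 255).astype(np.uint8)

assert np.array_equal(brighten_loop(img, 40), brighten_vec(img, 40))
```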

andreyhristov commented 4 years ago

Back to the CMake build. CMake doesn't detect that the card can do half precision. The card is an RTX 2080s; it seems CMake can't detect the architecture of the card at all. Here is the Dockerfile I use:

FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && \
    apt-get install -y \
    libopencv-dev \
    python-dev \
    python3-pip \
    libgomp1 \
    wget \
    git cmake  && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN mkdir darknet

RUN git clone https://github.com/AlexeyAB/darknet /darknet

RUN cd /darknet && cmake -L . 

The output from CMake is:

-- The C compiler identification is GNU 7.4.0
-- The CXX compiler identification is GNU 7.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for a CUDA compiler
-- Looking for a CUDA compiler - /usr/local/cuda/bin/nvcc
-- The CUDA compiler identification is NVIDIA 10.0.130
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "10.0") 
-- Automatic GPU detection failed. Building for common architectures.
-- Autodetected CUDA architecture(s): 3.0;3.5;5.0;5.2;6.0;6.1;7.0;7.0+PTX
-- Building with CUDA flags: -gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70
-- Your setup does not supports half precision (it requires CC >= 7.5)
-- Found OpenCV: /usr (found version "3.2.0") 
CMake Warning at CMakeLists.txt:131 (message):
  To build with OpenMP support you need CMake 3.11.0+

-- Found Stb: /darknet/3rdparty/stb/include  
--   ->  darknet is fine for now, but uselib_track has been disabled!
--   ->  Please rebuild OpenCV from sources with CUDA support to enable it
-- Found CUDNN: /usr/include (found version "7.6.5") 
-- CMAKE_CUDA_FLAGS: -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 --compiler-options " -Wall -Wno-unused-result -Wno-unknown-pragmas -Wfatal-errors -Wno-deprecated-declarations -Wno-write-strings -DGPU -DCUDNN -DOPENCV -fPIC -fopenmp -Ofast " 
-- ZED SDK not found
-- Configuring done
-- Generating done
-- Build files have been written to: /darknet
-- Cache values
BUILD_AS_CPP:BOOL=OFF
BUILD_SHARED_LIBS:BOOL=ON
BUILD_USELIB_TRACK:BOOL=FALSE
CMAKE_BUILD_TYPE:STRING=
CMAKE_CUDA_HOST_COMPILER:FILEPATH=
CMAKE_INSTALL_PREFIX:PATH=/darknet
CUDA_ARCHITECTURES:STRING=Auto
CUDA_HOST_COMPILER:FILEPATH=/usr/bin/cc
CUDA_SDK_ROOT_DIR:PATH=CUDA_SDK_ROOT_DIR-NOTFOUND
CUDA_TOOLKIT_ROOT_DIR:PATH=/usr/local/cuda
CUDA_USE_STATIC_CUDA_RUNTIME:BOOL=ON
CUDA_rt_LIBRARY:FILEPATH=/usr/lib/x86_64-linux-gnu/librt.so
ENABLE_CUDA:BOOL=ON
ENABLE_CUDNN:BOOL=ON
ENABLE_CUDNN_HALF:BOOL=FALSE
ENABLE_OPENCV:BOOL=ON
ENABLE_VCPKG_INTEGRATION:BOOL=ON
ENABLE_ZED_CAMERA:BOOL=FALSE
INSTALL_BIN_DIR:PATH=/darknet
INSTALL_CMAKE_DIR:PATH=share/darknet
INSTALL_INCLUDE_DIR:PATH=include/darknet
INSTALL_LIB_DIR:PATH=/darknet
MANUALLY_EXPORT_TRACK_OPTFLOW:BOOL=OFF
OpenCV_DIR:PATH=/usr/share/OpenCV
SELECT_OPENCV_MODULES:BOOL=OFF
Stb_DIR:PATH=/darknet/3rdparty/stb
USE_INTEGRATED_LIBS:BOOL=FALSE
ZED_DIR:PATH=ZED_DIR-NOTFOUND
cenit commented 4 years ago

Ubuntu 18.04 stock ships a very old CMake. You need to update it to get proper support for your brand new card.

Something along the following lines added to your script might do the trick:

CMAKE_VERSION="3.16.4"
wget --no-check-certificate https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}-Linux-x86_64.tar.gz
tar -xzf cmake-${CMAKE_VERSION}-Linux-x86_64.tar.gz
export PATH=$PWD/cmake-${CMAKE_VERSION}-Linux-x86_64/bin:$PATH
AlexeyAB commented 4 years ago

@cenit https://github.com/AlexeyAB/darknet/blob/d51d89053afc4b7f50a30ace7b2fcf1b2ddd7598/CMakeLists.txt#L68-L71

CUDNN_HALF should be available for CC >= 7.0 https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

andreyhristov commented 4 years ago

@cenit Yesterday I compiled OpenCV myself, because I wanted CUDA support in OpenCV without upgrading the stock 18.04 CMake, and OpenCV correctly detected the CUDA capabilities in the very same container. So it should be possible even with the ancient version. @AlexeyAB Yes, CUDNN_HALF is available for CC 7.0+, yet the CMake build is unable to detect the capabilities, so fp16 stays off. Once CMake is hinted to use 7.5 as the architecture, everything falls into place. On the performance side, for me at least, CUDNN_HALF doesn't bring any speedup (or if it does, it is 1-2%, which is within the run-to-run error). OpenMP doesn't bring anything either, even after installing libgomp1. AVX doesn't reduce the training time. Neither does OpenCV with CUDA. I don't see more cores used with OpenMP enabled (this is with the Makefile build, not the CMake build). Empirically, 6 cores (real cores are best) per 2070s, 2080 or 2080s are perfect for our training scenarios; I tested this by pinning the training process to specific cores with taskset. Going from 1 to 2 GPUs scales linearly, which is a silver lining.
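For anyone hitting the same detection failure: the cache variables shown in the log above (`CUDA_ARCHITECTURES`, `ENABLE_CUDNN_HALF`) can be set explicitly instead of relying on auto-detection. A sketch of the hint, assuming a recent enough CMake and that this darknet revision accepts the "7.5" spelling for the architecture:

```shell
# Override the failed GPU auto-detection and build for CC 7.5 directly
cd /darknet
cmake -DCUDA_ARCHITECTURES="7.5" -DENABLE_CUDNN_HALF=ON .
cmake --build . -- -j"$(nproc)"
```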

AlexeyAB commented 4 years ago

@andreyhristov

Most of the information is written here: https://github.com/AlexeyAB/darknet#improvements-in-this-repository

andreyhristov commented 4 years ago

@AlexeyAB Thank you for the clarification. Re OpenMP: I expected that, since some code still runs on the CPU, OpenMP might bring something there. It seems I was wrong, sorry. Re CUDNN_HALF=1: coming from TensorFlow training (Faster R-CNN), where half precision brings training times down significantly without needing more GPU memory, I had some expectations. That is part of the point of fp16 training: speeding it up and lowering memory usage so larger models can be trained, at a minimal loss of precision. In our case we run the TRT engine directly on a Xavier AGX, so inference with darknet is not an option. Btw, the link you gave mentions: "CUDNN_HALF=1 to build for Tensor Cores (on Titan V / Tesla V100 / DGX-2 and later) speedup Detection 3x, Training 2x". Re OpenCV with CUDA: ok, I did not know that, but again some experience gained. I was looking for a way to speed up data augmentation, since that runs on the CPU right now, right? All in all, it seems there is nothing I can gain in speed by tweaking flags and speeding up dependencies. Thanks for all your work!