m1kit commented 4 years ago

Bug Overview

Layer loading is extremely slow, takes about 30 minutes
After layer loading, training fails: avg loss = nan
The bug occurs only if darknet is built with CUDNN=1

Reproduction

Created Dockerfile to reproduce the bug. I built the docker image on a Workstation(refer a environment section).
Prepared my custom dataset and set up on the Workstation. I uploaded a configuration file here.
Run a docker container, with dataset folder mounted and gpu provided.

Log

$ docker run -v ~/darknet/mydata:/app/darknet/data -it --gpus device=0 myname/darknet:1.0.3 detector train -dont_show ./data/obj.data ./data/yolo4.cfg ./data/backup/yolov4.conv.137
 CUDA-version: 10020 (11010), cuDNN: 7.6.5, CUDNN_HALF=1, GPU count: 1  
 CUDNN_HALF=1 
 OpenCV version: 3.2.0
valid: Using default 'data/train.txt'
yolo4
 0 : compute_capability = 860, cudnn_half = 1, GPU: GeForce RTX 3090 
net.optimized_memory = 0 
mini_batch = 2, batch = 64, time_steps = 1, train = 1 
   layer   filters  size/strd(dil)      input                output
   0 
### HANG UP HERE FOR 30 MINUTES ###
conv     32       3 x 3/ 1    768 x 768 x   3 ->  768 x 768 x  32 1.019 BF
   1 conv     64       3 x 3/ 2    768 x 768 x  32 ->  384 x 384 x  64 5.436 BF
   2 conv     64       1 x 1/ 1    384 x 384 x  64 ->  384 x 384 x  64 1.208 BF
   3 route  1                                  ->  384 x 384 x  64 
   4 conv     64       1 x 1/ 1    384 x 384 x  64 ->  384 x 384 x  64 1.208 BF
   5 conv     32       1 x 1/ 1    384 x 384 x  64 ->  384 x 384 x  32 0.604 BF
   6 conv     64       3 x 3/ 1    384 x 384 x  32 ->  384 x 384 x  64 5.436 BF
   7 Shortcut Layer: 4,  wt = 0, wn = 0, outputs: 384 x 384 x  64 0.009 BF
   8 conv     64       1 x 1/ 1    384 x 384 x  64 ->  384 x 384 x  64 1.208 BF
   9 route  8 2                                ->  384 x 384 x 128 
### ... ###
Total BFLOPS 203.057 
avg_outputs = 1670200 
 Allocate additional workspace_size = 52.43 MB 
Loading weights from ./data/backup/yolov4.conv.137...
 seen 64, trained: 0 K-images (0 Kilo-batches_64) 
Done! Loaded 137 layers from weights-file 
Learning Rate: 0.001, Momentum: 0.949, Decay: 0.0005
 Detection layer: 139 - type = 28 
 Detection layer: 150 - type = 28 
 Detection layer: 161 - type = 28 
Resizing, random_coef = 1.40 

 1120 x 1120 
 Create 6 permanent cpu-threads 
 try to allocate additional workspace_size = 52.43 MB 
 CUDA allocate done! 
Loaded: 0.000061 seconds
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 139 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.509620, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 15572.158203, iou_loss = 0.000000, total_loss = 15572.158203 
### ... ###
 1: -nan, -nan avg loss, 0.000000 rate, 5.492178 seconds, 64 images, -1.000000 hours left
### ... ###
 2: -nan, -nan avg loss, 0.000000 rate, 5.540169 seconds, 128 images, 45.767176 hours left

see here for full log

Environment

Docker container on the Workstation

Ubuntu 18.04.5 LTS
darknet d65909fbea471d06e52a2e4a41132380dc2edaa6
CUDA 10.2
cuDNN 7.6.5
1x RTX 3090 provided

Workstation

Ubuntu 20.04.1 LTS
2x RTX 3090 installed
Docker version 19.03.13, build 4484c46d9d

Client (for investigation)

Windows 10 build 19608
1x GTX 1080 installed
CUDA 10.0
cuDNN 7.6.0

Investigation

Validation of dataset

Apart from the workstation, I installed darknet using vcpkg on the client(windows), and trained the same dataeset with gpu. It worked well and trained model seems to be fine, therefore I think there is nothing wrong with dataset.

Identification of causes

I changed compile option and tested.

CUDNN=0 CUDNN_HALF=0: fine(ofcource it's slow since cuDNN is disabled)
CUDNN=1 CUDNN_HALF=0: bug occurs
CUDNN=1 CUDNN_HALF=1: bug occurs

Thus, I concluded this bug occurs only if darknet is built with CUDNN=1.

Also, I tested some CUDA version.

CUDA 10.0 bug occurs
CUDA 10.1 bug occurs
CUDA 10.2 bug occurs
CUDA 11.0 compile fail Unsupported gpu architecture 'compute_30'

CUDA versions do not matter. It seems to be a problem with CUDA 11 is another issue.

And then, I tested some darknet version. I'm sorry that I forget the version I tested, but it still failed with version around 2020.03 . Therefore I think it's not caused by recent change. This might be a compatibility issue with RTX 3090, docker, or cuDNN 7.6.5.

LeKristapino commented 4 years ago

I'm facing the exact same issue. Also running on RTX 3090. Should not be docker-related issue, since CUDNN is working fine on AWS GPU accelerated instances (using Tesla cards) through docker

m1kit commented 4 years ago

I tested cuDNN with TITAN X, it worked well. Maybe darknet does not fully support RTX3090...

takashide commented 4 years ago

I tried CUDNN=1 CUDNN_HALF=1 cudnn 8.0.4 cuda 11.1 CV 4.5

comment out compute_30 "+" -gencode arch=compute_86,code=[sm_86,compute_86]

Build is OK!

LeKristapino commented 4 years ago

@takashide Thanks for the solution. Works great on CUDA 11.1 CUDNN Docker Image, with your suggested modifications

m1kit commented 4 years ago

Thank you @takashide 👍

AlgirdasKartavicius commented 4 years ago

@takashide Thanks for the solution. Works great on CUDA 11.1 CUDNN Docker Image, with your suggested modifications

Do you have inference performace metrics with rtx3090?

takashide commented 4 years ago

@takashide Thanks for the solution. Works great on CUDA 11.1 CUDNN Docker Image, with your suggested modifications

Do you have inference performace metrics with rtx3090?

In my experience, it's more than twice as much as 1080ti.

LeKristapino commented 4 years ago

@AlgirdasKartavicius I recently compared RTX3090 vs MSI gs65 laptop running RTX 2070

Inference: RTX 3090 - 0,047 seconds RTX 2070 Laptop card - 0,11 seconds

Planning to compare it also with 3070, 3080 and any other NVIDIA cards I can get my hands on, since it's hard to find good comparisons for deep learning, and yolo specifically.

textcolor commented 3 years ago

unfortunately I don't seem to get this to work. Whenever I try to make the project, I get Unsupported gpu architecture 'compute_86' (On a 3090, previous 2080Ti worked just fine) My current makefile looks like this:

GPU=1
CUDNN=1
CUDNN_HALF=0
OPENCV=0
AVX=0
OPENMP=0
LIBSO=0
ZED_CAMERA=0
ZED_CAMERA_v2_8=0

# set GPU=1 and CUDNN=1 to speedup on GPU
# set CUDNN_HALF=1 to further speedup 3 x times (Mixed-precision on Tensor Cores) GPU: Volta, Xavier, Turing and higher
# set AVX=1 and OPENMP=1 to speedup on CPU (if error occurs then set AVX=0)
# set ZED_CAMERA=1 to enable ZED SDK 3.0 and above
# set ZED_CAMERA_v2_8=1 to enable ZED SDK 2.X

USE_CPP=0
DEBUG=0

ARCH= -gencode arch=compute_35,code=sm_35 \
      -gencode arch=compute_50,code=[sm_50,compute_50] \
      -gencode arch=compute_52,code=[sm_52,compute_52] \
            -gencode arch=compute_61,code=[sm_61,compute_61]

OS := $(shell uname)

# GeForce RTX 3070, 3080, 3090
ARCH= -gencode arch=compute_86,code=[sm_86,compute_86]

# Kepler GeForce GTX 770, GTX 760, GT 740
# ARCH= -gencode arch=compute_30,code=sm_30

# Tesla A100 (GA100), DGX-A100, RTX 3080
# ARCH= -gencode arch=compute_80,code=[sm_80,compute_80]

# Tesla V100
# ARCH= -gencode arch=compute_70,code=[sm_70,compute_70]

# GeForce RTX 2080 Ti, RTX 2080, RTX 2070, Quadro RTX 8000, Quadro RTX 6000, Quadro RTX 5000, Tesla T4, XNOR Tensor Cores
# ARCH= -gencode arch=compute_75,code=[sm_75,compute_75]

# Jetson XAVIER
# ARCH= -gencode arch=compute_72,code=[sm_72,compute_72]

# GTX 1080, GTX 1070, GTX 1060, GTX 1050, GTX 1030, Titan Xp, Tesla P40, Tesla P4
# ARCH= -gencode arch=compute_61,code=sm_61 -gencode arch=compute_61,code=compute_61

# GP100/Tesla P100 - DGX-1
# ARCH= -gencode arch=compute_60,code=sm_60

# For Jetson TX1, Tegra X1, DRIVE CX, DRIVE PX - uncomment:
# ARCH= -gencode arch=compute_53,code=[sm_53,compute_53]

# For Jetson Tx2 or Drive-PX2 uncomment:
# ARCH= -gencode arch=compute_62,code=[sm_62,compute_62]

# For Tesla GA10x cards, RTX 3090, RTX 3080, RTX 3070, RTX A6000, RTX A40 uncomment:
# ARCH= -gencode arch=compute_86,code=[sm_86,compute_86]

[...]

What am I missing?

LeKristapino commented 3 years ago

@textcolor What is your CUDA version?

Most likely you have an older CUDA. Try using 11.0 or 11.1, then everything should work

textcolor commented 3 years ago

@LeKristapino this could be the issue. nvidia-smi reports CUDA Version: 11.1, but nvcc --version reports release 10.1, V10.1.243. This is a mismatch, or are those actually two different things?

LeKristapino commented 3 years ago

@textcolor Seems like that could be the issue. I have:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0

NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1

textcolor commented 3 years ago

@textcolor Seems like that could be the issue. I have:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0

NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1

Looks like I screwed up somehow. How exactly did you install cuda? I just installed it via ubuntu-drivers install and apt install nvidia-cuda-toolkit But apparently those produce mismatched versions. Trying now the Nvidia-preferred method by directly downloading the installer from developer.download.nvidia.com, will update if that'll do the trick

LeKristapino commented 3 years ago

@textcolor I can't really help with cuda installation instructions, since I'm using Docker containers to run both - training and inference. For my host computer I only need the correct nvidia-driver version.

Possibly you can peek in the nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04 Dockerfile. Your problem is exactly the reason why I try to use containers and do the minimum installations required on the host computer :)

mkzein commented 3 years ago

@LeKristapino Do you remember how much it would take to traing yolov4tiny on RTX3090? Thank you.

stephanecharette commented 3 years ago

Do you remember how much it would take to traing yolov4tiny on RTX3090?

YOLOv4-tiny @ 416x416 running on a RTX2070 (less than what you ask) takes anywhere from 1 hour to 31 hours to train depending on these factors: https://www.ccoderun.ca/programming/darknet_faq/#time_to_train

AlexeyAB / darknet

Training fails only if cuDNN enabled with RTX3090 #6853

Bug Overview

Reproduction

Log

Environment

Docker container on the Workstation

Workstation

Client (for investigation)

Investigation

Validation of dataset

Identification of causes