I'm facing the exact same issue, also running on an RTX 3090. It shouldn't be a Docker-related issue, since cuDNN works fine on AWS GPU-accelerated instances (using Tesla cards) through Docker.
I tested cuDNN with a TITAN X and it worked well. Maybe darknet does not fully support the RTX 3090...
I tried CUDNN=1 CUDNN_HALF=1 with cuDNN 8.0.4, CUDA 11.1, and OpenCV 4.5.
Comment out the compute_30 line and add -gencode arch=compute_86,code=[sm_86,compute_86].
The build is OK!
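In Makefile terms, a minimal sketch of that change (the compute_86 line is the same one suggested further down in this thread; CUDA 11.x no longer accepts compute_30, so that entry has to stay commented out):

# Kepler (compute_30) was dropped in CUDA 11.x, so keep this commented out:
# ARCH= -gencode arch=compute_30,code=sm_30

# GeForce RTX 3070, 3080, 3090 (Ampere, sm_86):
ARCH= -gencode arch=compute_86,code=[sm_86,compute_86]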
@takashide Thanks for the solution. Works great on the CUDA 11.1 cuDNN Docker image with your suggested modifications.
Thank you @takashide 👍
Do you have inference performance metrics for the RTX 3090?
In my experience, it's more than twice as fast as a 1080 Ti.
@AlgirdasKartavicius I recently compared an RTX 3090 against an MSI GS65 laptop running an RTX 2070.
Inference: RTX 3090: 0.047 seconds, RTX 2070 laptop card: 0.11 seconds.
I'm planning to compare it with the 3070, 3080, and any other NVIDIA cards I can get my hands on, since it's hard to find good comparisons for deep learning, and for YOLO specifically.
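For anyone who wants to reproduce this kind of timing, darknet prints the per-image prediction time; a minimal sketch using the stock sample files (swap in your own cfg, weights, and image):

./darknet detector test cfg/coco.data cfg/yolov4.cfg yolov4.weights data/dog.jpg
# the output ends with a line roughly like:
# data/dog.jpg: Predicted in 47.000000 milli-seconds.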
Unfortunately, I can't seem to get this to work. Whenever I try to make the project, I get Unsupported gpu architecture 'compute_86' (on a 3090; the previous 2080 Ti worked just fine). My current Makefile looks like this:
GPU=1
CUDNN=1
CUDNN_HALF=0
OPENCV=0
AVX=0
OPENMP=0
LIBSO=0
ZED_CAMERA=0
ZED_CAMERA_v2_8=0
# set GPU=1 and CUDNN=1 to speedup on GPU
# set CUDNN_HALF=1 to further speedup 3 x times (Mixed-precision on Tensor Cores) GPU: Volta, Xavier, Turing and higher
# set AVX=1 and OPENMP=1 to speedup on CPU (if error occurs then set AVX=0)
# set ZED_CAMERA=1 to enable ZED SDK 3.0 and above
# set ZED_CAMERA_v2_8=1 to enable ZED SDK 2.X
USE_CPP=0
DEBUG=0
ARCH= -gencode arch=compute_35,code=sm_35 \
-gencode arch=compute_50,code=[sm_50,compute_50] \
-gencode arch=compute_52,code=[sm_52,compute_52] \
-gencode arch=compute_61,code=[sm_61,compute_61]
OS := $(shell uname)
# GeForce RTX 3070, 3080, 3090
ARCH= -gencode arch=compute_86,code=[sm_86,compute_86]
# Kepler GeForce GTX 770, GTX 760, GT 740
# ARCH= -gencode arch=compute_30,code=sm_30
# Tesla A100 (GA100), DGX-A100, RTX 3080
# ARCH= -gencode arch=compute_80,code=[sm_80,compute_80]
# Tesla V100
# ARCH= -gencode arch=compute_70,code=[sm_70,compute_70]
# GeForce RTX 2080 Ti, RTX 2080, RTX 2070, Quadro RTX 8000, Quadro RTX 6000, Quadro RTX 5000, Tesla T4, XNOR Tensor Cores
# ARCH= -gencode arch=compute_75,code=[sm_75,compute_75]
# Jetson XAVIER
# ARCH= -gencode arch=compute_72,code=[sm_72,compute_72]
# GTX 1080, GTX 1070, GTX 1060, GTX 1050, GTX 1030, Titan Xp, Tesla P40, Tesla P4
# ARCH= -gencode arch=compute_61,code=sm_61 -gencode arch=compute_61,code=compute_61
# GP100/Tesla P100 - DGX-1
# ARCH= -gencode arch=compute_60,code=sm_60
# For Jetson TX1, Tegra X1, DRIVE CX, DRIVE PX - uncomment:
# ARCH= -gencode arch=compute_53,code=[sm_53,compute_53]
# For Jetson Tx2 or Drive-PX2 uncomment:
# ARCH= -gencode arch=compute_62,code=[sm_62,compute_62]
# For Tesla GA10x cards, RTX 3090, RTX 3080, RTX 3070, RTX A6000, RTX A40 uncomment:
# ARCH= -gencode arch=compute_86,code=[sm_86,compute_86]
[...]
What am I missing?
@textcolor What is your CUDA version?
Most likely you have an older CUDA version. Try using 11.0 or 11.1; then everything should work.
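A quick way to check which toolkit the build will actually pick up (nvidia-smi only reports the highest CUDA version the installed driver supports, not the toolkit on your PATH):

which nvcc       # which toolkit's compiler the Makefile will use
nvcc --version   # the installed toolkit release, e.g. 10.1 vs 11.1
nvidia-smi       # driver version and the CUDA version the driver supports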
@LeKristapino This could be the issue. nvidia-smi reports CUDA Version: 11.1, but nvcc --version reports release 10.1, V10.1.243. Is this a mismatch, or are those actually two different things?
@textcolor Seems like that could be the issue. I have:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0
NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1
Looks like I screwed up somehow. How exactly did you install CUDA?
I just installed it via
ubuntu-drivers install
and
apt install nvidia-cuda-toolkit
But apparently those produce mismatched versions. I'm now trying the NVIDIA-preferred method of downloading the installer directly from developer.download.nvidia.com; I'll update if that does the trick.
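For the record, the runfile route usually looks roughly like the sketch below; the exact installer file name should be copied from the CUDA downloads page (the 11.1.0 name here is an assumption), and the distro toolkit should be removed first so the two don't conflict:

sudo apt remove nvidia-cuda-toolkit   # drop the mismatched distro toolkit
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda_11.1.0_455.23.05_linux.run
sudo sh cuda_11.1.0_455.23.05_linux.run --silent --toolkit   # toolkit only, keep the existing driver
echo 'export PATH=/usr/local/cuda-11.1/bin:$PATH' >> ~/.bashrc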
@textcolor I can't really help with CUDA installation instructions, since I'm using Docker containers to run both training and inference. On my host computer I only need the correct nvidia-driver version.
You could possibly peek into the nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04 Dockerfile. Your problem is exactly the reason why I try to use containers and do the minimum of installations required on the host computer :)
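A rough sketch of that container workflow, assuming the NVIDIA Container Toolkit is already installed on the host (paths and the clone URL are the usual ones, adjust as needed):

docker run --gpus all -it --rm -v "$PWD":/work nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04 bash
# inside the container:
apt-get update && apt-get install -y build-essential git
git clone https://github.com/AlexeyAB/darknet && cd darknet
# set GPU=1, CUDNN=1 and the sm_86 ARCH line in the Makefile, then:
make -j"$(nproc)"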
@LeKristapino Do you remember how long it took to train YOLOv4-tiny on the RTX 3090? Thank you.
YOLOv4-tiny at 416x416 running on an RTX 2070 (a slower card than the one you ask about) takes anywhere from 1 hour to 31 hours to train, depending on these factors: https://www.ccoderun.ca/programming/darknet_faq/#time_to_train
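For reference, a typical YOLOv4-tiny training invocation on a custom dataset looks roughly like this (data/obj.data and the -custom cfg are placeholders for your own files; yolov4-tiny.conv.29 is the usual pre-trained backbone):

./darknet detector train data/obj.data cfg/yolov4-tiny-custom.cfg yolov4-tiny.conv.29 -map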
Bug Overview
Training reports avg loss = nan whenever darknet is built with CUDNN=1.
Reproduction
Log
see here for full log
Environment
Docker container on the Workstation
Workstation
Client (for investigation)
Investigation
Validation of dataset
Apart from the workstation, I installed darknet using vcpkg on the client (Windows) and trained the same dataset with the GPU. It worked well and the trained model seems to be fine, so I think there is nothing wrong with the dataset.
Identification of causes
I changed the compile options and tested:
CUDNN=0 CUDNN_HALF=0: fine (of course it's slow, since cuDNN is disabled)
CUDNN=1 CUDNN_HALF=0: bug occurs
CUDNN=1 CUDNN_HALF=1: bug occurs
Thus, I concluded this bug occurs only if darknet is built with CUDNN=1.
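For reproducibility, a minimal sketch of how each combination can be rebuilt (make variables given on the command line override the Makefile defaults; the training command itself is the same in every case):

make clean && make GPU=1 CUDNN=0 CUDNN_HALF=0 -j"$(nproc)"   # trains fine, just slow
make clean && make GPU=1 CUDNN=1 CUDNN_HALF=0 -j"$(nproc)"   # avg loss = nan
make clean && make GPU=1 CUDNN=1 CUDNN_HALF=1 -j"$(nproc)"   # avg loss = nan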
Also, I tested several CUDA versions:
CUDA 10.0: bug occurs
CUDA 10.1: bug occurs
CUDA 10.2: bug occurs
CUDA 11.0: compile fails with Unsupported gpu architecture 'compute_30'
So the CUDA version does not matter; the compile failure with CUDA 11.0 is a separate issue.
And then, I tested some older darknet versions. I'm sorry that I forget the exact versions I tested, but it still failed with a version from around 2020-03, so I don't think this is caused by a recent change. This might be a compatibility issue with the RTX 3090, Docker, or cuDNN 7.6.5.