I added support for Tensor Cores, which speed up Detection and Training 3x on GPUs with CC >= 7.0

AlexeyAB commented 6 years ago

I added support for Tensor Cores, which should speed up detection and training 3x on GPUs of the Volta architecture or newer (Nvidia TITAN V, Tesla V100, ...) with CC >= 7.0, using CUDA >= 9.0 and cuDNN >= 7.0.

Tensor Cores are turned on automatically for 32-bit floats (FP32) if your GPU supports them.

For a further speedup using mixed precision (FP32+FP16), define CUDNN_HALF:

on Windows: open \darknet.sln -> (right click on project) -> properties -> C/C++ -> Preprocessor -> Preprocessor Definitions, and add at the beginning of the line: CUDNN_HALF;
on Linux: open the Makefile and set CUDNN_HALF=1: https://github.com/AlexeyAB/darknet/blob/140333977cea0ba9e384cd38fd01013a8915ef60/Makefile#L3

For training on Amazon EC2 p3.2xlarge - p3.16xlarge or on a remote DGX-2/1 server, use the flag -dont_show: ./darknet detector train data/obj.data yolo-obj.cfg darknet53.conv.74 -dont_show

Detection - Forward inference time:

Model	FP32 (Tesla V100), sec	Tensor Cores FP16/32 (Tesla V100), sec	Speedup X times
yolov3.cfg	0.031	0.011	2.8x
yolo-voc.2.0.cfg	0.02	0.0062	3.2x
tiny-yolo-voc.cfg	0.003	0.0027	1.1x

Training - Forward+Backward+Update time (batch=64 subdivisions=4 width=416 height=416 random=0):

Model	FP32 (Tesla V100), sec	Tensor Cores FP16/32 (Tesla V100), sec	Speedup X times
yolov3.cfg	1.9	1.27	1.5x
yolo-voc.2.0.cfg	0.89	0.39	2.3x
tiny-yolo-voc.cfg	0.23	0.165	1.4x

Tiny-yolo can't be accelerated much because it is very small. Yolov3 isn't accelerated more for training because it has too many layers, and in the current implementation the data transfer between layers uses FP32.
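As a quick cross-check of the "Speedup" column, the factor is simply the FP32 time divided by the Tensor Cores FP16/32 time. The sketch below recomputes it from the timings quoted in the two tables; the dictionary and function names are made up for this illustration and are not part of darknet.

```python
# Timings in seconds, copied from the detection and training tables (Tesla V100).
# Each entry is (FP32 time, Tensor Cores FP16/32 time).
detection = {
    "yolov3.cfg": (0.031, 0.011),
    "yolo-voc.2.0.cfg": (0.02, 0.0062),
    "tiny-yolo-voc.cfg": (0.003, 0.0027),
}
training = {
    "yolov3.cfg": (1.9, 1.27),
    "yolo-voc.2.0.cfg": (0.89, 0.39),
    "tiny-yolo-voc.cfg": (0.23, 0.165),
}


def speedup(fp32_sec, mixed_sec):
    """Speedup factor = FP32 time / mixed-precision time."""
    return fp32_sec / mixed_sec


def speedups(table):
    """Speedup per model, rounded to one decimal like the tables."""
    return {model: round(speedup(*times), 1) for model, times in table.items()}


for name, table in (("Detection", detection), ("Training", training)):
    for model, factor in speedups(table).items():
        print(f"{name:9s} {model:18s} {factor}x")
```

The rounded factors agree with the tables: the small tiny-yolo model gains the least, and yolov3 gains far more for detection than for training.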

For Amazon EC2 use:

p3.2xlarge (1 x Tesla V100) - p3.16xlarge DGX-2 (8 x Tesla V100 with NVLink)
Deep Learning Base AMI (Ubuntu) Version 3.0 (ami-38c87440) - Deep Learning base AMI with NVIDIA drivers and libraries such as CUDA 8 and 9, cuDNN 6 and 7, cuBLAS 8 and 9, NCCL

AlexeyAB commented 6 years ago

More about Tensor Cores: https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
More about Mixed-precision using Tensor Cores: https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/

5x performance: [image: tensor_core_cudnn_speedup]

The same precision: [image: tensor_precision]

I did a little test: I trained my model using FP32 and FP32+FP16 for 3300 iterations:

FP32: 0.43 avg loss, mAP = 11.67%, IoU = 25.52%
FP32+FP16: 0.43 avg loss, mAP = 11.46%, IoU = 26.07%

So the result is about the same.

Some information about the implementation:

Tensor Cores are used only if you use cuDNN and the cuDNN version is >= 7.0
Tensor Cores are used only for Convolution & Batch-normalization forward and backward, but aren't used for Activation, so we shouldn't use Loss Scaling.
The CUDNN_DATA_FLOAT computeType is used for the convolutional descriptor, so that accumulation is done in FP32
We keep an FP32 Master Copy of Weights and update it
By default, Tensor Cores use the *_ALGO_WINOGRAD_NONFUSED algorithms for FP32
If you use CUDNN_HALF (FP32+FP16), then Tensor Cores can use the algorithms CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM for forward, CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 for backward data, and CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 for backward filter

More info about using Tensor Cores with cuDNN: http://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor_ops
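To illustrate why accumulating into FP32 matters when the inputs are FP16, here is a small CPU-only sketch. It is not the darknet or cuDNN code path: it only emulates rounding with Python's struct module (the "e" format is IEEE 754 half precision), and the helper names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(values, round_sum):
    """Sum values, rounding the running total with round_sum after every add."""
    total = 0.0
    for v in values:
        total = round_sum(total + v)
    return total


# 10000 FP16 inputs of roughly 0.001 each, so the true sum is roughly 10.
inputs = [to_fp16(0.001)] * 10000

sum_fp16 = accumulate(inputs, to_fp16)  # accumulator kept in FP16
sum_fp32 = accumulate(inputs, to_fp32)  # accumulator kept in FP32

print("FP16 accumulator:", sum_fp16)
print("FP32 accumulator:", sum_fp32)
```

With the FP16 accumulator the running total stops growing once the addend is smaller than half the spacing between neighbouring FP16 values, so it ends far short of the true sum. The FP32 accumulator stays close to 10, which is the motivation for keeping accumulation (and the master copy of the weights) in FP32.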

Comparison of the number of Tensor Cores and regular FP32 cores: https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#Volta_series

Tesla V100 contains 80 SMs x 8 Tensor Cores x 64 FMA x 2 operations per FMA = 81 920 FP operations per clock x 1.2 GHz = 98 TFLOPS FP16/32 (Tensor Cores)
Tesla V100 contains 80 SMs x 64 FP32 cores x 2 operations per FMA = 10 240 FP operations per clock x 1.2 GHz = 12 TFLOPS FP32

Each Tensor Core processes a 4x4x4 matrix operation and performs 64 mixed-precision floating-point FMA operations per clock.
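The peak-throughput arithmetic above can be reproduced in a few lines. This is only a back-of-the-envelope sketch using the figures quoted in this comment (80 SMs, 1.2 GHz, an FMA counted as 2 operations); the function name is invented for the example.

```python
def peak_tflops(sms, units_per_sm, fma_per_unit_per_clock, clock_ghz):
    """Return (FP operations per clock, peak TFLOPS); one FMA counts as 2 operations."""
    ops_per_clock = sms * units_per_sm * fma_per_unit_per_clock * 2
    # ops per clock * GHz gives GFLOPS; divide by 1000 for TFLOPS.
    return ops_per_clock, ops_per_clock * clock_ghz / 1000.0


# Tensor Cores: 8 per SM, each doing 64 FMA per clock (a 4x4x4 matrix op).
tensor_ops, tensor_tflops = peak_tflops(80, 8, 64, 1.2)
# Regular FP32 cores: 64 per SM, each doing 1 FMA per clock.
fp32_ops, fp32_tflops = peak_tflops(80, 64, 1, 1.2)

print(f"Tensor Cores: {tensor_ops} ops/clock, {tensor_tflops:.1f} TFLOPS")
print(f"FP32 cores:   {fp32_ops} ops/clock, {fp32_tflops:.1f} TFLOPS")
print(f"Theoretical ratio: {tensor_tflops / fp32_tflops:.0f}x")
```

The theoretical ratio is 8x, which is a peak figure for matrix math only; the measured whole-network speedups in the tables are lower.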

What do the Tensor Cores do: [image]

FilipaRamos commented 6 years ago

I am not sure if this is related and it is my fault probably but I'm just gonna leave this here in case it can help somebody else. Yesterday I pulled this commit and with CUDA 9.0 and CuDNN 7.0 I enabled mixed-precision. I do not know anything about hardware so I apologize if this sounds really dumb. I restarted training yolo with my custom dataset from that and everything was running smoothly. After a couple of hours, my computer turned off on its own whilst I was having lunch. I came to realize that my nvidia gpu is not recognized anymore by the computer and is basically dead.

I am just leaving this here for future reference for people who like me do not realize the implications of the misuse of hardware :cry:

AlexeyAB commented 6 years ago

@FilipaRamos What GPU did you use? This is either a coincidence, or an overheat, or a serious bug in the drivers or nVidia hardware.

AlexeyAB commented 6 years ago

For the fastest training using mixed precision with CUDNN_HALF defined as described here: https://github.com/AlexeyAB/darknet/issues/407#issue-300029052 I recommend using only Volta GPUs (Tesla V100, TITAN V, ...).

Or you can use Amazon EC2 servers p3.* with GPU Volta Tesla V100: https://aws.amazon.com/ec2/pricing/on-demand/?nc1=h_ls


More about GPU Volta on EC2: https://aws.amazon.com/ru/blogs/aws/new-amazon-ec2-instances-with-up-to-8-nvidia-tesla-v100-gpus-p3/


FilipaRamos commented 6 years ago

@AlexeyAB My GPU is GeForce GTX 950M so yeah I know, it was yolo. It's not a software problem, it's just that I pushed the hardware too much. I've opened up my laptop and I managed to bring the gpu back to life. Still I think it won't last for too long. My gpu-z logs show that it was an overheat issue. Guess this is an opportunity to upgrade my hardware

adonishong commented 6 years ago

Appreciate your work ^_^. I have an issue here...

Training with FDDB for face detection, config file: yolo.cfg, cuda 9.1, cudnn 7.1.1 (7.1 also...), driver 390, V100. Without CUDNN_HALF, there is no problem with training. Once I enable CUDNN_HALF, I get this immediately:
Region Avg IOU: 0.000000, Class: nan, Obj: 0.000000, No Obj: 0.000000, Avg Recall: 0.000000, count: 10

AlexeyAB commented 6 years ago

@adonishong Thanks, something wrong in the source code, I'll check it.

adonishong commented 6 years ago

appreciate for your work, and looking forward to your update ^_^

AndreasAakerberg commented 6 years ago

Has anyone gotten an inference performance boost with FP16 precision on the Jetson TX2 platform? Using Cuda 9 and cuDNN 7 on the TX2 I do not see any performance difference between using an FP16 based build compared to an FP32. Does the model have to be in FP16 format in order to get a speedup?

lamerman commented 6 years ago

I have the same problem with Titan V

   30 conv    125  1 x 1 / 1    13 x  13 x1024   ->    13 x  13 x 125
   31 detection
mask_scale: Using default '1.000000'
Loading weights from weights/darknet19_448.conv.23...
 seen 32 
Done!
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Resizing
480
 try to allocate workspace = 42348544 * sizeof(float),  CUDA allocate done! 
Loaded: 0.000052 seconds
Region Avg IOU: 0.261185, Class: nan, Obj: 0.208333, No Obj: 0.250000, Avg Recall: 0.250000,  count: 12
Region Avg IOU: 0.129104, Class: nan, Obj: 0.142857, No Obj: 0.250000, Avg Recall: 0.142857,  count: 14
Region Avg IOU: 0.262299, Class: nan, Obj: 0.233333, No Obj: 0.250000, Avg Recall: 0.400000,  count: 15
Region Avg IOU: 0.282436, Class: nan, Obj: 0.338710, No Obj: 0.250000, Avg Recall: 0.161290,  count: 31
Region Avg IOU: 0.297492, Class: nan, Obj: 0.238095, No Obj: 0.250000, Avg Recall: 0.333333,  count: 21
Region Avg IOU: 0.268952, Class: nan, Obj: 0.230769, No Obj: 0.250000, Avg Recall: 0.307692,  count: 13
Region Avg IOU: 0.296595, Class: nan, Obj: 0.250000, No Obj: 0.250000, Avg Recall: 0.500000,  count: 8
Region Avg IOU: 0.212846, Class: nan, Obj: 0.153846, No Obj: 0.250000, Avg Recall: 0.307692,  count: 13

 1: -nan, -nan avg, 0.000000 rate, 4.897003 seconds, 64 images
Loaded: 0.000059 seconds
Region Avg IOU: nan, Class: nan, Obj: 0.000000, No Obj: 0.000000, Avg Recall: 0.000000,  count: 15
Region Avg IOU: nan, Class: nan, Obj: 0.000000, No Obj: 0.000000, Avg Recall: 0.000000,  count: 22
Region Avg IOU: nan, Class: nan, Obj: 0.000000, No Obj: 0.000000, Avg Recall: 0.000000,  count: 10
Region Avg IOU: nan, Class: nan, Obj: 0.000000, No Obj: 0.000000, Avg Recall: 0.000000,  count: 11
Region Avg IOU: nan, Class: nan, Obj: 0.000000, No Obj: 0.000000, Avg Recall: 0.000000,  count: 12
Region Avg IOU: nan, Class: nan, Obj: 0.000000, No Obj: 0.000000, Avg Recall: 0.000000,  count: 12

Without CUDNN_HALF everything is fine.

kanrrra commented 6 years ago

@AndreasAakerberg Did you ever figure it out? I wanted to use this for faster inference on a P100 but it's not making any difference. I checked the code and the weights are only converted to 16 bit right before they are loaded into the gpu so I would think you can use any previously trained model.

AndreasAakerberg commented 6 years ago

@kanrrra No i did not find any solutions. I am considering trying darkflow as the latest TensorFlow now has integrated TensorRT which should enable us to use FP16 and graph optimization with darknet (https://devblogs.nvidia.com/tensorrt-integration-speeds-tensorflow-inference/?ncid=so-lin-gc18tt-35202&_lrsc=8d1a9273-e1f6-41df-9edc-219bcc76b9af&ncid=so-lin-lt-798).

abagshaw commented 6 years ago

Same problem with nan showing up everywhere when training with P100 and compiling with CUDNN_HALF

AlexeyAB commented 6 years ago

@abagshaw @adonishong @lamerman @AndreasAakerberg @kanrrra @FilipaRamos

I fixed a few bugs in using Tensor Cores on Tesla V100 (Volta) GPUs for detection and training:

CUDA 9.1, cuDNN 7 for CUDA 9.1, OpenCV 3.4.0
GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 in the Makefile (OpenCV can give a ~2x speedup when training a neural network at high resolution, width=1024 height=1024 or higher, thanks to 3.5x faster data augmentation)

Detection - Forward inference time:

Model	FP32 (Tesla V100), sec	Tensor Cores FP16/32 (Tesla V100), sec	Speedup X times
yolov3.cfg	0.031	0.011	2.8x
yolo-voc.2.0.cfg	0.02	0.0062	3.2x
tiny-yolo-voc.cfg	0.003	0.0027	1.1x

Training - Forward+Backward+Update time (batch=64 subdivisions=4 width=416 height=416 random=0):

Model	FP32 (Tesla V100), sec	Tensor Cores FP16/32 (Tesla V100), sec	Speedup X times
yolov3.cfg	1.9	1.27	1.5x
yolo-voc.2.0.cfg	0.89	0.39	2.3x
tiny-yolo-voc.cfg	0.23	0.165	1.4x

Tiny-yolo can't be accelerated much because it is very small. Yolov3 isn't accelerated more for training because it has too many layers, and in the current implementation the data transfer between layers uses FP32.

I successfully trained my custom tiny model for very small objects (object sizes from 1x1 to 50x50) on Tesla V100 (Tensor Cores FP16/32) for 6000 iterations:

Avg loss is about 0.11 at iteration 6000.

detections_count = 43384, unique_truth_count = 3012
class_id = 0, name = air,        ap = 65.16 %
class_id = 1, name = bird,       ap = 60.53 %
 for thresh = 0.25, precision = 0.65, recall = 0.68, F1-score = 0.67
 for thresh = 0.25, TP = 2058, FP = 1094, FN = 954, average IoU = 44.22 %

 mean average precision (mAP) = 0.628447, or 62.84 %

Accuracy (mAP) is about the same as with FP32 cores.
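For readers who want to check the numbers in the log, precision, recall and F1 follow directly from the TP/FP/FN counts, and mAP is the mean of the per-class AP values. The sketch below uses the standard textbook formulas; it is not taken from the darknet source, and the function names are illustrative.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1-score from true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean_ap(per_class_ap):
    """Mean average precision: the plain mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


# Counts from the log at thresh = 0.25.
precision, recall, f1 = detection_metrics(tp=2058, fp=1094, fn=954)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1-score = {f1:.2f}")

# Per-class AP values (air, bird) from the log, in percent.
print(f"mAP is about {mean_ap([65.16, 60.53]):.1f} %")
```

Note that TP + FN = 2058 + 954 = 3012, which matches unique_truth_count in the log. The mAP computed from the rounded per-class values can differ from the logged 62.84 % in the last digit.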


abagshaw commented 6 years ago

@AlexeyAB Thanks for all your work on this! Too bad speedup isn't quite 5x, maybe the title of this issue should be changed 😄.

AlexeyAB commented 6 years ago

@abagshaw Yes, 4-5x only for convolutional layers; for the entire network the speedup is only 3x )
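One way to see how a 4-5x layer speedup turns into about 3x for the whole network is Amdahl's law. The sketch below is only an illustration: the fraction of run time spent in the accelerated layers is not measured anywhere in this thread, so it is solved for rather than assumed, and the function names are invented.

```python
def overall_speedup(accelerated_fraction, layer_speedup):
    """Amdahl's law: overall speedup when only part of the run time is accelerated."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / layer_speedup)


def fraction_for(overall, layer_speedup):
    """Fraction of the original run time that must be accelerated to reach `overall`."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / layer_speedup)


for layer_speedup in (4.0, 5.0):
    frac = fraction_for(3.0, layer_speedup)
    print(f"layer speedup {layer_speedup:.0f}x: about {frac:.0%} of the time "
          f"must be in accelerated layers to get "
          f"{overall_speedup(frac, layer_speedup):.1f}x overall")
```

Under this model, a 3x overall speedup with 5x faster convolutions corresponds to roughly five sixths of the original time being spent in those layers; everything that still runs at the old speed caps the total gain.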

lamerman commented 6 years ago

I tried your patched version and it seems to work 2 times faster on Titan V. Thank you very much for your work.

AlexeyAB commented 6 years ago

Accelerated by another 5% using mixed-precision FP16/32 Batch-normalization for training on Tensor Cores. Loss (0.11) and mAP ~= the same.

detections_count = 47229, unique_truth_count = 3012
class_id = 0, name = air,        ap = 67.74 %
class_id = 1, name = bird,       ap = 66.08 %
 for thresh = 0.25, precision = 0.67, recall = 0.72, F1-score = 0.70
 for thresh = 0.25, TP = 2177, FP = 1075, FN = 835, average IoU = 45.44 %

 mean average precision (mAP) = 0.669062, or 66.91 %

[image: loss chart]

adonishong commented 6 years ago

thanks for your work ^_^ @AlexeyAB

kmsravindra commented 6 years ago

Thanks @AlexeyAB. Can I train the model on a GTX 1080 Ti and do inference on a Titan V? (I need to predict on HD in real time, though I can compromise on training time, hence the question.)

AlexeyAB commented 6 years ago

@kmsravindra Yes. You can train on the GTX 1080 Ti with CUDNN_HALF=0 and then use this weights file for detection on Titan V with CUDNN_HALF=1.

kmsravindra commented 6 years ago

Thank you!

315386775 commented 6 years ago

@AlexeyAB Thanks for your work. I got a 2x speedup on a Tesla V100 (Volta) GPU, but with nan showing up everywhere. With your version that fixes a few bugs, I cannot compile successfully.

CUDA 9.0, cuDNN 7 for CUDA 9.0, OpenCV 2.4.13

Must I use CUDA 9.1 and OpenCV 3.4.0?

AlexeyAB commented 6 years ago

@315386775

> With your version that fixes a few bugs, I cannot compile successfully.

So you can't compile the latest version of this repository, is that right?

Yes, you should use CUDA 9.1 and OpenCV 3.4.0.

kmsravindra commented 5 years ago

@AlexeyAB , As per this https://en.wikipedia.org/wiki/GeForce_20_series,

RTX 2080Ti seems to give an extra processing boost of GFLOPS using half-precision - 23500 (26896) compared to single precision.

Is there any setting that I can use to run yolov3 with weights converted to half precision, just like I did for INT8 in yolo2_light? Will the INT8 version on the RTX 2080Ti also speed up the FPS compared to the GTX 1080Ti?

AlexeyAB commented 5 years ago

@kmsravindra

You can use this repository with GPU=1 CUDNN=1 CUDNN_HALF=1 to use Tensor Cores FP16 on RTX 2080 Ti
I didn't add an implicit implementation of FP16 to https://github.com/AlexeyAB/yolo2_light, but I added a flag for automatic use of Tensor Cores with automatic FP32->FP16->FP32 conversion, if you use cuDNN >= 7.2

Also, to achieve the highest speed (~220 TOPS with INT8x32 on Tensor Cores), you can try un-commenting these lines:

But as described here, INT8x32 is supported only on CC 7.2 Xavier (not higher or lower): https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips

Note: Note that this CUDNN_DATA_INT8x32 is only supported by sm_72.

Also, I get a run-time error when I try to create any cuDNN descriptor for CUDNN_DATA_INT8x32: https://github.com/AlexeyAB/yolo2_light/issues/18#issuecomment-432452312


kmsravindra commented 5 years ago

@AlexeyAB , Thank you very much

alexanderfrey commented 5 years ago

I currently have an 1080ti which does 24FPS with YOLOv3. Does the 2080ti really do 3 times as much with Tensorcores when you say 3x speedup ?

billmguo commented 5 years ago

@AlexeyAB Any progress on CUDNN_DATA_INT8x32? I checked with Nvidia, and they claim that sm_75 supports Tensor Core INT8. For Tensor Core INT8, the data type should be CUDNN_DATA_INT8x32 and the filter format should be CUDNN_TENSOR_NCHW_VECT_C. Tensor Core INT8 requires that the Channel/Kernel size be aligned by 32. The cuDNN sample code in conv_sample.cpp does pass with INT8x32.

AlexeyAB commented 5 years ago

@billmguo

> Tensor Core INT8 requires that the Channel/Kernel size be aligned by 32. The cuDNN sample code in conv_sample.cpp does pass with INT8x32.

As far as I know, only the channel count must be aligned by 32, not kernel_size, since it is usually 1x1 - 3x3. Can you quote the doc or source code where it is stated that "Kernel size must be aligned by 32"?

I tried it here, but I can't even create a cuDNN descriptor with INT8x32 and filters=128, channels=128, kernel_w=3, kernel_h=3: https://github.com/AlexeyAB/yolo2_light/issues/18#issuecomment-433146751
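The "aligned by 32" requirement discussed here is plain integer arithmetic on the channel count. The following sketch only shows that arithmetic (an alignment check and rounding up to the next multiple of 32); it makes no claim about what cuDNN accepts, and the helper names are hypothetical.

```python
def is_aligned(channels, alignment=32):
    """True if the channel count is a multiple of the alignment."""
    return channels % alignment == 0


def pad_channels(channels, alignment=32):
    """Smallest multiple of `alignment` that is >= channels (ceiling division)."""
    return -(-channels // alignment) * alignment


# 128 is the channel/filter count from the failing test above; 125 is the
# filter count of the last conv layer in the training log earlier in this thread.
for channels in (128, 125, 3):
    print(f"channels={channels:4d} aligned={is_aligned(channels)} "
          f"padded={pad_channels(channels)}")
```

Since 128 is already a multiple of 32, the descriptor failure reported above is not explained by channel alignment alone.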

billmguo commented 5 years ago

Hi Alexey

Nvidia confirms this is a known issue for 4DDescriptor. Please use NdDescriptor for vectorized data input.

Thanks/Min


kmsravindra commented 5 years ago

> @kmsravindra Yes. You can train on the GTX 1080 Ti with CUDNN_HALF=0 and then use this weights file for detection on Titan V with CUDNN_HALF=1.

Hi Alexey, A quick question...Can we do the other way round? Train a model with CUDNN_HALF=1 and use that model to predict on a different machine with CUDNN_HALF=0 ?

thanks!

AlexeyAB commented 5 years ago

@kmsravindra Yes.

phexic commented 4 years ago

@AlexeyAB Does "Forward inference time" include postprocessing time? I tested on my custom dataset (100 images), and the results did not show any speedup. The GPU is a Tesla V100, with CUDA 9.1 and cuDNN 7.1.4.

$ CUDA_VISIBLE_DEVICES=3 python test.py
Try to load cfg: ./cfg/yolov3-voc.cfg, weights: backup416/yolov3-voc_8000.weights, clear = 0
compute_capability = 700, cudnn_half = 1

AlexeyAB commented 4 years ago

@phexic

> Does "Forward inference time" include postprocessing time?

No.

Pre- and post-processing time isn't included, because image loading and saving the resulting image may take too much time and are CPU- and HDD-bound.
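To compare timings fairly, it helps to measure the forward pass separately from loading and saving. The sketch below shows one way to split the stages with time.perf_counter; load_image, forward and save_result are hypothetical stand-ins written for this example, not darknet functions.

```python
import time


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical stand-ins for the three stages of a detection run.
def load_image(path):
    return [0.0] * 16          # pre-processing: read and resize the image


def forward(image):
    return [sum(image)]        # forward inference: the network itself


def save_result(detections):
    return len(detections)     # post-processing: draw boxes, write the file


def run_once(path):
    """Time each stage separately so only 'forward' is compared across builds."""
    image, t_pre = timed(load_image, path)
    detections, t_forward = timed(forward, image)
    _, t_post = timed(save_result, detections)
    return {"pre": t_pre, "forward": t_forward, "post": t_post}


print(run_once("example.jpg"))
```

If only end-to-end time per image is measured, a faster forward pass can be hidden behind disk and CPU work. When the forward pass runs on a GPU, the call also has to finish (synchronize) before the second clock reading, otherwise the measured time is too short.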

phexic commented 4 years ago

@AlexeyAB Thank you for your explanation. Now I know that pre- and post-processing time is not included in your test time. However, performance is not accelerated at all in my experiments. Is that normal?

AlexeyAB commented 4 years ago

@phexic Look: https://github.com/AlexeyAB/darknet/issues/2365#issuecomment-462923756

dong7654 commented 4 years ago

I need to speed up yolo v3 inference using FP16 operations in an FPGA implementation. Do I need to train a new weights file for half precision, or can I just load the official yolov3.weights file and convert the weights from FP32 to FP16 in memory? (Saving file storage is not my concern.)

If I need to train an FP16 version of the weights file, can I get it just by training with CUDNN_HALF=1?

AlexeyAB commented 4 years ago

> Can I just load the official yolov3.weights file and convert the weights from FP32 to FP16 in memory?

Yes, you can do it.
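As a rough idea of what an in-memory FP32 to FP16 conversion looks like, here is a CPU-only sketch using Python's struct module ("e" is the IEEE 754 half-precision format). It does not parse the real yolov3.weights file; the short list of floats is a made-up stand-in for weights that were already loaded as FP32, and the helper names are hypothetical.

```python
import struct


def fp32_to_fp16_bytes(weights):
    """Pack a sequence of floats as little-endian half-precision values (2 bytes each)."""
    return struct.pack(f"<{len(weights)}e", *weights)


def fp16_bytes_to_floats(buf):
    """Unpack little-endian half-precision bytes back into Python floats."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))


# Made-up example values standing in for FP32 weights already in memory.
weights = [0.1, -0.5, 3.14159, 1e-3]

packed = fp32_to_fp16_bytes(weights)
restored = fp16_bytes_to_floats(packed)

max_rel_err = max(abs(r - w) / abs(w) for r, w in zip(restored, weights))
print("bytes per weight:", len(packed) // len(weights))
print("restored:", restored)
print("max relative error:", max_rel_err)
```

The conversion halves the storage per weight and introduces a small rounding error on each value. Values outside the FP16 range (magnitude above 65504) cannot be represented, and struct.pack raises OverflowError for them, so a real converter would need to check the weight range first.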

Ashwin-Ramesh2607 commented 4 years ago

@AlexeyAB Hey, when I am training using Tensor Cores, it prints the following message: "Tensor Cores are disabled until the first 3000 iterations are reached." Since I am training the model for only 3000 iterations, I would like to enable Tensor Cores before 3000 iterations. How do you suggest I do that?

AlexeyAB commented 4 years ago

According to the Readme, you shouldn't train for just 3000 iterations.
