Open AlexeyAB opened 6 years ago
More about Tensor Cores: https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
More about Mixed-precision using Tensor Cores: https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/
5x performance:
The same precision:
I did a little test - just trained my model using FP32 and FP32+FP16 - 3300 iterations:
Some information about implementation:
Loss Scaling.
CUDNN_DATA_FLOAT computeType for the convolutional descriptor, to use accumulation into FP32.
*_ALGO_WINOGRAD_NONFUSED for FP32.
With CUDNN_HALF (FP32+16), Tensor Cores can use the algorithms CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM for forward, CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 for backward, and CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 for backward filter.
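To illustrate why loss scaling matters for FP16, here is a minimal sketch in plain Python. It is not the darknet implementation: the scale factor `LOSS_SCALE` and the helper `to_fp16` are hypothetical names chosen for this example, and the half-precision rounding is emulated with the standard `struct` module.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

LOSS_SCALE = 1024.0  # hypothetical scale factor, for illustration only

grad = 1e-8  # smaller than the smallest FP16 subnormal (about 6e-8)

# Without scaling, the gradient underflows to zero when stored in FP16.
unscaled_fp16 = to_fp16(grad)

# With scaling, the value survives FP16 storage; dividing by the scale
# in higher precision recovers its magnitude.
scaled_fp16 = to_fp16(grad * LOSS_SCALE)
recovered = scaled_fp16 / LOSS_SCALE

print(unscaled_fp16)  # the unscaled gradient is lost
print(recovered)      # close to the original 1e-8
```

The same idea applies to a whole loss value: multiply the loss before the backward pass, then divide the gradients by the same factor before the weight update.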
More info about using Tensor Cores with cuDNN: http://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor_ops
Comparison of the number of Tensor Cores and common Float-32 cores: https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#Volta_series
Tesla V100 contains 80 SMs x 8 Tensor Cores x 64 Float-16/32 Cores x 2 (FMA) = 81 920 FP-operations per clock; x 1.2 GHz = 98 Tflops-fp16/32 (Tensor Cores)
Tesla V100 contains 80 SMs x 64 Float-32 Cores x 2 (FMA) = 10 240 FP-operations per clock; x 1.2 GHz = 12 Tflops-fp32
Each Tensor Core provides a 4x4x4 matrix processing and performs 64 floating point FMA mixed-precision operations per clock.
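The peak-throughput arithmetic above can be re-derived with a few lines of Python. This is only a check of the numbers quoted in this comment (80 SMs, 1.2 GHz, an FMA counted as 2 operations); the variable names are made up for the example.

```python
SMS = 80                  # streaming multiprocessors on Tesla V100
CLOCK_HZ = 1.2e9          # clock used in the comment above
OPS_PER_FMA = 2           # one fused multiply-add counts as 2 FP operations

# Tensor Cores: 8 per SM, each doing 64 FMAs per clock (4x4x4 matrix op).
tensor_ops_per_clock = SMS * 8 * 64 * OPS_PER_FMA
# Regular FP32 cores: 64 per SM, each doing 1 FMA per clock.
fp32_ops_per_clock = SMS * 64 * OPS_PER_FMA

tensor_tflops = tensor_ops_per_clock * CLOCK_HZ / 1e12
fp32_tflops = fp32_ops_per_clock * CLOCK_HZ / 1e12

print(tensor_ops_per_clock, round(tensor_tflops, 1))
print(fp32_ops_per_clock, round(fp32_tflops, 1))
print(tensor_ops_per_clock // fp32_ops_per_clock)  # theoretical peak ratio
```

The theoretical peak ratio is 8x, which is an upper bound; measured speedups for whole networks are lower.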
What do the Tensor Cores do:
I am not sure if this is related and it is my fault probably but I'm just gonna leave this here in case it can help somebody else. Yesterday I pulled this commit and with CUDA 9.0 and CuDNN 7.0 I enabled mixed-precision. I do not know anything about hardware so I apologize if this sounds really dumb. I restarted training yolo with my custom dataset from that and everything was running smoothly. After a couple of hours, my computer turned off on its own whilst I was having lunch. I came to realize that my nvidia gpu is not recognized anymore by the computer and is basically dead.
I am just leaving this here for future reference for people who like me do not realize the implications of the misuse of hardware :cry:
@FilipaRamos What GPU did you use? This is either a coincidence, or an overheat, or a serious bug in the drivers or nVidia hardware.
For the fastest training using mixed-precision with CUDNN_HALF defined, as described here: https://github.com/AlexeyAB/darknet/issues/407#issue-300029052
I recommend using only Volta GPUs (Tesla V100, TITAN V, ...).
Or you can use Amazon EC2 p3.* servers with Volta Tesla V100 GPUs: https://aws.amazon.com/ec2/pricing/on-demand/?nc1=h_ls
More about GPU Volta on EC2: https://aws.amazon.com/ru/blogs/aws/new-amazon-ec2-instances-with-up-to-8-nvidia-tesla-v100-gpus-p3/
@AlexeyAB My GPU is GeForce GTX 950M so yeah I know, it was yolo. It's not a software problem, it's just that I pushed the hardware too much. I've opened up my laptop and I managed to bring the gpu back to life. Still I think it won't last for too long. My gpu-z logs show that it was an overheat issue. Guess this is an opportunity to upgrade my hardware
I appreciate your work ^_^. I have an issue here...
Training with FDDB for face detection; config file: yolo.cfg, CUDA 9.1, cuDNN 7.1.1 (also 7.1), driver 390, V100.
Without CUDNN_HALF, there is no problem with training. Once I enable CUDNN_HALF, I get this immediately:
Region Avg IOU: 0.000000, Class: nan, Obj: 0.000000, No Obj: 0.000000, Avg Recall: 0.000000, count: 10
@adonishong Thanks, something wrong in the source code, I'll check it.
appreciate for your work, and looking forward to your update ^_^
Has anyone gotten an inference performance boost with FP16 precision on the Jetson TX2 platform? Using Cuda 9 and cuDNN 7 on the TX2 I do not see any performance difference between using an FP16 based build compared to an FP32. Does the model have to be in FP16 format in order to get a speedup?
I have the same problem with Titan V
30 conv 125 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 125
31 detection
mask_scale: Using default '1.000000'
Loading weights from weights/darknet19_448.conv.23...
seen 32
Done!
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Resizing
480
try to allocate workspace = 42348544 * sizeof(float), CUDA allocate done!
Loaded: 0.000052 seconds
Region Avg IOU: 0.261185, Class: nan, Obj: 0.208333, No Obj: 0.250000, Avg Recall: 0.250000, count: 12
Region Avg IOU: 0.129104, Class: nan, Obj: 0.142857, No Obj: 0.250000, Avg Recall: 0.142857, count: 14
Region Avg IOU: 0.262299, Class: nan, Obj: 0.233333, No Obj: 0.250000, Avg Recall: 0.400000, count: 15
Region Avg IOU: 0.282436, Class: nan, Obj: 0.338710, No Obj: 0.250000, Avg Recall: 0.161290, count: 31
Region Avg IOU: 0.297492, Class: nan, Obj: 0.238095, No Obj: 0.250000, Avg Recall: 0.333333, count: 21
Region Avg IOU: 0.268952, Class: nan, Obj: 0.230769, No Obj: 0.250000, Avg Recall: 0.307692, count: 13
Region Avg IOU: 0.296595, Class: nan, Obj: 0.250000, No Obj: 0.250000, Avg Recall: 0.500000, count: 8
Region Avg IOU: 0.212846, Class: nan, Obj: 0.153846, No Obj: 0.250000, Avg Recall: 0.307692, count: 13
1: -nan, -nan avg, 0.000000 rate, 4.897003 seconds, 64 images
Loaded: 0.000059 seconds
Region Avg IOU: nan, Class: nan, Obj: 0.000000, No Obj: 0.000000, Avg Recall: 0.000000, count: 15
Region Avg IOU: nan, Class: nan, Obj: 0.000000, No Obj: 0.000000, Avg Recall: 0.000000, count: 22
Region Avg IOU: nan, Class: nan, Obj: 0.000000, No Obj: 0.000000, Avg Recall: 0.000000, count: 10
Region Avg IOU: nan, Class: nan, Obj: 0.000000, No Obj: 0.000000, Avg Recall: 0.000000, count: 11
Region Avg IOU: nan, Class: nan, Obj: 0.000000, No Obj: 0.000000, Avg Recall: 0.000000, count: 12
Region Avg IOU: nan, Class: nan, Obj: 0.000000, No Obj: 0.000000, Avg Recall: 0.000000, count: 12
Without CUDNN_HALF everything is fine.
@AndreasAakerberg Did you ever figure it out? I wanted to use this for faster inference on a P100 but it's not making any difference. I checked the code and the weights are only converted to 16 bit right before they are loaded into the gpu so I would think you can use any previously trained model.
@kanrrra No i did not find any solutions. I am considering trying darkflow as the latest TensorFlow now has integrated TensorRT which should enable us to use FP16 and graph optimization with darknet (https://devblogs.nvidia.com/tensorrt-integration-speeds-tensorflow-inference/?ncid=so-lin-gc18tt-35202&_lrsc=8d1a9273-e1f6-41df-9edc-219bcc76b9af&ncid=so-lin-lt-798).
Same problem with nan showing up everywhere when training with P100 and compiling with CUDNN_HALF
@abagshaw @adonishong @lamerman @AndreasAakerberg @kanrrra @FilipaRamos
I fixed a few bugs to use Tensor Cores on GPU Tesla V100 (Volta) for detection and training:
CUDA 9.1, cuDNN 7 for CUDA 9.1, OpenCV 3.4.0
GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 in the Makefile
(OpenCV can give a ~2x speedup when training a neural network with high resolution, width=1024 height=1024 or higher, due to 3.5x faster data augmentation using OpenCV)
Detection - Forward inference time:
Model | FP32 (Tesla V100), sec | Tensor Cores FP16/32 (Tesla V100), sec | Speedup X times |
---|---|---|---|
yolov3.cfg | 0.031 | 0.011 | 2.8x |
yolo-voc.2.0.cfg | 0.02 | 0.0062 | 3.2x |
tiny-yolo-voc.cfg | 0.003 | 0.0027 | 1.1x |
Training (batch=64 subdivision=4 width=416 height=416 random=0):
Model | FP32 (Tesla V100), sec | Tensor Cores FP16/32 (Tesla V100), sec | Speedup X times |
---|---|---|---|
yolov3.cfg | 1.9 | 1.27 | 1.5x |
yolo-voc.2.0.cfg | 0.89 | 0.39 | 2.3x |
tiny-yolo-voc.cfg | 0.23 | 0.165 | 1.4x |
Tiny-yolo cannot be accelerated much, because it is a very small model. Yolov3 isn't accelerated more for training because it has too many layers, and in the current implementation data transfer between layers uses FP32.
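As a sanity check, the speedup columns in the two tables above can be recomputed from the timing columns. The snippet below is only a sketch that copies the timings from the tables; the dictionary and function names are made up for the example.

```python
# (FP32 seconds, Tensor Cores FP16/32 seconds), copied from the tables above.
detection = {
    "yolov3.cfg": (0.031, 0.011),
    "yolo-voc.2.0.cfg": (0.02, 0.0062),
    "tiny-yolo-voc.cfg": (0.003, 0.0027),
}
training = {
    "yolov3.cfg": (1.9, 1.27),
    "yolo-voc.2.0.cfg": (0.89, 0.39),
    "tiny-yolo-voc.cfg": (0.23, 0.165),
}

def speedups(times):
    """Return FP32 time divided by mixed-precision time, rounded to 0.1."""
    return {model: round(fp32 / mixed, 1) for model, (fp32, mixed) in times.items()}

detection_speedup = speedups(detection)
training_speedup = speedups(training)

for model in detection:
    print(model, detection_speedup[model], training_speedup[model])
```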
I successfully trained on Tesla V100 (Tensor Cores FP16/32) my custom tiny-model for very small objects (size of objects 1x1 - 50x50) 6000 iterations:
Avg loss is about ~= 0.11 on 6000 iteration.
detections_count = 43384, unique_truth_count = 3012
class_id = 0, name = air, ap = 65.16 %
class_id = 1, name = bird, ap = 60.53 %
for thresh = 0.25, precision = 0.65, recall = 0.68, F1-score = 0.67
for thresh = 0.25, TP = 2058, FP = 1094, FN = 954, average IoU = 44.22 %
mean average precision (mAP) = 0.628447, or 62.84 %
Accuracy (mAP) is approximately the same as with FP32 cores.
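For readers who want to see how the numbers in this log relate to each other, here is a small sketch that recomputes precision, recall and F1 from the TP/FP/FN counts, and the mean of the two per-class AP values. It uses the standard definitions, not darknet's source code; the function name is hypothetical.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Standard precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts reported at thresh = 0.25 in the log above.
precision, recall, f1 = detection_metrics(tp=2058, fp=1094, fn=954)

# mAP is the mean of the per-class AP values. The per-class values in the
# log are already rounded, so the mean can differ in the last digit.
mean_ap = (65.16 + 60.53) / 2

print(round(precision, 2), round(recall, 2), round(f1, 2))
print(mean_ap)
```

Note that TP + FN = 3012, which matches unique_truth_count in the log.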
@AlexeyAB Thanks for all your work on this! Too bad speedup isn't quite 5x, maybe the title of this issue should be changed 😄.
@abagshaw Yes, 4-5x times only for Convolutional layer, but for the entire network only 3x times speedup )
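The gap between the per-layer and whole-network speedup is what Amdahl's law predicts. The sketch below is an illustration only: the fraction of time spent in convolutional layers is not measured anywhere in this thread, so `conv_fraction` is a hypothetical parameter.

```python
def overall_speedup(conv_fraction: float, conv_speedup: float) -> float:
    """Amdahl's law: only the convolutional part of the runtime gets faster."""
    return 1.0 / ((1.0 - conv_fraction) + conv_fraction / conv_speedup)

def conv_fraction_for(overall: float, conv_speedup: float) -> float:
    """Solve Amdahl's law for the fraction of time spent in convolutions."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / conv_speedup)

# If convolutions run 5x faster and the whole network runs 3x faster,
# what share of the original runtime would convolutions have had?
fraction = conv_fraction_for(overall=3.0, conv_speedup=5.0)
print(round(fraction, 3))
print(overall_speedup(fraction, 5.0))
```

Under these assumed numbers, convolutions would account for roughly 83% of the FP32 runtime; the remaining layers and FP32 data transfers limit the total gain.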
I tried your patched version and it seems to work 2 times faster on Titan V. Thank you very much for your work.
Accelerated by another 5% using mixed-precision FP16/32 Batch-normalization for training on Tensor Cores. Loss (0.11) and mAP ~= the same.
detections_count = 47229, unique_truth_count = 3012
class_id = 0, name = air, ap = 67.74 %
class_id = 1, name = bird, ap = 66.08 %
for thresh = 0.25, precision = 0.67, recall = 0.72, F1-score = 0.70
for thresh = 0.25, TP = 2177, FP = 1075, FN = 835, average IoU = 45.44 %
mean average precision (mAP) = 0.669062, or 66.91 %
thanks for you work ^_^ @AlexeyAB
Thanks @AlexeyAB. Can I train the model on GTX 1080 Ti and do inference using Titan V ( I have this need to predict on HD at real time though I can compromise on my training time and hence the question.)
@kmsravindra Yes.
You can train on the GTX 1080 Ti with CUDNN_HALF=0
And then use this weights-file for detection on Titan V with CUDNN_HALF=1
Thank you!
@AlexeyAB Thanks for your work. I got a 2x speedup with GPU Tesla V100 (Volta), but with nan showing up everywhere. I cannot compile your version with the bug fixes.
Must I use CUDA 9.1 and OpenCV 3.4.0?
@315386775
I cannot compile your version with the bug fixes.
So you can't compile the latest version of this repository, is that right?
Yes, you should use CUDA 9.1 and OpenCV 3.4.0.
@AlexeyAB , As per this https://en.wikipedia.org/wiki/GeForce_20_series,
RTX 2080Ti seems to give an extra processing boost of GFLOPS using half-precision - 23500 (26896) compared to single precision.
Is there any setting that I can use to run yolov3 with half-precision converted weights, just like I did for INT8 with yolo2_light? Will the INT8 version on RTX 2080Ti also speed up the FPS compared to GTX 1080Ti?
@kmsravindra
You can use this repository with GPU=1 CUDNN=1 CUDNN_HALF=1 to use Tensor Cores FP16 on RTX 2080 Ti
I didn't add implicit implementation of FP16 to https://github.com/AlexeyAB/yolo2_light, but I added a flag for automatic use of Tensor Cores with automatic conversion FP32->FP16->FP32, if you use cuDNN >= 7.2
Also, to achieve the highest speed (~220 Tops, INT8x32 on Tensor Cores), you can try un-commenting these lines:
But as described here, INT8x32 is supported only on CC 7.2 Xavier (not higher or lower) https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips
Note: Note that this CUDNN_DATA_INT8x32 is only supported by sm_72.
Also I get run-time error when I try to create any cuDNN descriptor for CUDNN_DATA_INT8x32: https://github.com/AlexeyAB/yolo2_light/issues/18#issuecomment-432452312
@AlexeyAB , Thank you very much
I currently have an 1080ti which does 24FPS with YOLOv3. Does the 2080ti really do 3 times as much with Tensorcores when you say 3x speedup ?
@AlexeyAB any progress on CUDNN_DATA_INT8x32? I checked with Nvidia, and they claim that sm_75 supports Tensor Core INT8. For Tensor Core INT8, the data type should be CUDNN_DATA_INT8x32 and the filter format should be CUDNN_TENSOR_NCHW_VECT_C. Tensor Core INT8 requires that the Channel/Kernel size be aligned by 32. The cuDNN sample code in conv_sample.cpp does pass with INT8x32.
@billmguo
Tensor Core INT8 requires that the Channel/Kernel size be aligned by 32. The cuDNN sample code in conv_sample.cpp does pass with INT8x32.
As far as I know, only Channel must be aligned by 32, not kernel_size, since it is usually 1x1 - 3x3.
Can you quote the doc or source code where "Kernel size must be aligned by 32" is stated?
I tried it here, but I can't even create a cuDNN descriptor with INT8x32 and filters=128, channels=128, kernel_w=3, kernel_h=3: https://github.com/AlexeyAB/yolo2_light/issues/18#issuecomment-433146751
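To make the alignment requirement concrete, here is a small sketch of how a channel count could be checked and rounded up to a multiple of 32. It only illustrates the arithmetic discussed above and does not call cuDNN; the helper names are hypothetical.

```python
ALIGNMENT = 32  # vector width implied by the INT8x32 data type

def is_aligned(channels: int, alignment: int = ALIGNMENT) -> bool:
    """True if the channel count is a multiple of the alignment."""
    return channels % alignment == 0

def pad_channels(channels: int, alignment: int = ALIGNMENT) -> int:
    """Round the channel count up to the next multiple of the alignment."""
    return -(-channels // alignment) * alignment

# The layer tried in the linked comment: filters=128, channels=128.
print(is_aligned(128), pad_channels(128))
# A 3-channel input image or a 125-filter output layer would need padding.
print(is_aligned(3), pad_channels(3))
print(is_aligned(125), pad_channels(125))
```

So the 128-channel layer from the linked comment already satisfies the channel alignment, which suggests the descriptor error has another cause.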
Hi Alexey
Nvidia confirms this is a known issue for 4DDescriptor. Please use NdDescriptor for vectorized data input.
Thanks/Min
@kmsravindra Yes. You can train on the GTX 1080 Ti with CUDNN_HALF=0
And then use this weights-file for detection on Titan V with CUDNN_HALF=1
Hi Alexey, A quick question...Can we do the other way round? Train a model with CUDNN_HALF=1 and use that model to predict on a different machine with CUDNN_HALF=0 ?
thanks!
@kmsravindra Yes.
@AlexeyAB Did "Forward inference time" include postprocess time? I tested on my custom dataset (100 images), and the results did not show any speedup. GPU is Tesla V100, CUDA 9.1, cuDNN 7.1.4.
$ CUDA_VISIBLE_DEVICES=3 python test.py
Try to load cfg: ./cfg/yolov3-voc.cfg, weights: backup416/yolov3-voc_8000.weights, clear = 0
compute_capability = 700, cudnn_half = 1
@phexic
Did "Forward inference time" include postprocess time?
No.
Pre- and post-processing time isn't included, because image loading and saving of result images may take a long time and are CPU- and HDD-bound.
@AlexeyAB Thank you for your explanation. Now I know that pre- and post-processing time is not included in your test time. However, performance is not accelerated at all in my experiments. Is that normal?
I need to speed up yolo v3 inference using FP16 operations with an FPGA implementation. Do I need to train a new weights file for half-precision, or can I just load the official yolov3.weights file and convert the weights from fp32 to fp16 in memory? (Saving file storage is not my concern.)
If I need to train an fp16 weights file, can I get it just by training with CUDNN_HALF=1?
Can I just load the official yolov3.weights file and convert the weights from fp32 to fp16 in memory?
Yes, you can do it.
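A rough sketch of what such an in-memory conversion looks like, using only Python's standard `struct` module. It does not parse the darknet weights-file format; the sample values and helper names are made up, and Python floats stand in for the FP32 weights.

```python
import struct

def floats_to_fp16_bytes(weights):
    """Pack a sequence of floats as little-endian IEEE half-precision."""
    return struct.pack(f"<{len(weights)}e", *weights)

def fp16_bytes_to_floats(buf):
    """Unpack little-endian half-precision bytes back into Python floats."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))

weights = [0.5, -1.25, 0.1, 3.0e-3]  # made-up sample values

half = floats_to_fp16_bytes(weights)      # 2 bytes per weight
restored = fp16_bytes_to_floats(half)

# FP16 keeps 11 significant bits, so normal-range values are off by
# at most about 0.05% after the round trip.
max_rel_err = max(abs(a - b) / abs(a) for a, b in zip(weights, restored))

print(len(half), restored)
print(max_rel_err)
```

Values with magnitude above 65504 do not fit in FP16 (here `struct.pack` raises `OverflowError`), so a real converter would need to handle or clamp such weights.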
@AlexeyAB Hey, when I am training using Tensor Cores it says the following message.
Tensor Cores are disabled until the first 3000 iterations are reached.
Since I am training the model itself for 3000 iterations, I would like to enable Tensor Core before 3000 iterations itself. How do you suggest I do that?
According to the Readme, you shouldn't train for only 3000 iterations.
I added support for Tensor Cores, which should speed up Detection and Training 3x times on GPU since Volta-architecture (Nvidia TITAN V (V100), ...) with CC >= 7.0 and using CUDA >= 9.0 and cuDNN >= 7.0.
The use of Tensor Cores will be turned on automatically using Float-32bit - if your GPU supports it.
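The CC >= 7.0 condition can be expressed as a simple check. This is only a sketch: the encoding major * 100 + minor * 10 is an assumption chosen to match the compute_capability = 700 line printed for a V100 earlier in this thread, and the function names are hypothetical.

```python
def compute_capability(major: int, minor: int) -> int:
    """Encode CC as an integer, e.g. 7.0 -> 700 (assumed encoding)."""
    return major * 100 + minor * 10

def has_tensor_cores(major: int, minor: int) -> bool:
    """Tensor Cores are available from Volta (CC 7.0) onwards."""
    return compute_capability(major, minor) >= 700

print(has_tensor_cores(7, 0))  # Tesla V100 / TITAN V (CC 7.0)
print(has_tensor_cores(7, 5))  # RTX 2080 Ti (CC 7.5)
print(has_tensor_cores(6, 1))  # GTX 1080 Ti (CC 6.1)
print(has_tensor_cores(6, 0))  # Tesla P100 (CC 6.0)
```

This is consistent with the reports above that P100 and GTX 1080 Ti users see no speedup from CUDNN_HALF.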
But if you want to speed up more using mixed-precision (FP32+FP16), then you should define CUDNN_HALF:
on Windows: open \darknet.sln -> (right click on project) -> properties -> C/C++ -> Preprocessor -> Preprocessor Definitions, and add at the beginning of the line: CUDNN_HALF;
on Linux: open Makefile and set CUDNN_HALF=1: https://github.com/AlexeyAB/darknet/blob/140333977cea0ba9e384cd38fd01013a8915ef60/Makefile#L3
For training on Amazon EC2 p3.2xlarge - p3.16xlarge or a remote DGX-2/1 server, use the flag -dont_show: ./darknet detector train data/obj.data yolo-obj.cfg darknet53.conv.74 -dont_show
batch=64 subdivision=4 width=416 height=416 random=0
Tiny-yolo cannot be accelerated much, because it is a very small model. Yolov3 isn't accelerated more for training because it has too many layers, and in the current implementation the data transfer between layers uses FP32.
For Amazon EC2 use:
p3.2xlarge: 1 x Tesla V100
p3.16xlarge: DGX-2 (8 x Tesla V100 with nvLink)
Deep Learning Base AMI (Ubuntu) Version 3.0 (ami-38c87440): Deep Learning base AMI with NVidia drivers like CUDA 8 and 9, CuDNN 6 and 7, CuBLAS 8 and 9, NCCL