AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Regarding mAP and latency of Yolov4 #5354

Open mive93 opened 4 years ago

mive93 commented 4 years ago

Dear Alexey,

first of all, thank you for your work. I have been doing some tests with your new YOLOv4, and I have some questions. I compared the performance of YOLOv4, YOLOv3 and CSPResNeXt50-PANet-SPP (the one I also found in your repo) on two different GPUs, using input size 416x416, and I checked the mAP on the COCO2017 validation set.

Here are the results (both FPS and mAP have been computed using your code):

GeForce RTX 2080 Ti Rev. A (training was running at the same time, so performance may be slightly degraded):

| Network | FPS | mAP@0.50 (val COCO2017) |
|---|---|---|
| YOLOv3 | 39.0 | 66.16 % |
| YOLOv4 | 38.8 | 70.22 % |
| CSPResNeXt50-PANet-SPP | 37.8 | 75.88 % |

GeForce GTX 1060 6GB:

| Network | FPS | FPS (CUDNN_HALF=1) |
|---|---|---|
| YOLOv3 | 31.7 | 30.8 |
| YOLOv4 | 29.9 | 29.0 |
| CSPResNeXt50-PANet-SPP | 28.6 | 28.7 |

However, I have noticed that you do not compare against this third network in your paper. I was wondering what the reason is, and whether I am perhaps doing something wrong.

Thank you in advance.



WongKinYiu commented 4 years ago

Hello,

We use the trainvalno5k set for training, so some images from the val set were seen during training. CSPResNeXt50-PANet-SPP may therefore get a higher AP50 on the val set because it is more fitted to the training data.

The comparison of YOLOv4 (CSPDarknet53-PANet-SPP, BoF-backbone, Mish, optimal setting) and CSPResNeXt50-PANet-SPP is in Table 6.

We chose CSPDarknet53 as the backbone of YOLOv4 since it gets both higher FPS and higher AP.

mive93 commented 4 years ago

Dear @WongKinYiu,

Thank you for your answer. I see why the mAP could be better; however, I am not seeing those better results in FPS. I have tried again on the 2080 Ti, with the GPU otherwise idle, and this is what I get:

Size 512x512, GeForce RTX 2080 Ti Rev. A:

| Network | FPS |
|---|---|
| YOLOv3 | 55.5 |
| YOLOv4 | 60.5 |
| CSPResNeXt50-PANet-SPP | 59.6 |

So again, I don't see a big improvement with YOLOv4. Maybe it's my fault; I would just like to understand why I do not obtain your improvement.

mive93 commented 4 years ago

And another thing, sorry, I forgot: you said you use both the training and validation sets for training, but did you mean for CSPResNeXt50-PANet-SPP or for YOLOv4?

Thanks again.

AlexeyAB commented 4 years ago

@mive93

> using input size 416x416

mive93 commented 4 years ago

Dear @AlexeyAB,

Yes, I am sure the sizes were correct.

These are the commands I ran to get the AVG FPS:

```
./darknet detector demo cfg/coco.data cfg/csresnext50-panet-spp.cfg weights/csresnext50-panet-spp_final.weights ../tkDNN/demo/yolo_test.mp4 -dont_show -ext_output
./darknet detector demo cfg/coco.data cfg/yolov4.cfg weights/yolov4.weights ../tkDNN/demo/yolo_test.mp4 -dont_show -ext_output
./darknet detector demo cfg/coco.data cfg/yolov3.cfg weights/yolov3.weights ../tkDNN/demo/yolo_test.mp4 -dont_show -ext_output
```

and here are the commands to obtain the detections for CodaLab:

```
./darknet detector valid cfg/coco.data cfg/csresnext50-panet-spp.cfg weights/csresnext50-panet-spp_final.weights
./darknet detector valid cfg/coco.data cfg/yolov4.cfg weights/yolov4.weights
./darknet detector valid cfg/coco.data cfg/yolov3.cfg weights/yolov3.weights
```
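As an aside (not part of the original workflow): the same COCO metrics can also be computed locally with pycocotools instead of submitting to CodaLab. A minimal sketch, assuming the detections JSON written by `darknet detector valid` (the exact file name under `results/` may differ) and the standard val2017 annotations:

```python
# Hypothetical local evaluation; both paths are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")    # ground-truth annotations
coco_dt = coco_gt.loadRes("results/coco_results.json")  # detections in COCO JSON format

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()  # prints the same AP/AR table that CodaLab reports
```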

In this folder you can find all the screenshots: https://cloud.hipert.unimore.it/s/g7KZNnytki5gExE

I summarize here the results from CodaLab on val2017:

###############################################################################
#       YOLOV3 416x416 CODALAB res COCO2017 VAL                   #
###############################################################################
overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.380
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.675
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.391
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.227
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.418
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.534
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.304
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.474
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.497
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.330
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.537
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.656
Done (t=81.34s)

GeForce RTX 2080 Ti Rev. A FPS: 75.5

###############################################################################
#       YOLOV4 416x416 CODALAB res COCO2017 VAL               #
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.710
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.510
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.636
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
Done (t=120.59s)

GeForce RTX 2080 Ti Rev. A FPS: 71.7

###############################################################################
#       CSPRESNEXT50-PANET-SPP 416x416 CODALAB res COCO2017 VAL       #
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.497
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.766
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.535
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.269
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.549
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.708
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.363
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.559
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.583
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.376
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.637
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.776
Done (t=78.67s)

GeForce RTX 2080 Ti Rev. A FPS: 70.1

AlexeyAB commented 4 years ago

Oh, you are talking about the old model csresnext50-panet-spp.cfg, not csresnext50-panet-spp-original-optimal.cfg.

Yes, it seems csresnext50-panet-spp.cfg was trained using trainvalno5k.list + 5k.list (maybe), while csresnext50-panet-spp-original-optimal.cfg and yolov4.cfg were trained using only trainvalno5k.list, without 5k.list.

I get these results on 5k.list:


By the way, I can't submit your JSON files, so I just tested these models myself again.

WongKinYiu commented 4 years ago

@AlexeyAB @mive93

I will test and update the FPS on a Turing-architecture GPU in a few days.

If you use the old CSPResNeXt50-PANet-SPP, you will get higher AP at 416x416 due to the anchor setting. https://github.com/AlexeyAB/darknet/issues/5311#issuecomment-619333864

mive93 commented 4 years ago

Dear @AlexeyAB and @WongKinYiu, sorry for my late answer. I have uploaded to the folder both my JSON detections and the txt list: https://cloud.hipert.unimore.it/s/g7KZNnytki5gExE. To test everything on CodaLab I simply followed your wiki (just using COCO val2017 instead of test-dev2017).

However, if you say that you trained that network using those data as well, it makes a lot of sense that its mAP is higher, even though the comparison is not fair. So yeah, I assume YOLOv4's accuracy is better then :)

Thank you for your quick answers, and thank you for clarifying my doubts. I will wait for the FPS results then.

mive93 commented 4 years ago

Dear @AlexeyAB,

yesterday I ported your YOLOv4 to TensorRT using tkDNN, a framework developed by @ceccocats, @sapienzadavide and me (you can find it here).

Here are some performance results on 2 boards, a discrete GPU and an embedded one. The outputs match yours, so the mAP is the same.

AVG FPS over 5000 images, input size 416x416:

AGX Xavier

| Network | FPS (FP32) | FPS (FP16) |
|---|---|---|
| yolov3 | 19.47 | 49.62 |
| yolov4 | 17.52 | 32.67 |

RTX 2080 Ti

| Network | FPS (FP32) | FPS (FP16) |
|---|---|---|
| yolov3 | 106.30 | 192.13 |
| yolov4 | 93.00 | 133.41 |

AlexeyAB commented 4 years ago

@mive93 Hi, Thanks!

> The outputs match yours, so the mAP is the same.

Does it match even for FP16? Is FP32 == FP32, while FP16 == mixed-precision FP16+FP32 on Tensor Cores? Did you test with batch=1? What network resolution did you use? What is the advantage of tkDNN over TensorRT, and what is tkDNN used for if TensorRT does the inference/quantization? Is there a comparison table with FPS for different models/resolutions/float precisions? https://github.com/ceccocats/tkDNN

Can you test YOLOv4 on an RTX 2080 Ti (or preferably on a Tesla V100) at 4 network resolutions, with batch=1 and batch=4?

  1. 320
  2. 416
  3. 512
  4. 608
mive93 commented 4 years ago

Dear @AlexeyAB ,

Sorry for the delay.

###############################################################################
#       DARKNET 416x416 CODALAB res COCO2017 VAL                  #
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.710
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.510
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.636
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
Done (t=120.59s)

###############################################################################
#       TKDNN YOLOV4 FP32 416x416 CODALAB res COCO2017 VAL                   #
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.449
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.701
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.481
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.235
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.507
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.626
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.556
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.329
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.618
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.758
Done (t=73.61s)

###############################################################################
#       TKDNN YOLOV4 FP16 416x416 CODALAB res COCO2017 VAL                   #
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.449
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.701
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.481
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.235
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.507
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.626
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.555
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.330
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.758
Done (t=72.04s)

There is actually a small loss, I guess due to different implementations of the operations. I have tried to investigate further, but I couldn't find another source of mismatch. However, FP16 has the same mAP as FP32.

AlexeyAB commented 4 years ago

@mive93 Thanks!

FPS on RTX 2080 Ti of YOLOv4 with tkDNN (avg over 1200 images of size 640x480):

| Network | FP32 batch=1 | FP32 batch=4 | FP16 batch=1 | FP16 batch=4 |
|---|---|---|---|---|
| yolov4 320 | 116.99 | 58.29 | 204.99 | 105.82 |
| yolov4 416 | 116.27 | 40.68 | 194.64 | 71.08 |
| yolov4 512 | 91.31 | 32.97 | 137.85 | 51.51 |
| yolov4 608 | 62.04 | 20.27 | 109.01 | 37.60 |

> There is actually a small loss, I guess due to different implementations of the operations. I have tried to investigate further, but I couldn't find another source of mismatch. However, FP16 has the same mAP as FP32.

There are 3 different scale_x_y= values:

  1. https://github.com/AlexeyAB/darknet/blob/f14054ec2b49440ad488c3e28612e7a76780bc5f/cfg/yolov4.cfg#L973
  2. https://github.com/AlexeyAB/darknet/blob/f14054ec2b49440ad488c3e28612e7a76780bc5f/cfg/yolov4.cfg#L1060
  3. https://github.com/AlexeyAB/darknet/blob/f14054ec2b49440ad488c3e28612e7a76780bc5f/cfg/yolov4.cfg#L1148
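For context on what that parameter does: scale_x_y stretches the sigmoid output of the box-center prediction around 0.5, so the center can actually reach the borders of a grid cell (a plain sigmoid saturates before 0 and 1). A small illustrative sketch of the formula (hypothetical helper, not tkDNN or Darknet code):

```python
import math

def decode_center(t, cell_index, grid_size, scale_x_y=1.0):
    """Decode one normalized box-center coordinate for a yolo layer."""
    s = 1.0 / (1.0 + math.exp(-t))               # sigmoid(t)
    s = s * scale_x_y - 0.5 * (scale_x_y - 1.0)  # stretch around 0.5 (no-op when scale_x_y=1)
    return (cell_index + s) / grid_size          # normalized center coordinate
```

If a decoder ignores scale_x_y, small systematic offsets in box centers would be one plausible source of the mAP mismatch described above.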
mive93 commented 4 years ago

Hi @AlexeyAB,

                pre(%)      inference(%)    post(%)
RTX2080Ti   yolov4 FP32 10,22       79,60       10,18
RTX2080Ti   yolov4 FP16 17,27       68,28       14,45

AGX Xavier  yolov4 FP32 2,58        95,36       2,06
AGX Xavier  yolov4 FP16 4,83        91,47       3,70
AlexeyAB commented 4 years ago

@mive93 Hi,

> Performance results for other networks are not yet available, but we're submitting a paper next week. Then we will make them public, even though anyone could already reproduce them.

Will you publish the paper on arxiv.org with an AP / FPS comparison of different models, or only FPS?


```c
float mish_activation(float input)
{
    const float MISH_THRESHOLD = 20;
    /* mish(x) = x * tanh(softplus(x)); the threshold guards softplus
       against overflow for large inputs. */
    float output = input * tanh( softplus(input, MISH_THRESHOLD) );
    return output;
}
```
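For reference, the thresholded softplus used there is a numerical-stability guard; a minimal sketch of the assumed behavior (matching the usual Darknet-style definition):

```python
import math

def softplus(x, threshold=20.0):
    if x > threshold:
        return x               # log(1 + e^x) ~= x for large x
    if x < -threshold:
        return math.exp(x)     # log(1 + e^x) ~= e^x for very negative x
    return math.log(1.0 + math.exp(x))

def mish(x):
    return x * math.tanh(softplus(x))

print(mish(1.0))  # ~0.8651
```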

AlexeyAB commented 4 years ago

@mive93

Can you also test AVG_FPS for YOLOv4 on the Darknet (OpenCV + CUDA + cuDNN) on the same GPU 2080 Ti, for these network resolutions 320, 416, 512, 608?

By using a command such as: ./darknet detector demo cfg/coco.data cfg/yolov4.cfg weights/yolov4.weights yolo_test.mp4 -dont_show -ext_output

mive93 commented 4 years ago

Hi @AlexeyAB,

I am sorry for the delay, but I have work-related deadlines this week. I will get back to you after those and try to address your requests :) Sorry for now.

mive93 commented 4 years ago

Hi @AlexeyAB, sorry for the long delay.

> Will you publish the paper on arxiv.org with an AP / FPS comparison of different models, or only FPS?

We submitted to a conference, and we ran experiments in terms of mAP, latency and power consumption. As soon as it's accepted, we also plan to share the raw data.

> Do you use only FP32 (without Tensor Cores) and FP16 (with Tensor Cores), but don't use FP32/16 (mixed precision with Tensor Cores), because FP16 shows the same good accuracy?

We use FP32 (without Tensor Cores) and FP32/16 (mixed precision with Tensor Cores) [as well as FP32/INT8], because plugins always run at FP32.

> Will you add a manual on how to measure AP / AP50 and FPS by using tkDNN+TensorRT?

In master there is already a demo that computes the mAP for each method supported by tkDNN. However, it is a bit different from yours, because each bounding box is counted only once, with the highest probability. The README explains how to compute the mAP; every precision is supported. Moreover, in the eval branch I added an export to JSON, to evaluate the performance of each network/precision on CodaLab.

> Will you add a demo on a video file that checks FPS including inference + pre- + post-processing running in 3 CPU threads, and that can use both batch=1 and batch=4? And that prints detection results to the console and optionally shows the video in a window (switchable off, since it can reduce FPS)?

I will work on the demo with batch > 1 this week; I'll keep you updated when I have something working.

> Did you compare inference time with batch=1 for tkDNN vs OpenCV-dnn? opencv/opencv#17148

I had never heard of that; I will take a look, thanks.

> Do you use the same Mish implementation as in Darknet?

Yes

> Can you also test AVG_FPS for YOLOv4 on the Darknet (OpenCV + CUDA + cuDNN) on the same GPU 2080 Ti, for these network resolutions 320, 416, 512, 608?

Here are the results:

| Size | FPS (avg) |
|---|---|
| 320 | 100.6 |
| 416 | 82.5 |
| 512 | 69.7 |
| 608 | 53.6 |
AlexeyAB commented 4 years ago

@mive93 Hi, Thanks!

So, tkDNN accelerates YOLOv4 by ~2x for batch=1 and by 3x-4x for batch=4.

| Size | Darknet FPS (avg) | tkDNN TensorRT FP32 FPS | tkDNN TensorRT FP16 FPS | tkDNN TensorRT FP16 batch=4 FPS | Speedup |
|---|---|---|---|---|---|
| 320 | 100.6 | 116 | 202 | 423 | 4.2x |
| 416 | 82.5 | 103 | 162 | 284 | 3.5x |
| 512 | 69.7 | 91 | 134 | 206 | 2.9x |
| 608 | 53.6 | 62 | 100 | 150 | 2.8x |

> We submitted to a conference, and we ran experiments in terms of mAP, latency and power consumption. As soon as it's accepted, we also plan to share the raw data.

When will the conference be?

> Moreover, in the eval branch I added an export to JSON, to evaluate the performance of each network/precision on CodaLab.

It would be great if in the future you could achieve accuracy identical to Darknet's.

AlexeyAB commented 4 years ago

@mive93 Hi,

We use a new mish implementation and get +3% FPS with the same AP detection accuracy on MS COCO test-dev: https://github.com/AlexeyAB/darknet/blob/bef28445e57cd560fa3d0a24af98a562d289135b/src/activation_kernels.cu#L235-L246

More: https://github.com/AlexeyAB/darknet/issues/5452#issuecomment-627414024

So you can try to use this implementation in tkDNN.

lazerliu commented 4 years ago

Excuse me @mive93, how did you get this? What command did you use? Thanks!

###############################################################################
#       DARKNET 416x416 CODALAB res COCO2017 VAL                  #
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.710
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.510
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.636
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
Done (t=120.59s)
mive93 commented 4 years ago

Hi @AlexeyAB. The conference will be in September, and it covers not only tkDNN but also other things (we test 5 different CNNs on 3 embedded boards with 3 different frameworks). We are also considering an arXiv paper just on tkDNN performance; when we have time, we'll probably do that. I will keep you updated if you're interested.

As for the new mish function, I can include that. Thank you :)

@lazerliu I obtained those results using CodaLab. You first have to generate a JSON (in COCO format) of the detections, then submit it on the site. More info is also given in this repo's wiki.

AlexeyAB commented 4 years ago

@lazerliu Create a new topic.

lazerliu commented 4 years ago

@mive93, I just wanted to validate my dataset, thank you! @AlexeyAB thanks for your reply; I've created issue #5615 about getting all the metrics of COCO mAP using this repository.

YashasSamaga commented 4 years ago

Results for OpenCV DNN @ master (https://github.com/opencv/opencv/tree/6b0fff72d9748345c6a079e4fce49af4130d8e12):

Device: RTX 2080 Ti

| Input Size | FP32 FPS | FP16 FPS | FP32 batch=4 | FP16 batch=4 |
|---|---|---|---|---|
| 320 x 320 | 129.2 | 171.2 | 198 | 384 |
| 416 x 416 | 99.9 | 146 | 139.6 | 260.5 |
| 512 x 512 | 90.3 | 125.6 | 112.8 | 190.5 |
| 608 x 608 | 56 | 103.2 | 68.5 | 133 |

Code: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf

There are currently two open PRs which affect YOLOv4 performance; they will likely improve performance by around 5-10%.


The timings often change slightly every time the benchmark program is run. Here is the raw output from the benchmark code:

```
1 x 3 x 608 x 608:
YOLO v4
[CUDA FP32]
init >> 463.515ms
inference >> min = 16.964ms, max = 21.649ms, mean = 17.8347ms, stddev = 1.25985ms
[CUDA FP16]
init >> 311.645ms
inference >> min = 9.644ms, max = 9.867ms, mean = 9.69076ms, stddev = 0.0379731ms

4 x 3 x 608 x 608:
[CUDA FP32]
init >> 625.919ms
inference >> min = 57.811ms, max = 59.368ms, mean = 58.4264ms, stddev = 0.270633ms
[CUDA FP16]
init >> 523.272ms
inference >> min = 29.902ms, max = 31.423ms, mean = 30.0806ms, stddev = 0.16901ms

1 x 3 x 512 x 512:
YOLO v4
[CUDA FP32]
init >> 432.214ms
inference >> min = 10.87ms, max = 13.608ms, mean = 11.0792ms, stddev = 0.418999ms
[CUDA FP16]
init >> 318.978ms
inference >> min = 7.934ms, max = 8.003ms, mean = 7.96052ms, stddev = 0.0138107ms

4 x 3 x 512 x 512:
YOLO v4
[CUDA FP32]
init >> 551.452ms
inference >> min = 34.908ms, max = 41.57ms, mean = 35.4624ms, stddev = 0.886297ms
[CUDA FP16]
init >> 508.225ms
inference >> min = 20.864ms, max = 21.621ms, mean = 21.0014ms, stddev = 0.111174ms

1 x 3 x 416 x 416:
YOLO v4
[CUDA FP32]
init >> 379.083ms
inference >> min = 9.701ms, max = 12.643ms, mean = 10.0155ms, stddev = 0.679755ms
[CUDA FP16]
init >> 248.296ms
inference >> min = 6.825ms, max = 6.91ms, mean = 6.85503ms, stddev = 0.0195312ms

4 x 3 x 416 x 416:
YOLO v4
[CUDA FP32]
init >> 462.255ms
inference >> min = 28.082ms, max = 32.272ms, mean = 28.6683ms, stddev = 0.87224ms
[CUDA FP16]
init >> 386.791ms
inference >> min = 15.25ms, max = 18.449ms, mean = 15.3566ms, stddev = 0.317417ms

1 x 3 x 320 x 320:
YOLO v4
[CUDA FP32]
init >> 377.244ms
inference >> min = 7.506ms, max = 9.768ms, mean = 7.73995ms, stddev = 0.557712ms
[CUDA FP16]
init >> 250.421ms
inference >> min = 5.826ms, max = 5.879ms, mean = 5.84173ms, stddev = 0.00956832ms

4 x 3 x 320 x 320:
YOLO v4
[CUDA FP32]
init >> 418.565ms
inference >> min = 19.726ms, max = 24.998ms, mean = 20.1585ms, stddev = 0.504587ms
[CUDA FP16]
init >> 336.484ms
inference >> min = 10.383ms, max = 10.5ms, mean = 10.4267ms, stddev = 0.0210358ms
```
AlexeyAB commented 4 years ago

@YashasSamaga

Can you add a column with the FPS you get by using Darknet (GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1) with the command: ./darknet detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights test.mp4 -ext_output -dont_show -benchmark

> 680 x 680 doesn't load on OpenCV DNN. It says inconsistent shape at some layer.

Why did you try 680x680? It should be a multiple of 32.

YashasSamaga commented 4 years ago

> Can you add a column with the FPS you get by using Darknet (GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1) with the command: ./darknet detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights test.mp4 -ext_output -dont_show -benchmark

I ran the command and waited for several seconds, then picked the best AVG_FPS that I saw; it increases initially and then seems to oscillate around an equilibrium value.

| Input Size | Darknet FP16 | OCV FP32 FPS | OCV FP16 FPS | OCV FP32 batch=4 | OCV FP16 batch=4 |
|---|---|---|---|---|---|
| 320 x 320 | 105.8 | 129.2 | 171.2 | 198 | 384 |
| 416 x 416 | 85.6 | 99.9 | 146 | 139.6 | 260.5 |
| 512 x 512 | 71.8 | 90.3 | 125.6 | 112.8 | 190.5 |
| 608 x 608 | 56.7 | 56 | 103.2 | 68.5 | 133 |

> Why did you try 680x680? It should be a multiple of 32.

Some of the tables in the comments in this issue had 680.

AlexeyAB commented 4 years ago

I fixed it )

YashasSamaga commented 4 years ago

I don't know if the timings for Darknet, tkDNN and OpenCV were measured under the same conditions. Here is what the benchmark code I used does:

The OpenCV benchmark measures the time taken for net.forward to complete. The net.forward takes one blob as input and returns three cv::Mat. Each output corresponds to the detections from each [yolo] layer. NMS is not performed. The benchmark includes the time taken to transfer the input from the CPU to GPU as well as the time taken for the outputs to be transferred from the GPU to CPU. The timing endpoints are on CPU; the GPU end to end time will be slightly smaller (negligibly smaller mostly). OpenCV DNN does lazy initialization and hence the first forward pass is generally very slow. The benchmark ignores the first three runs.
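For the record, the measurement described above corresponds roughly to the following sketch (not the exact benchmark; see the linked gist for that; file names are placeholders):

```python
import time
import cv2 as cv
import numpy as np

net = cv.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
net.setPreferableBackend(cv.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv.dnn.DNN_TARGET_CUDA_FP16)

frame = np.zeros((416, 416, 3), np.uint8)  # dummy input frame
blob = cv.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True)
net.setInput(blob)
out_names = net.getUnconnectedOutLayersNames()  # one output per [yolo] layer

for _ in range(3):              # skip the slow, lazily-initialized first runs
    net.forward(out_names)

runs = 100
t0 = time.perf_counter()
for _ in range(runs):
    net.forward(out_names)      # includes H2D input and D2H output transfers
dt = (time.perf_counter() - t0) / runs
print(f"mean forward time: {dt * 1000:.2f} ms ({1 / dt:.1f} FPS)")
```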

AlexeyAB commented 4 years ago

So, tkDNN accelerates YOLOv4 by ~2x for batch=1 and by 3x-4x for batch=4; OpenCV-dnn is ~10% slower than tkDNN-TensorRT.

tkDNN: https://github.com/ceccocats/tkDNN
OpenCV: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf

| Size | Darknet FPS (avg) | tkDNN TensorRT FP32 FPS | tkDNN TensorRT FP16 FPS | OpenCV FP16 FPS | tkDNN TensorRT FP16 batch=4 FPS | OpenCV FP16 batch=4 FPS | tkDNN Speedup |
|---|---|---|---|---|---|---|---|
| 320 | 100 | 116 | 202 | 171 | 423 | 384 | 4.2x |
| 416 | 82 | 103 | 162 | 146 | 284 | 260 | 3.5x |
| 512 | 69 | 91 | 134 | 125 | 206 | 190 | 2.9x |
| 608 | 53 | 62 | 103 | 100 | 150 | 133 | 2.8x |
YashasSamaga commented 4 years ago

I forgot to mention that I had set nms_threshold=0 in all [yolo] blocks of the configuration file; otherwise, NMS is done automatically in the region layers.
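For illustration, that change is a single key added to each [yolo] block of yolov4.cfg (fragment below; the existing keys are left untouched):

```
[yolo]
# ... existing keys (mask, anchors, classes, num, scale_x_y, ...) unchanged ...
nms_threshold = 0
```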

RTX 2070S 608 x 608

without setting nms_threshold (OpenCV defaults to 0.2):

```
YOLO v4
[CUDA FP32]
        init >> 1329.51ms
        inference >> min = 45.596ms, max = 49.184ms, mean = 46.7278ms, stddev = 0.57918ms
[CUDA FP16]
        init >> 865.449ms
        inference >> min = 37.418ms, max = 43.093ms, mean = 39.4826ms, stddev = 1.24976ms
```

with nms_threshold=0 in all [yolo] blocks:

```
YOLO v4
[CUDA FP32]
        init >> 1245.76ms
        inference >> min = 29.934ms, max = 31.181ms, mean = 30.3622ms, stddev = 0.207436ms
[CUDA FP16]
        init >> 876.087ms
        inference >> min = 22.916ms, max = 28.212ms, mean = 24.5076ms, stddev = 1.09143ms
```

I have written an example which performs one full NMS (not class-wise) at the end, instead of performing it three times during inference (which causes unnecessary context switches, as NMS runs on the CPU). This barely changes the FPS.
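A minimal sketch of that idea (one global NMS over the concatenated detections from all three [yolo] outputs, using OpenCV's NMSBoxes; this is an illustration, not the exact example mentioned above):

```python
import cv2 as cv
import numpy as np

def global_nms(layer_outputs, img_w, img_h, conf_thresh=0.25, nms_thresh=0.45):
    """layer_outputs: list of (N_i x 85) arrays returned by net.forward()."""
    boxes, scores = [], []
    for out in layer_outputs:
        for det in out:
            score = float(det[4] * det[5:].max())  # objectness * best class score
            if score < conf_thresh:
                continue
            cx, cy, w, h = det[0] * img_w, det[1] * img_h, det[2] * img_w, det[3] * img_h
            boxes.append([int(cx - w / 2), int(cy - h / 2), int(w), int(h)])
            scores.append(score)
    keep = cv.dnn.NMSBoxes(boxes, scores, conf_thresh, nms_thresh)  # one NMS for all layers
    return [(boxes[i], scores[i]) for i in np.asarray(keep).flatten()]
```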

AlexeyAB commented 4 years ago

@YashasSamaga Do you think we should request such an improvement, with the ability to switch it on and off, in OpenCV?

YashasSamaga commented 4 years ago

I have always wondered about the benefits of performing NMS in each yolo detection layer. Is there any advantage compared with doing one combined NMS at the end?

Doing the NMS at the end will definitely help the performance of the OpenCV CUDA backend currently, but I don't know how things will change once GPU NMS kernels are added (some work is in progress for the DetectionOutput layer at https://github.com/opencv/opencv/pull/17301).

I think the best place such a thing could be introduced is in DetectionModel which is a part of the high-level model API that was recently introduced in OpenCV DNN.

AlexeyAB commented 4 years ago

> I have always wondered about the benefits of performing NMS in each yolo detection layer. Is there any advantage compared with doing one combined NMS at the end?

I think not. Darknet uses one NMS for all yolo layers.

YashasSamaga commented 4 years ago

I did a bit of investigation. The YOLOv2 PR added NMS in the region layer because there was only one region layer back then. The YOLOv3 PR reused the region layer, but this led to NMS being performed in each region layer. I think it's a bug which I thought was a feature all this time.

I have opened an issue https://github.com/opencv/opencv/issues/17415

mive93 commented 4 years ago

Hi @YashasSamaga, thank you for profiling OpenCV-dnn and comparing it with tkDNN also :)

In the last few days we have released a new version of tkDNN, now with a darknet parser, the new mish, and batch handling for pre- and post-processing as well. I haven't profiled it seriously yet; if you are interested, I can do that soonish.

mive93 commented 4 years ago

Hi :) In the README of tkDNN you can now find the performance of YOLOv4 on different boards. Here's a screenshot of the table:

(screenshot of the tkDNN performance table)

YashasSamaga commented 4 years ago

The dataset and the list of images were taken from "How to evaluate accuracy and speed of YOLOv4".

Darknet as of e08a818, with the original yolov4.cfg.

Darknet (FP32)
GTX 1050

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.435
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.657
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.473
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.342
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.580
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.403
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.713
Done (t=487.90s)

The number of detections was considerably smaller in OpenCV. I eventually figured out that OpenCV was discarding detections with low confidence scores, so I added thresh=0.001 to all [yolo] blocks in yolov4.cfg. The number of detections from Darknet and OpenCV still isn't the same, but it is closer than before. I suspect the remaining difference is caused by the NMS method (OpenCV does globally optimal NMS).

Code: https://gist.github.com/YashasSamaga/077a1d69c48e4cdb9957d167b7000b98

OpenCV DNN CUDA (FP32)
RTX 2080 Ti

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.436
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.657
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.474
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.342
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.580
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.405
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.715
Done (t=329.38s)
OpenCV DNN CUDA (FP16)
RTX 2080 Ti

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.435
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.657
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.473
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.532
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.342
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.580
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.404
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.714
Done (t=325.70s)

The numbers for OpenCV are better than Darknet's. I think it's because of the NMS, but I wanted to rule out variations arising from the different convolution kernels selected on different devices (the Darknet stats were generated on a GTX 1050 while the OpenCV stats were generated on an RTX 2080 Ti).

OpenCV FP32 on GTX 1050:

```
overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.436
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.657
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.474
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.342
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.580
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.405
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.715
Done (t=328.92s)
```
YashasSamaga commented 4 years ago

If I do not set thresh=0.001 in all [yolo] blocks, 0.2 is used as the confidence threshold. There is considerable performance degradation when using 0.2:

OpenCV CUDA FP16
RTX 2080 Ti
thresh = 0.2 (opencv default)

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.400
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.583
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.445
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.231
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.431
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.500
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.313
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.461
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.275
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.505
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.604
Done (t=117.94s)

I wonder if this default behaviour in OpenCV is correct.

AlexeyAB commented 4 years ago

@YashasSamaga This is normal: for detection you should use an optimal conf-thresh of 0.2 - 0.25, while AP calculation should be done over every possible conf-thresh, starting from 0.001.
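To illustrate why (toy numbers, not from the benchmarks above): AP is the area under the precision-recall curve built by sweeping the confidence threshold over all detections, so low-confidence detections still contribute to the high-recall end of the curve even though you would never display them.

```python
import numpy as np

# Toy PR curve: each point corresponds to a lower confidence threshold.
recall    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 0.9, 0.8, 0.6, 0.4, 0.2])

ap_full      = np.trapz(precision, recall)          # whole curve (thresh ~0.001)
ap_truncated = np.trapz(precision[:4], recall[:4])  # curve cut off at thresh 0.2
print(ap_full, ap_truncated)  # 0.66 vs 0.50 -- the cut-off loses AP
```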

gachiemchiep commented 4 years ago

Hello @AlexeyAB. I tried your model with OpenCV DNN and got the same results as reported. The output is a list of arrays with the following shapes:

batch_size x 17328 x 85
batch_size x 4332 x 85
batch_size x 1083 x 85

I understand that 85 corresponds to [center_x, center_y, width, height, box_confidence, class_1_score, ...]: COCO has 80 classes, so 4 + 1 + 80 = 85. But what do 17328, 4332 and 1083 stand for? Would you mind giving me a quick hint? Thanks.

WongKinYiu commented 4 years ago

They are grid_width * grid_height * masks * (classes + coordinates + objectness), so 1083 * 85 = 19 * 19 * 3 * (80 + 4 + 1).

gachiemchiep commented 4 years ago

@WongKinYiu But there are 3 of them: 17328, 4332, 1083. Do you know the meaning of the other 2?

WongKinYiu commented 4 years ago

There are three yolo layers (feature pyramid):

17328 * 85 = 76 * 76 * 3 * (80 + 4 + 1)
4332 * 85 = 38 * 38 * 3 * (80 + 4 + 1)
1083 * 85 = 19 * 19 * 3 * (80 + 4 + 1)
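A quick sanity check of those shapes (assuming a 608x608 input, strides 8/16/32, 3 anchors per scale and 80 COCO classes):

```python
input_size, num_classes, anchors_per_scale = 608, 80, 3
for stride in (8, 16, 32):
    g = input_size // stride                      # grid is input_size / stride
    print(f"{g * g * anchors_per_scale} x {num_classes + 5}")
# -> 17328 x 85, 4332 x 85, 1083 x 85
```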

gachiemchiep commented 4 years ago

@WongKinYiu Thank you for your kindness; it helped me a lot. I checked the YOLOv3 and FPN papers and found the explanation of the feature pyramid.

YashasSamaga commented 4 years ago

@mive93 Do tkDNN benchmarks include the host/device memory transfer time?

I was looking at the tkDNN source and, if I have understood correctly, the input is copied from the host to the device, the input on the device is then copied to TRT's device buffer, inference is done, and the outputs in TRT's buffer are copied to non-TRT output buffers and then to the host. The time reported by tkDNN covers the copies between the device buffers and TRT's buffers plus the inference time. Is this correct?

mive93 commented 4 years ago

Hi @YashasSamaga, what you said is correct. And yes, the tkDNN benchmarks I reported include only inference; preprocessing and postprocessing are left out.

AlexeyAB commented 4 years ago

@mive93 Do you use overlapping in 3 threads/streams?

  1. pre-processing (CPU -> GPU)
  2. inference on GPU
  3. post-processing (GPU -> CPU)

Yes, it reduces latency. But does it reduce FPS?

YashasSamaga commented 4 years ago

Btw, the OpenCV benchmarks I reported include the GPU-CPU transfer times. They total about 1.1ms on an RTX 2080 Ti (0.3ms for the input, 0.53ms for output1, 0.25ms for output2 and 0.03ms for output3) for single-image inference with pinned host memory. If this extra time is deducted from the OpenCV timings I reported, I think OpenCV is faster than tkDNN on the RTX 2080 Ti for single-image inference.

OpenCV master (as of today) takes 9.5ms for single-image inference (inclusive of the 1.1ms) and tkDNN takes 9.0ms. Subtracting 1.1ms gives ~8.4ms for OpenCV, but tkDNN also makes a device-to-device copy during inference which OpenCV doesn't; D2D copies are much faster than H2D or D2H copies (probably negligible compared to 1.1ms).

Anyway, OpenCV and tkDNN are close enough that any benchmark will depend on these minute details. So it's not meaningful to compare with numbers very close to each other.

AlexeyAB commented 4 years ago

If the 3 operations are overlapped, they increase the latency but do not affect the FPS.