Open mive93 opened 4 years ago
Hello,
We use the trainvalno5k set for training, so some images in the val set were seen during training. CSPResNeXt50-PANet-SPP may therefore get a higher AP50 on the val set simply because it fits the training data better.
The comparison of YOLOv4 (CSPDarknet53-PANet-SPP, BoF-backbone, Mish, optimal setting) and CSPResNeXt50-PANet-SPP is in Table 6.
We chose CSPDarknet53 as the backbone of YOLOv4 since it gets both higher FPS and AP.
Dear @WongKinYiu,
Thank you for your answer. I see why the mAP could be better, however I'm not seeing the better FPS results. I have tried again on the 2080Ti, with the GPU unloaded, and this is what I get:
Size 512x512
GeForce RTX 2080 Ti Rev. A
--------------------------------------------------------
FPS
YOLOV3 55.5
YOLOV4 60.5
CSRESNEXT50-SPP-PANET 59.6
Therefore again, I don't see a big improvement in Yolov4. Again, maybe it's my fault, I would just like to understand why I do not obtain your improvement.
And another thing, sorry I forgot, you said you use both training and validation for training, but you meant for CSPResNeXt50-PANet-SPP or for Yolov4?
Thanks again.
@mive93
using input size 416x416
Check that you use [net] width=416 height=416 in both cfg-files.
What command did you use for checking AP and FPS?
Can you show a screenshot from https://competitions.codalab.org where you get 0.75 AP50 on val2017?
Can you show screenshot with FPS?
Dear @AlexeyAB,
Yes, I am sure the sizes were correct.
These are the commands I ran to get the AVG FPS:
./darknet detector demo cfg/coco.data cfg/csresnext50-panet-spp.cfg weights/csresnext50-panet-spp_final.weights ../tkDNN/demo/yolo_test.mp4 -dont_show -ext_output
./darknet detector demo cfg/coco.data cfg/yolov4.cfg weights/yolov4.weights ../tkDNN/demo/yolo_test.mp4 -dont_show -ext_output
./darknet detector demo cfg/coco.data cfg/yolov3.cfg weights/yolov3.weights ../tkDNN/demo/yolo_test.mp4 -dont_show -ext_output
Here are the commands to obtain the detections for CodaLab:
./darknet detector valid cfg/coco.data cfg/csresnext50-panet-spp.cfg weights/csresnext50-panet-spp_final.weights
./darknet detector valid cfg/coco.data cfg/yolov4.cfg weights/yolov4.weights
./darknet detector valid cfg/coco.data cfg/yolov3.cfg weights/yolov3.weights
In this folder you can find all the screenshots: https://cloud.hipert.unimore.it/s/g7KZNnytki5gExE
I summarize here the results from codalab on val2017
###############################################################################
# YOLOV3 416x416 CODALAB res COCO2017 VAL #
###############################################################################
overall performance
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.380
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.675
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.391
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.227
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.418
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.534
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.304
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.474
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.497
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.330
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.537
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.656
Done (t=81.34s)
GeForce RTX 2080 Ti Rev. A FPS: 75.5
###############################################################################
# YOLOV4 416x416 CODALAB res COCO2017 VAL #
###############################################################################
overall performance
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.471
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.710
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.510
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.636
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.357
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.561
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.587
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.382
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
Done (t=120.59s)
GeForce RTX 2080 Ti Rev. A FPS: 71.7
###############################################################################
# CSPRESNEXT50-PANET-SPP 416x416 CODALAB res COCO2017 VAL #
###############################################################################
overall performance
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.497
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.766
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.535
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.269
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.549
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.708
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.363
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.559
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.583
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.376
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.637
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.776
Done (t=78.67s)
GeForce RTX 2080 Ti Rev. A FPS: 70.1
val2017.list with the list of COCO 2017 validation images.
Oh, you are asking about the old model csresnext50-panet-spp.cfg, not about csresnext50-panet-spp-original-optimal.cfg.
Yes, it seems csresnext50-panet-spp.cfg was trained by using trainvalno5k.list + 5k.list (maybe), while csresnext50-panet-spp-original-optimal.cfg and yolov4.cfg were trained by using only trainvalno5k.list, without 5k.list.
I get these results on 5k.list:
yolov4.cfg - 416x416 - val2017:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.471
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.710
csresnext50-panet-spp-original-optimal.cfg - 416x416 - val2017:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.457
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.693
csresnext50-panet-spp.cfg - 416x416 - val2017:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.497
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.766
By the way, I can't submit your json-files, so I just tested these models by myself again.
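The train/validation overlap discussed above can be checked by intersecting the two image lists. This is only a minimal sketch, assuming each list holds one image path per line; the function name `overlap` and the sample paths are illustrative and not part of Darknet. File naming differs between COCO releases, so names may need normalizing before comparing.

```python
from pathlib import PurePosixPath

def overlap(train_lines, val_lines):
    """Return image file names present in both lists, compared by base name."""
    train = {PurePosixPath(p.strip()).name for p in train_lines if p.strip()}
    val = {PurePosixPath(p.strip()).name for p in val_lines if p.strip()}
    return sorted(train & val)

# Hypothetical paths, for illustration only.
train_list = ["coco/images/train2014/a.jpg", "coco/images/val2014/b.jpg"]
val_list = ["coco/images/val2014/b.jpg", "coco/images/val2014/c.jpg"]

print(overlap(train_list, val_list))
```

Any non-empty result means the validation score is partly measured on images the model was trained on.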
@AlexeyAB @mive93
I will test and update FPS on Turing architecture GPU in few days.
If you use the old CSPResNeXt50-PANet-SPP, you will get higher AP on 416x416 due to the anchor setting. https://github.com/AlexeyAB/darknet/issues/5311#issuecomment-619333864
Dear @AlexeyAB and @WongKinYiu. Sorry for my late answer. I have uploaded both my json detections and the txt list to the folder: https://cloud.hipert.unimore.it/s/g7KZNnytki5gExE. To test everything on codalab I have just followed your wiki (just using COCOval2017 instead of testdev-2017).
However, if you say that you trained that network using also those data, it makes a lot of sense that the mAP is higher, even though it's not fair. So yeah, I assume Yolov4 accuracy is better then :)
Thank you for your quick answers, and thank you for clarifying my doubts. I will wait for the FPS results then.
Dear @AlexeyAB,
Yesterday I ported your YOLOv4 to TensorRT using tkDNN, a framework developed by @ceccocats, @sapienzadavide and me (you can find it here).
Some performance results on 2 boards, a discrete and an embedded one. The outputs match with yours, so the mAP is the same.
AVG FPS over 5000 images, input size 416x416.

AGX Xavier

Model | FPS - FP32 | FPS - FP16 |
---|---|---|
yolov3 | 19,47 | 49,62 |
yolov4 | 17,52 | 32,67 |

RTX 2080 Ti

Model | FPS - FP32 | FPS - FP16 |
---|---|---|
yolov3 | 106,30 | 192,13 |
yolov4 | 93,00 | 133,41 |

@mive93 Hi, Thanks!
The outputs match with yours, so the mAP is the same.
Does it match even for FP16? Is FP32 == FP32, while FP16 == Mixed Precision FP16+FP32 on TensorCores? Did you test it with batch=1? What network resolution did you use? What is the advantage of tkDNN over TensorRT, and for what tkDNN is used if TensorRT is used for inference/quantization? Is there some comparison table with FPS for different models/resolutions/float-precisions? https://github.com/ceccocats/tkDNN
Can you test YOLOv4 on RTX2080Ti (or preferably on Tesla V100) for 4 network resolutions with batch=1 and batch=4?
Dear @AlexeyAB ,
Sorry for the delay.
###############################################################################
# DARKNET 416x416 CODALAB res COCO2017 VAL #
###############################################################################
overall performance
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.471
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.710
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.510
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.636
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.357
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.561
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.587
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.382
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
Done (t=120.59s)
###############################################################################
TKDNN YOLOV4 FP32 416x416 CODALAB res COCO2017 VAL
###############################################################################
overall performance
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.449
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.701
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.481
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.235
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.507
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.626
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.343
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.533
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.556
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.329
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.618
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.758
Done (t=73.61s)
###############################################################################
TKDNN YOLOV4 FP16 416x416 CODALAB res COCO2017 VAL
###############################################################################
overall performance
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.449
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.701
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.481
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.235
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.507
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.626
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.343
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.533
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.555
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.330
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.758
Done (t=72.04s)
Actually there is a small loss, I guess due to different implementation of the operations. I have tried to investigate more, but I couldn't find another source of mismatch. However, FP16 has the same mAP as FP32.
FP32 is full precision, FP16 is half precision for TensorRT. With tkDNN it is also possible to run inference in INT8 (always using TensorRT INT8 quantization) and on the DLA. However, since many layers in these kinds of networks are implemented via plugins rather than native TensorRT APIs, the performance with DLA or INT8 could be degraded.
Yes, I did the tests with batch 1
I was using 416x416
TkDNN uses tensorRT. We just tried to achieve the best performances exploiting it, while not depending on deepstream for example. We also compared our performance with deepstream (to be precise last autumn), and we perform better. In this table batch=1 is considered and the same video was used to compute the avg fps. Darknet is the one by Redmon. The considered board here is the NVIDIA Tx2.
Performance results for other networks are not yet available, but we're submitting a paper next week. Then we will make them public. Even though anyone could reproduce them already.
Here there are the results of the tests you asked for:
FPS on RTX 2080Ti of Yolov4 TkDNN (avg over 1200 img of size 640 x 480)

Network | FP32 - BATCH=1 | FP32 - BATCH=4 | FP16 - BATCH=1 | FP16 - BATCH=4 |
---|---|---|---|---|
yolov4 320 | 116,99 | 58,29 | 204,99 | 105,82 |
yolov4 416 | 116,27 | 40,68 | 194,64 | 71,08 |
yolov4 512 | 91,31 | 32,97 | 137,85 | 51,51 |
yolov4 608 | 62,04 | 20,27 | 109,01 | 37,60 |

@mive93 Thanks!
FPS on RTX 2080Ti of Yolov4 TkDNN (avg over 1200 img of size 640 x 480)

Network | FP32 - BATCH=1 | FP32 - BATCH=4 | FP16 - BATCH=1 | FP16 - BATCH=4 |
---|---|---|---|---|
yolov4 320 | 116,99 | 58,29 | 204,99 | 105,82 |
yolov4 416 | 116,27 | 40,68 | 194,64 | 71,08 |
yolov4 512 | 91,31 | 32,97 | 137,85 | 51,51 |
yolov4 608 | 62,04 | 20,27 | 109,01 | 37,60 |

Does 37.60 FPS for batch=4 actually mean that tkDNN processes 37.6 x 4 = 150.4 FPS for YOLOv4 width=608 height=608 batch_size=4 FP16 on RTX 2080 Ti?
Usually high batch size increases FPS.
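The arithmetic behind this question can be written out as a small sketch. It assumes the reported figure counts batches per second rather than images per second; the function name is illustrative.

```python
def images_per_second(batches_per_second, batch_size):
    """Convert a per-batch rate into per-image throughput."""
    return batches_per_second * batch_size

# 608x608, FP16, batch=4 figure quoted above.
print(images_per_second(37.60, 4))
```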
Do you measure just inference time, or full-cycle FPS? Do pre-processing (resizing) and post-processing (NMS) execute asynchronously in separate CPU threads, so that they do not reduce FPS?
Actually there is a small loss, I guess due to different implementation of the operations. I have tried to investigate more, but I couldn't find another source of mismatch. However, FP16 has the same mAP as FP32.
Do you use resizing before inference without keeping aspect ratio? This repo https://github.com/AlexeyAB/darknet doesn't keep aspect ratio (i.e. by default letter_box=0), while https://github.com/pjreddie/darknet keeps aspect ratio.
Just do cv::resize(src, dst, Size(608,608)); without keeping aspect ratio
https://github.com/AlexeyAB/darknet/issues/232#issuecomment-336955485
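To make the difference between the two resizing modes concrete, here is a minimal sketch that only computes the resulting geometry. The function names are illustrative, and the letterbox variant assumes centred padding; neither is taken from either repository.

```python
def stretch_size(src_w, src_h, net_w, net_h):
    """Plain resize: the image is stretched to the network size, aspect ratio is lost."""
    return net_w, net_h

def letterbox_size(src_w, src_h, net_w, net_h):
    """Aspect-preserving resize: returns (new_w, new_h, pad_x, pad_y)."""
    scale = min(net_w / src_w, net_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (net_w - new_w) // 2, (net_h - new_h) // 2
    return new_w, new_h, pad_x, pad_y

# A 640x480 frame fed to a 608x608 network.
print(stretch_size(640, 480, 608, 608))
print(letterbox_size(640, 480, 608, 608))
```

If the two pipelines preprocess differently, the same weights see differently shaped objects, which is one plausible source of an accuracy mismatch.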
Also what NMS implementation do you use? (is it regular NMS or soft-NMS) https://github.com/AlexeyAB/darknet/blob/36c73c5b9e3f2e72049fb68566e32632f6c70e85/src/box.c#L812-L844
Did you implement scale_x_y= in the [yolo] layer? It is a very simple addition:
There are 3 different scale_x_y= values.
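For readers unfamiliar with scale_x_y, the following is a sketch of how such a scale is usually applied when decoding a box centre: the sigmoid output is stretched around 0.5 so the predicted centre can reach, and slightly pass, the cell border. The formula reflects my understanding of the Darknet [yolo] layer and should be treated as an assumption; the function names are illustrative.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_center(t, cell_index, grid_size, scale_x_y=1.0):
    """Decode one centre coordinate, normalized by the grid size.

    Assumed formula: sigmoid(t) * s - 0.5 * (s - 1), offset by the cell index.
    """
    offset = sigmoid(t) * scale_x_y - 0.5 * (scale_x_y - 1.0)
    return (cell_index + offset) / grid_size

# With scale 1.0 the offset stays inside (0, 1); with 1.2 it spans (-0.1, 1.1).
print(decode_center(0.0, 3, 13, scale_x_y=1.2))
print(decode_center(50.0, 0, 1, scale_x_y=1.2))
```

An inference port that omits this step would place box centres slightly differently from Darknet.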
Hi @AlexeyAB,
FPS on RTX 2080Ti of Yolov4 TkDNN (avg over 1200 images of size equal to the network input size)

Network | FP32 - BATCH=1 | FP32 - BATCH=4 | FP16 - BATCH=1 | FP16 - BATCH=4 |
---|---|---|---|---|
yolov4 320 | 116,56 | 233,16 | 202,02 | 423,29 |
yolov4 416 | 103,54 | 162,71 | 162,50 | 284,34 |
yolov4 512 | 91,63 | 131,90 | 134,94 | 206,04 |
yolov4 608 | 62,34 | 81,06 | 100,81 | 150,41 |

Configuration | pre(%) | inference(%) | post(%) |
---|---|---|---|
RTX2080Ti yolov4 FP32 | 10,22 | 79,60 | 10,18 |
RTX2080Ti yolov4 FP16 | 17,27 | 68,28 | 14,45 |
AGX Xavier yolov4 FP32 | 2,58 | 95,36 | 2,06 |
AGX Xavier yolov4 FP16 | 4,83 | 91,47 | 3,70 |

We already do so.
We use the NMS you reported.
Yes, we have implemented scale_x_y for YOLOv4.
@mive93 Hi,
Performance results for other networks are not yet available, but we're submitting a paper next week. Then we will make them public. Even though anyone could reproduce them already.
Will you publish paper on arxiv.org with AP / FPS or only FPS comparison of different models?
Do you use only FP32 (without Tensor Cores) and FP16 (with Tensor Cores), but don't use FP32/16 (Mixed-precision with Tensor Cores), because FP16 shows the same good accuracy?
Will you add manual how to measure AP / AP50 and FPS by using TkDNN+TensorRT?
Will you add a demo on a video file that checks FPS including inference + pre- and post-processing, which run in 3 CPU threads, and can use both batch=1 and batch=4? It should show detection results in the console and optionally show the video in a window (which can be switched off, because it can reduce FPS).
Did you compare inference time with batch=1 for tkDNN vs OpenCV-dnn? https://github.com/opencv/opencv/issues/17148
Do you use the same Mish-implementation as in the Darknet? https://github.com/AlexeyAB/darknet/blob/f14054ec2b49440ad488c3e28612e7a76780bc5f/src/activation_kernels.cu#L238
float softplus(float x, float threshold = 20) {
    if (x > threshold) return x; // too large
    else if (x < -threshold) return expf(x); // too small
    return logf(expf(x) + 1);
}
float mish_activation(float input) {
    const float MISH_THRESHOLD = 20;
    // output must be declared before use
    float output = input * tanhf(softplus(input, MISH_THRESHOLD));
    return output;
}
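When comparing two Mish implementations, a reference in a scripting language is handy for spot-checking values. The following is a Python transliteration of the thresholded softplus above, offered as a sketch for numeric checks only; it is not code from Darknet or tkDNN.

```python
import math

def softplus(x, threshold=20.0):
    """Thresholded softplus: avoids overflow for large |x|."""
    if x > threshold:
        return x              # log(1 + e^x) is approximately x
    if x < -threshold:
        return math.exp(x)    # log(1 + e^x) is approximately e^x
    return math.log(math.exp(x) + 1.0)

def mish(x):
    return x * math.tanh(softplus(x))

for value in (-30.0, -1.0, 0.0, 1.0, 30.0):
    print(value, mish(value))
```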
@mive93
Can you also test AVG_FPS for YOLOv4 on the Darknet (OpenCV + CUDA + cuDNN) on the same GPU 2080 Ti, for these network resolutions 320, 416, 512, 608?
By using such command:
./darknet detector demo cfg/coco.data cfg/yolov4.cfg weights/yolov4.weights yolo_test.mp4 -dont_show -ext_output
Hi @AlexeyAB,
I am sorry for the delay but I have work-related deadline for this week. I will be back to you after those, trying to address your requests :) Sorry for now
Hi @AlexeyAB, sorry for the long delay.
Will you publish paper on arxiv.org with AP / FPS or only FPS comparison of different models?
We submitted to a conference and we run experiments in terms of mAP, latency and power consumption. As soon as it's accepted we also plan to share the raw data.
Do you use only FP32 (without Tensor Cores) and FP16 (with Tensor Cores), but don't use FP32/16 (Mixed-precision with Tensor Cores), because FP16 shows the same good accuracy?
We use FP32 (without Tensor Cores) and FP32/16 (Mixed-precision with Tensor Cores) [as well as FP32/INT8], because plugins are always at FP32.
Will you add manual how to measure AP / AP50 and FPS by using TkDNN+TensorRT?
In the master branch there is already a demo that computes the mAP for each method supported by tkDNN. However, it is a bit different from yours, because each bounding box is counted only once, with its highest probability. The README explains how to compute the mAP; every precision is supported. Moreover, in the branch eval I added the export to a json, to evaluate the performance of each network/precision on codalab.
Will you add a demo on a video file that checks FPS including inference + pre- and post-processing, which run in 3 CPU threads, and can use both batch=1 and batch=4? It should show detection results in the console and optionally show the video in a window (which can be switched off, because it can reduce FPS).
I will work on the demo with batch > 1 this week, will keep you updated when I'll have something working.
Did you compare inference time with batch=1 for tkDNN vs OpenCV-dnn? opencv/opencv#17148
Never heard of that, will take a look, thanks.
Do you use the same Mish-implementation as in the Darknet?
Yes
Can you also test AVG_FPS for YOLOv4 on the Darknet (OpenCV + CUDA + cuDNN) on the same GPU 2080 Ti, for these network resolutions 320, 416, 512, 608?
Here are the results:

Size | FPS (avg) |
---|---|
320 | 100.6 |
416 | 82.5 |
512 | 69.7 |
608 | 53.6 |

@mive93 Hi, Thanks!
So, tkDNN accelerates yolov4 ~2x for batch=1 and 3x-4x for batch=4.
Size | Darknet FPS (avg) | tkDNN TensorRT FP32 FPS | tkDNN TensorRT FP16 FPS | tkDNN TensorRT FP16 batch=4 FPS | Speedup |
---|---|---|---|---|---|
320 | 100.6 | 116 | 202 | 423 | 4.2x |
416 | 82.5 | 103 | 162 | 284 | 3.5x |
512 | 69.7 | 91 | 134 | 206 | 2.9x |
608 | 53.6 | 62 | 100 | 150 | 2.8x |
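The speedup figures can be recomputed from the FPS columns. This sketch divides the tkDNN FP16 numbers by the Darknet baseline; since the inputs are already rounded, a recomputed value may differ from the Speedup column by about 0.1 in places.

```python
# FPS figures copied from the table above (RTX 2080 Ti):
# size -> (Darknet, tkDNN FP16 batch=1, tkDNN FP16 batch=4)
rows = {
    320: (100.6, 202, 423),
    416: (82.5, 162, 284),
    512: (69.7, 134, 206),
    608: (53.6, 100, 150),
}

def speedup(fps, baseline_fps):
    """Ratio of two throughputs, rounded to one decimal."""
    return round(fps / baseline_fps, 1)

speedups = {
    size: (speedup(b1, darknet), speedup(b4, darknet))
    for size, (darknet, b1, b4) in rows.items()
}

for size, (s1, s4) in speedups.items():
    print(f"{size}: batch=1 {s1}x, batch=4 {s4}x")
```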
We submitted to a conference and we run experiments in terms of mAP, latency and power consumption. As soon as it's accepted we also plan to share the raw data.
When will the conference be?
Moreover, in the branch eval I added the export to a json, to evaluate the performance of each network/precision on codalab.
It would be great if you could get identical accuracy in the future like in Darknet.
@mive93 Hi,
We use new mish-implementation, and get +3% FPS with the same AP-detection accuracy on MSCOCO testdev: https://github.com/AlexeyAB/darknet/blob/bef28445e57cd560fa3d0a24af98a562d289135b/src/activation_kernels.cu#L235-L246
More: https://github.com/AlexeyAB/darknet/issues/5452#issuecomment-627414024
So you can try to use this implementation in tkDNN.
Excuse me, @mive93, how did you get this? What command did you use? Thanks!
###############################################################################
# DARKNET 416x416 CODALAB res COCO2017 VAL #
###############################################################################
overall performance
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.471
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.710
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.510
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.636
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.357
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.561
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.587
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.382
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
Done (t=120.59s)
Hi @AlexeyAB. The conference will be in September, and it covers not only tkDNN but also other stuff (we test 5 different CNNs on 3 embedded boards with 3 different frameworks). We are also considering an arXiv paper only on tkDNN performance. When we have time, we'll probably do that. I will keep you updated if you're interested.
For the new mish function, I can include that. Thank you :)
@lazerliu I obtained those results using codalab. You first have to generate a json (in COCO format) of the detections, then submit it on the site. More info is also given in this repo's wiki.
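As a rough illustration of the COCO detection results format mentioned above: each detection is an object with `image_id`, `category_id`, `bbox` as `[x, y, width, height]` in pixels, and `score`. The input layout with corner boxes and the function name are hypothetical. COCO category ids are not contiguous, so a class index from the network usually has to be mapped to the dataset's category id first.

```python
import json

def to_coco_results(detections):
    """Convert corner-format boxes (x1, y1, x2, y2) into COCO result entries."""
    results = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        results.append({
            "image_id": det["image_id"],
            "category_id": det["category_id"],
            "bbox": [x1, y1, x2 - x1, y2 - y1],  # COCO uses top-left x, y, width, height
            "score": det["score"],
        })
    return results

# Hypothetical detection, for illustration only.
sample = [{"image_id": 139, "category_id": 1, "box": (10.0, 20.0, 40.0, 60.0), "score": 0.9}]
results = to_coco_results(sample)
print(json.dumps(results))
```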
@lazerliu Create new topic.
@mive93, I just want to validate my dataset, thank you! @AlexeyAB thanks for your reply, I've created an issue #5615 about getting all the metrics of COCO mAP using this repository.
Results for OpenCV DNN @ master (https://github.com/opencv/opencv/tree/6b0fff72d9748345c6a079e4fce49af4130d8e12):
Device: RTX 2080 Ti
Input Size | FP32 FPS | FP16 FPS | FP32 batch = 4 | FP16 batch = 4 |
---|---|---|---|---|
320 x 320 | 129.2 | 171.2 | 198 | 384 |
416 x 416 | 99.9 | 146 | 139.6 | 260.5 |
512 x 512 | 90.3 | 125.6 | 112.8 | 190.5 |
608 x 608 | 56 | 103.2 | 68.5 | 133 |
Code: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf
There are currently two open PRs which affect YOLOv4 performance. The performance will mostly improve by around 5-10%.
The timings often change slightly every time the benchmark program is run. Here is the raw output from the benchmark code:
@YashasSamaga
Can you add column, what FPS can you get by using Darknet GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 by using command
./darknet detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights test.mp4 -ext_output -dont_show -benchmark
680 x 680 doesn't load on OpenCV DNN. It says inconsistent shape at some layer.
Why did you try 680x680? It should be multiple of 32.
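A quick check for the multiple-of-32 rule, as a minimal sketch; the helper name is illustrative.

```python
def nearest_valid_sizes(size, stride=32):
    """Return the closest multiples of `stride` at or below and at or above `size`."""
    lower = (size // stride) * stride
    upper = lower if lower == size else lower + stride
    return lower, upper

print(680 % 32)                  # non-zero, so 680 is not a valid network size
print(nearest_valid_sizes(680))
print(nearest_valid_sizes(608))
```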
Can you add column, what FPS can you get by using Darknet GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 by using command
./darknet detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights test.mp4 -ext_output -dont_show -benchmark
I ran the command and waited for several seconds. I picked the best AVG_FPS that I saw. It increases initially and then seems to oscillate around an equilibrium value.
Input Size | Darknet FP16 | OCV FP32 FPS | OCV FP16 FPS | OCV FP32 batch = 4 | OCV FP16 batch = 4 |
---|---|---|---|---|---|
320 x 320 | 105.8 | 129.2 | 171.2 | 198 | 384 |
416 x 416 | 85.6 | 99.9 | 146 | 139.6 | 260.5 |
512 x 512 | 71.8 | 90.3 | 125.6 | 112.8 | 190.5 |
608 x 608 | 56.7 | 56 | 103.2 | 68.5 | 133 |
Why did you try 680x680? It should be multiple of 32.
Some of the tables in the comments in this issue had 680.
I fixed it )
I don't know if the timings for Darknet, tkDNN and OpenCV are for the same conditions/situation. Here is what the benchmark code I used does:
The OpenCV benchmark measures the time taken for net.forward to complete. The net.forward takes one blob as input and returns three cv::Mat. Each output corresponds to the detections from each [yolo] layer. NMS is not performed. The benchmark includes the time taken to transfer the input from the CPU to GPU as well as the time taken for the outputs to be transferred from the GPU to CPU. The timing endpoints are on CPU; the GPU end to end time will be slightly smaller (negligibly smaller mostly). OpenCV DNN does lazy initialization and hence the first forward pass is generally very slow. The benchmark ignores the first three runs.
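The measurement procedure described here (wall-clock timing around a forward call, with the first runs discarded) can be sketched generically. This is not the linked benchmark code; `benchmark` and the dummy workload are stand-ins, and in practice `fn` would wrap the actual forward pass.

```python
import statistics
import time

def benchmark(fn, runs=20, warmup=3):
    """Time `fn` with CPU-side endpoints, discarding `warmup` initial calls."""
    for _ in range(warmup):
        fn()  # lazy initialization makes the first passes unrepresentative
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "min": min(samples_ms),
        "max": max(samples_ms),
        "mean": statistics.mean(samples_ms),
        "stddev": statistics.pstdev(samples_ms),
    }

# Dummy workload standing in for a forward pass.
stats = benchmark(lambda: sum(range(10_000)))
print(stats)
```

Because the timer runs on the CPU, host-to-device and device-to-host transfers inside `fn` are included in the measurement.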
So, tkDNN accelerates yolov4 ~2x for batch=1 and 3x-4x for batch=4. OpenCV-dnn is ~10% slower than tkDNN-TensorRT. tkDNN: https://github.com/ceccocats/tkDNN OpenCV: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf
Size | Darknet FPS (avg) | tkDNN TensorRT FP32 FPS | tkDNN TensorRT FP16 FPS | OpenCV FP16 FPS | tkDNN TensorRT FP16 batch=4 FPS | OpenCV FP16 batch=4 FPS | tkDNN Speedup |
---|---|---|---|---|---|---|---|
320 | 100 | 116 | 202 | 171 | 423 | 384 | 4.2x |
416 | 82 | 103 | 162 | 146 | 284 | 260 | 3.5x |
512 | 69 | 91 | 134 | 125 | 206 | 190 | 2.9x |
608 | 53 | 62 | 100 | 103 | 150 | 133 | 2.8x |
I forgot to mention that I had set nms_threshold=0 in all [yolo] blocks in the configuration file. Otherwise, the NMS is done automatically in region layers.
RTX 2070S 608 x 608
without setting nms_threshold (opencv defaults to 0.2):
YOLO v4
[CUDA FP32]
init >> 1329.51ms
inference >> min = 45.596ms, max = 49.184ms, mean = 46.7278ms, stddev = 0.57918ms
[CUDA FP16]
init >> 865.449ms
inference >> min = 37.418ms, max = 43.093ms, mean = 39.4826ms, stddev = 1.24976ms
with nms_threshold=0 in all [yolo] blocks:
YOLO v4
[CUDA FP32]
init >> 1245.76ms
inference >> min = 29.934ms, max = 31.181ms, mean = 30.3622ms, stddev = 0.207436ms
[CUDA FP16]
init >> 876.087ms
inference >> min = 22.916ms, max = 28.212ms, mean = 24.5076ms, stddev = 1.09143ms
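To relate these latencies to the FPS tables elsewhere in the thread, the mean inference time can be converted to frames per second. A small sketch using the mean values printed above; it assumes batch size 1 and ignores everything outside the timed forward pass.

```python
def fps_from_latency_ms(mean_ms):
    """Frames per second for one image per forward pass."""
    return 1000.0 / mean_ms

# Mean inference times (ms) from the two runs above.
timings = {
    ("FP32", "default nms_threshold"): 46.7278,
    ("FP16", "default nms_threshold"): 39.4826,
    ("FP32", "nms_threshold=0"): 30.3622,
    ("FP16", "nms_threshold=0"): 24.5076,
}

for (precision, setting), mean_ms in timings.items():
    print(f"{precision}, {setting}: {fps_from_latency_ms(mean_ms):.1f} FPS")
```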
I have written an example which performs full NMS (not classwise) at the end instead of performing it three times during inference (which causes unnecessary context switches as NMS is performed on CPU). This barely changes the FPS.
@YashasSamaga Do you think we should request such improvement and switchable ability in OpenCV? to use
I have always wondered about the benefits of performing NMS in each yolo detection layer. Is there any advantage of doing so compared with doing one combined NMS at the end?
Doing the NMS at the end will definitely help improve performance of the OpenCV CUDA backend currently but I don't know how things will change once GPU NMS kernels are added (some work is in progress for DetectionOutput layer at https://github.com/opencv/opencv/pull/17301).
I think the best place such a thing could be introduced is in DetectionModel which is a part of the high-level model API that was recently introduced in OpenCV DNN.
I have always wondered about the benefits of performing NMS in each yolo detection layer. Is there any advantage of doing so compared with doing one combined NMS at the end?
I think no. Darknet uses 1 NMS for all yolo-layers.
I did a bit of investigation. YOLOv2 PR added NMS in region layer because there was only one region layer back then. YOLOv3 PR reused the region layer but this led to NMS being performed in each region layer. I think it's a bug which I thought was a feature all this time.
I have opened an issue https://github.com/opencv/opencv/issues/17415
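The difference between per-layer NMS and one combined NMS can be shown with a toy example. This is a plain greedy, class-agnostic NMS written for illustration; it is not the OpenCV or Darknet implementation, and the boxes, scores and IoU threshold are made up.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(dets, iou_thresh=0.45):
    """Greedy NMS over (box, score) pairs, highest score first."""
    kept = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) <= iou_thresh for k in kept):
            kept.append((box, score))
    return kept

# One list of detections per [yolo] layer (hypothetical values).
layer_outputs = [
    [((10, 10, 50, 50), 0.9)],
    [((12, 12, 52, 52), 0.8)],      # near-duplicate of the first box
    [((100, 100, 140, 140), 0.7)],
]

per_layer = [d for layer in layer_outputs for d in nms(layer)]
combined = nms([d for layer in layer_outputs for d in layer])
print(len(per_layer), len(combined))
```

Per-layer NMS cannot see the near-duplicate coming from another layer, so only the combined pass removes it.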
Hi @YashasSamaga, thank you for profiling OpenCV-dnn and comparing it with tkDNN also :)
In the last days we have released a new version of tkDNN, with also a darknet parser, the new mish, and the handling of the batches also for pre-post processing. But I haven't profiled it seriously yet. If interested, I can do it soonish.
Hi :) On the README of tkDNN you can now find the performance of Yolov4 on different boards. Here's the screenshot of the table
Dataset and the list of images taken from How to evaluate accuracy and speed of YOLOv4.
Darknet as of e08a818 and original yolov4.cfg
Darknet (FP32)
GTX 1050
overall performance
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.435
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.657
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.473
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.533
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.342
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.549
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.580
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.403
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.713
Done (t=487.90s)
The number of detections was considerably smaller in OpenCV. I eventually figured out that OpenCV was ditching detections with low confidence scores. So I added thresh=0.001 to all [yolo] blocks in yolov4.cfg. The number of detections from Darknet and OpenCV still isn't the same but they are closer than before. I suspect the difference is caused by the NMS method (OpenCV does global optimal NMS).
Code: https://gist.github.com/YashasSamaga/077a1d69c48e4cdb9957d167b7000b98
OpenCV DNN CUDA (FP32)
RTX 2080 Ti
overall performance
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.436
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.657
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.474
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.533
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.342
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.549
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.580
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.405
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.715
Done (t=329.38s)
OpenCV DNN CUDA (FP16)
RTX 2080 Ti
overall performance
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.435
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.657
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.473
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.532
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.342
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.549
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.580
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.404
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.714
Done (t=325.70s)
The numbers for OpenCV are better than Darknet. I think it's because of the NMS but I wanted to rule out the possibility of variations arising due to different choices made while selecting convolution kernels on different devices (Darknet stats were generated on GTX 1050 while OpenCV stats were generated on RTX 2080 Ti).
If I do not set thresh=0.001 in all [yolo] blocks, 0.2 is used as the confidence threshold. There is considerable performance degradation when using 0.2:
OpenCV CUDA FP16
RTX 2080 Ti
thresh = 0.2 (opencv default)
overall performance
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.400
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.583
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.445
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.231
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.431
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.500
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.313
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.461
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.471
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.275
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.505
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.604
Done (t=117.94s)
I wonder if this default behaviour in OpenCV is correct.
@YashasSamaga This is normal, since for Detection you should use optimal conf-thresh 0.2 - 0.25, while AP calculation should be done for each possible conf-thresh starting from 0.001.
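To illustrate why the confidence threshold matters so much for AP, here is a toy sketch. The detections are invented and the AP routine is a simple all-point interpolation, not the 101-point COCO evaluation behind the numbers above. Cutting the list at 0.2 removes the low-confidence tail of the precision-recall curve, so the recall that tail would have contributed is lost.

```python
def average_precision(detections, num_gt, conf_thresh):
    """Toy all-point AP for one class.

    detections: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth objects.
    This is a simplified illustration, not the pycocotools implementation.
    """
    kept = sorted(
        (d for d in detections if d[0] >= conf_thresh),
        key=lambda d: d[0],
        reverse=True,
    )
    precisions, recalls = [], []
    tp = 0
    for rank, (_, is_tp) in enumerate(kept, start=1):
        tp += int(is_tp)
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    # Make precision monotonically non-increasing (interpolated PR curve).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Invented detections: (confidence, matched a ground-truth box?).
toy_detections = [
    (0.90, True), (0.80, True), (0.60, False), (0.30, True),
    (0.15, True), (0.05, False), (0.01, True),
]
toy_num_gt = 5

ap_at_0_2 = average_precision(toy_detections, toy_num_gt, conf_thresh=0.2)
ap_at_0_001 = average_precision(toy_detections, toy_num_gt, conf_thresh=0.001)
print(ap_at_0_2, ap_at_0_001)
```

On this toy data the 0.001 threshold gives the higher AP, because the true positives at confidence 0.15 and 0.01 still add recall.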
Hello @AlexeyAB. I tried your model on OpenCV DNN and got the same result as reported. The output is a list of arrays with the following shapes:
batch_size x 17328 x 85
batch_size x 4332 x 85
batch_size x 1083 x 85
I understand that 85 corresponds to [center_x, center_y, width, height, box_confidence, class_1_score, ...]. COCO has 80 classes, so it is 4 + 1 + 80 = 85. But what do 17328, 4332 and 1083 stand for? Would you mind giving me a quick hint? Thanks.
They are grid_width * grid_height * masks * (classes + coordinates + objectness), so 1083 * 85 = 19 * 19 * 3 * (80 + 4 + 1).
@WongKinYiu But there are 3 of them: 17328, 4332 and 1083. Do you know the meaning of the other 2?
There are three yolo layers (feature pyramid):
17328 * 85 = 76 * 76 * 3 * (80 + 4 + 1)
4332 * 85 = 38 * 38 * 3 * (80 + 4 + 1)
1083 * 85 = 19 * 19 * 3 * (80 + 4 + 1)
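The arithmetic above can be checked with a short sketch. It assumes a 608x608 network input and strides of 8, 16 and 32 for the three yolo layers; neither value is stated in the reply, but both follow from the 76, 38 and 19 grid sizes. The function name is made up for illustration.

```python
def yolo_output_rows(input_size, strides=(8, 16, 32), masks=3):
    """Rows per yolo layer: grid_width * grid_height * masks.

    Assumes a square input and the strides given above (an inference
    from the grid sizes in the thread, not something read from the cfg).
    """
    return [(input_size // s) ** 2 * masks for s in strides]


num_classes = 80
row_length = num_classes + 4 + 1        # classes + coordinates + objectness

rows = yolo_output_rows(608)
for stride, n in zip((8, 16, 32), rows):
    grid = 608 // stride
    print(f"stride {stride}: {grid} x {grid} x 3 = {n} rows of {row_length} values")
```

Concatenating the three outputs gives 17328 + 4332 + 1083 = 22743 candidate boxes per image before thresholding and NMS.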
@WongKinYiu Thank you for your kindness. It helps me a lot. I checked the YOLOv3 and FPN papers and found the explanation of the feature pyramid.
@mive93 Do tkDNN benchmarks include the host/device memory transfer time?
I was looking at tkDNN source and if I have understood correctly, the input is copied from the host to device. The input on the device is then copied to TRT's device buffer, inference is done, the outputs in TRT's buffer are copied to non-TRT output buffers. The outputs are then copied to the host. The time reported by tkDNN is the time it took for copying from a device buffer to TRT's buffers and vice-versa and the inference time. Is this correct?
Hi @YashasSamaga, what you said is correct. And yes, the tkDNN benchmarks I reported include only inference; preprocessing and postprocessing are left out.
@mive93 Do you use overlapping in 3 threads/streams?
Yes, it reduces latency. But does it reduce FPS?
Btw, the OpenCV benchmarks that I reported included the GPU-CPU transfer times. They total about 1.1ms on RTX 2080 Ti (0.3ms for input, 0.53ms for output1, 0.25ms for output2 and 0.03ms for output3) for single image inference with pinned host memory. If this extra time is deducted from the OpenCV timings I reported, I think OpenCV is faster than tkDNN on RTX 2080 Ti for single image inference.
OpenCV master (as of today) takes 9.5ms for single image inference (inclusive of the 1.1ms) and tkDNN takes 9.0ms. Subtracting 1.1ms gives ~8.4ms for OpenCV. tkDNN also makes a device-to-device copy during inference, which OpenCV doesn't, but D2D copies are much faster than H2D or D2H copies (probably very small compared to 1.1ms).
Anyway, OpenCV and tkDNN are close enough that any benchmark will depend on these minute details. So it's not meaningful to compare with numbers very close to each other.
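The timing comparison above is plain arithmetic, sketched here using only the figures quoted in this comment. The conversion from per-image time to FPS assumes strictly sequential single-image inference with no overlap between transfer and compute, which is a simplification.

```python
# Host/device transfer times reported above for RTX 2080 Ti, in milliseconds.
transfer_ms = {"input": 0.30, "output1": 0.53, "output2": 0.25, "output3": 0.03}
total_transfer_ms = sum(transfer_ms.values())

opencv_total_ms = 9.5   # OpenCV, transfers included
tkdnn_ms = 9.0          # tkDNN, host transfers excluded

# OpenCV time with the host/device transfers removed.
opencv_compute_ms = opencv_total_ms - total_transfer_ms


def fps(ms_per_image):
    """Frames per second for sequential single-image inference."""
    return 1000.0 / ms_per_image


print(f"transfers: {total_transfer_ms:.2f} ms")
print(f"OpenCV incl. transfers: {fps(opencv_total_ms):.1f} FPS")
print(f"OpenCV excl. transfers: {fps(opencv_compute_ms):.1f} FPS")
print(f"tkDNN: {fps(tkdnn_ms):.1f} FPS")
```

The gap between the two frameworks is about the same size as the transfer overhead itself, which is why the outcome depends on what each benchmark includes.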
If 3 operations are overlapped, then they increase the latency, but do not affect the FPS.
Dear Alexey,
first of all, thank you for your work. I have been doing some tests with your new yolov4, and I have some questions. I compared the performance of Yolov4, Yolov3 and CSPResNext50-Panet-SPP (the one I found also in your repo) on two different GPUs, using input size 416x416, and I have checked the mAP for the COCO2017 validation set.
Here are the results (both FPS and mAP have been computed using your code):
However, I have noticed that you do not compare with the third network in your paper. I was wondering what the reason was, and whether I am perhaps doing something wrong.
Thank you in advance.