AlexeyAB commented 4 years ago

Discussion: https://www.reddit.com/r/MachineLearning/comments/hu7lyt/p_yolov4tiny_speed_1770_fps_tensorrtbatch4/

Full structure: structure of yolov4-tiny.cfg model

YOLOv4-tiny released: 40.2% AP50, 371 FPS (GTX 1080 Ti) / 330 FPS (RTX 2070)

1770 FPS - on GPU RTX 2080Ti - (416x416, fp16, batch=4) tkDNN/TensorRT https://github.com/ceccocats/tkDNN/issues/59#issuecomment-652269964
1353 FPS - on GPU RTX 2080Ti - (416x416, fp16, batch=4) OpenCV 4.4.0 (including: transfering CPU->GPU and GPU->CPU) (excluding: nms, pre/post-processing) https://github.com/AlexeyAB/darknet/issues/6067#issuecomment-656604015
39 FPS - 25ms latency - on Jetson Nano - (416x416, fp16, batch=1) tkDNN/TensorRT https://github.com/ceccocats/tkDNN/issues/59#issuecomment-652157334
290 FPS - 3.5ms latency - on Jetson AGX - (416x416, fp16, batch=1) tkDNN/TensorRT https://github.com/ceccocats/tkDNN/issues/59#issuecomment-652157334
42 FPS - on CPU Core i7 7700HQ (4 Cores / 8 Logical Cores) - (416x416, fp16, batch=1) OpenCV 4.4.0 (compiled with OpenVINO backend) https://github.com/AlexeyAB/darknet/issues/6067#issuecomment-656693529
20 FPS on CPU ARM Kirin 990 - Smartphone Huawei P40 https://github.com/AlexeyAB/darknet/issues/6091#issuecomment-651502121 - Tencent/NCNN library https://github.com/Tencent/ncnn
120 FPS on nVidia Jetson AGX Xavier - MAX_N - Darknet framework
371 FPS on GPU GTX 1080 Ti - Darknet framework

cmp

source	yolov3-tiny (800x800)	yolov4-tiny (800x800)

CSTEZCAN commented 4 years ago

I know that Alex is telling the truth. Can you test on my system? i want to see this

you can install it from here; https://github.com/ceccocats/tkDNN and use models described above with COCO images from COCO dataset website..

vinorth-v commented 4 years ago

To detect small objects you must also use 3-yolo-layers in yolov4-tiny.

Hello, can you please share the cfg file for this?

spinlud commented 4 years ago

Hi guys! How this compares to yolo_v3_tiny_pan3_aa_ae_mixup_scale_giou.cfg model in speed/accuracy?

marcusbrito commented 4 years ago

To detect small objects you must also use 3-yolo-layers in yolov4-tiny.

I think a 3-yolo-layers yolov4-tiny would be a game changer for my application. If someone can make the change and retrain in the COCO dataset, I would appreciate it very much.

AlexeyAB commented 4 years ago

Comparison of accuracy of Yolov3-tiny (width=800 height=800) vs Yolov4-tiny (width=800 height=800)

source	yolov3-tiny (800x800)	yolov4-tiny (800x800)

choochtech commented 4 years ago

Great work @AlexeyAB !

wwzh2015 commented 4 years ago

Comparison of accuracy of Yolov3-tiny (width=800 height=800) vs Yolov4-tiny (width=800 height=800)

source yolov3-tiny (800x800) yolov4-tiny (800x800)

Hi, Can you share the class Dataset?

AlexeyAB commented 4 years ago

@wwzh2015 This is the default yolov4-tiny model that is trained on http://mscoco.org/ dataset (just set width=800 height=800 in cfg-files): https://github.com/AlexeyAB/darknet#pre-trained-models

rchglev commented 4 years ago

Opencv 4.4.0 can load this model in your thought? Now it failed to build. Thank you.

kermado commented 4 years ago

This is amazing, but it fails to detect small objects. My use case only has small objects (about 10 - 30 pixels). Should I regenerate anchors or use a third yolo layer? If the latter, how can I change the cfg file to add a third layer? Thanks!

wwzh2015 commented 4 years ago

This is amazing, but it fails to detect small objects. My use case only has small objects (about 10 - 30 pixels). Should I regenerate anchors or use a third yolo layer? If the latter, how can I change the cfg file to add a third layer? Thanks!

I think use a third yolo layer that is exactly.

awaisbajwaml commented 4 years ago

Comparison of accuracy of Yolov3-tiny (width=800 height=800) vs Yolov4-tiny (width=800 height=800)

source yolov3-tiny (800x800) yolov4-tiny (800x800)

@AlexeyAB world-class! in coming years it will only improve.

ZZQ-sanmenxia commented 4 years ago

371 FPS on GPU GTX 1080 Ti - Darknet framework
Is the input resulation 419*416 can perform 371 FPS on GPU GTX 1080 Ti?

YashasSamaga commented 4 years ago

OpenCV uses cuDNN and does not support TensorRT. OpenCV will be slower on low-end devices as convolutions become a bigger bottleneck where TensorRT based solutions will outperform cuDNN based solutions. The CUDA backend in OpenCV is just six months old. So be sure to check again in the future.

Prefer tkDNN if your device is not a high-end GPU. Use tkDNN if you need INT8 precision. On high-end device, if tiny performance gains matter a lot, try both OpenCV and tkDNN.

You can extract much higher FPS on low-end devices using a pipeline with detection, tracking, etc. You can find an outdated example here.

CUDA version: 10.2 cuDNN version: 7.6.5

Benchmark Code: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf

YOLOv4

Performance

Numbers in the tables indicate FPS.

Device: RTX 2080 Ti

Input Size	OCV CUDA FP32 (batch = 1)	OCV CUDA FP32 (batch = 4)	OCV CUDA FP16 (batch = 1)	OCV CUDA FP16 (batch = 4)
320 x 320	137	208	183	430
416 x 416	106	148	159	294
512 x 512	95	121	138	216
608 x 608	60	72	115	149

Device: GTX 1080 Ti

Input Size	OCV CUDA FP32 (batch = 1)	OCV CUDA FP32 (batch = 4)
320 x 320	87	177
416 x 416	75	116
512 x 512	64	87
608 x 608	48	58

stats for batch = 1 on RTX 2080 Ti

``` YOLO v4 608x608 [CUDA FP32] init >> 454.679ms inference >> min = 16.485ms, max = 16.921ms, mean = 16.6191ms, stddev = 0.0808131ms [CUDA FP16] init >> 300.069ms inference >> min = 8.651ms, max = 8.683ms, mean = 8.66774ms, stddev = 0.00478416ms YOLO v4 512x512 [CUDA FP32] init >> 406.6ms inference >> min = 10.374ms, max = 11.83ms, mean = 10.4926ms, stddev = 0.143843ms [CUDA FP16] init >> 275.52ms inference >> min = 7.192ms, max = 8.503ms, mean = 7.22117ms, stddev = 0.128906ms YOLO v4 416x416 [CUDA FP32] init >> 367.613ms inference >> min = 9.316ms, max = 11.175ms, mean = 9.41672ms, stddev = 0.181294ms [CUDA FP16] init >> 372.56ms inference >> min = 6.282ms, max = 6.31ms, mean = 6.29363ms, stddev = 0.00552427ms YOLO v4 320x320 [CUDA FP32] init >> 334.725ms inference >> min = 7.251ms, max = 7.374ms, mean = 7.28856ms, stddev = 0.0288375ms [CUDA FP16] init >> 341.046ms inference >> min = 5.424ms, max = 8.38ms, mean = 5.46248ms, stddev = 0.293226ms ```

stats for batch = 4 on RTX 2080 Ti

``` YOLO v4 608x608 [CUDA FP32] init >> 766.004ms inference >> min = 54.923ms, max = 55.624ms, mean = 55.3088ms, stddev = 0.115878ms [CUDA FP16] init >> 637.243ms inference >> min = 26.779ms, max = 27.024ms, mean = 26.8849ms, stddev = 0.0541266ms YOLO v4 512x512 [CUDA FP32] init >> 612.49ms inference >> min = 32.767ms, max = 33.538ms, mean = 33.0728ms, stddev = 0.166096ms [CUDA FP16] init >> 477.195ms inference >> min = 18.458ms, max = 18.646ms, mean = 18.5382ms, stddev = 0.0484753ms YOLO v4 416x416 [CUDA FP32] init >> 513.95ms inference >> min = 26.658ms, max = 27.231ms, mean = 26.8539ms, stddev = 0.132812ms [CUDA FP16] init >> 500.416ms inference >> min = 13.607ms, max = 13.644ms, mean = 13.6244ms, stddev = 0.00956832ms YOLO v4 320x320 [CUDA FP32] init >> 420.021ms inference >> min = 19.025ms, max = 19.451ms, mean = 19.2406ms, stddev = 0.075745ms [CUDA FP16] init >> 310.825ms inference >> min = 9.307ms, max = 9.356ms, mean = 9.32954ms, stddev = 0.0113886ms ```

stats for batch = 1 on GTX 1080 Ti

``` YOLO v4 608x608 [CUDA FP32] init >> 448.645ms inference >> min = 20.607ms, max = 21.236ms, mean = 20.815ms, stddev = 0.145216ms YOLO v4 512x512 [CUDA FP32] init >> 400.154ms inference >> min = 15.615ms, max = 18.014ms, mean = 15.7292ms, stddev = 0.258522ms YOLO v4 416x416 [CUDA FP32] init >> 362.689ms inference >> min = 13.141ms, max = 15.595ms, mean = 13.246ms, stddev = 0.25ms YOLO v4 320x320 [CUDA FP32] init >> 332.818ms inference >> min = 11.404ms, max = 11.838ms, mean = 11.5086ms, stddev = 0.107688ms ```

stats for batch = 4 on GTX 1080 Ti

``` YOLO v4 608x608 [CUDA FP32] init >> 754.766ms inference >> min = 68.059ms, max = 70.152ms, mean = 69.2089ms, stddev = 0.466139ms YOLO v4 512x512 [CUDA FP32] init >> 616.897ms inference >> min = 45.413ms, max = 46.848ms, mean = 46.0973ms, stddev = 0.310541ms YOLO v4 416x416 [CUDA FP32] init >> 505.867ms inference >> min = 33.985ms, max = 34.727ms, mean = 34.4462ms, stddev = 0.152694ms YOLO v4 320x320 [CUDA FP32] init >> 412.886ms inference >> min = 22.386ms, max = 23.136ms, mean = 22.5925ms, stddev = 0.133614ms ```

Accuracy

Calculated using the dataset and list from How to evaluate accuracy and speed of YOLOv4

Code: https://gist.github.com/YashasSamaga/077a1d69c48e4cdb9957d167b7000b98

Note: thresh=0.001 was added to all [yolo] blocks in yolov4.cfg

Device: RTX 2080 Ti

Darknet FP32

OCV CUDA FP32

OCV CUDA FP16

YOLOv4 Tiny

Numbers in the tables indicate FPS.

Device: RTX 2080 Ti

Input Size	OCV CUDA FP32 (batch = 1)	OCV CUDA FP32 (batch = 4)	OCV CUDA FP16 (batch = 1)	OCV CUDA FP16 (batch = 4)
416 x 416	754	973	773	1353

Device: GTX 1080 Ti

Input Size	OCV CUDA FP32 (batch = 1)	OCV CUDA FP32 (batch = 4)
416 x 416	557	792

stats for batch = 1 on RTX 2080 Ti

``` YOLO v4 Tiny [CUDA FP32] init >> 22.774ms inference >> min = 1.316ms, max = 1.333ms, mean = 1.32554ms, stddev = 0.0030688ms [CUDA FP16] init >> 24.165ms inference >> min = 1.286ms, max = 1.305ms, mean = 1.29334ms, stddev = 0.00341797ms ```

stats for batch = 4 on RTX 2080 Ti

``` YOLO v4 Tiny [CUDA FP32] init >> 30.706ms inference >> min = 4.101ms, max = 4.135ms, mean = 4.11251ms, stddev = 0.00458048ms [CUDA FP16] init >> 30.023ms inference >> min = 2.951ms, max = 2.965ms, mean = 2.95713ms, stddev = 0.00218366ms ```

stats for batch = 1 on GTX 1080 Ti

``` YOLO v4 Tiny [CUDA FP32] init >> 29.717ms inference >> min = 1.782ms, max = 1.824ms, mean = 1.79643ms, stddev = 0.00887007ms ```

stats for batch = 4 on GTX 1080 Ti

``` YOLO v4 Tiny [CUDA FP32] init >> 41.984ms inference >> min = 5.027ms, max = 5.509ms, mean = 5.05131ms, stddev = 0.0477418ms ```

GPU-CPU data transfer and comparing with tkDNN

Timings reported by tkDNN do not include the time spent in transferring data between CPU and GPU. If I have understood correctly, tkDNN allows you to manually manage the transfer process. You can overlap the data transfer process with inference and hide the data transfer costs completely. Therefore, you can achieve the performance that tkDNN reports even if you have/need data on CPU.

Timings reported in this post include the transfer time which makes a significant chunk of the inference time. OpenCV doesn't allow you to control the data transfer process nor does it allow you to provide cv::cuda::GpuMat as input. Hence, you won't be able to easily hide the data transfer time. You can mitigate it partially or fully by using multiple cv::dnn::Net objects. Using multiple cv::dnn::Net objects also has the additional benefit of keeping the GPU busy always (i.e. you will reduce GPU idle time between two inference workloads).

OpenCV partially mitigates the data transfer cost if your network has multiple outputs. The GPU to CPU transfer of output begins immediately when the output becomes available. YOLOv4 608x608 has three output blobs and they take 0.5ms, 0.3ms and 0.04ms for data transfer on RTX 2080 Ti (input transfer takes 0.36ms more). OpenCV begins transferring the 0.5ms output while the GPU is busy computing the 0.4ms and 0.04ms outputs. This way OpenCV completely hides the transfer cost of 0.5ms output and 0.36ms output. Overall, only the transfer of input from CPU to GPU and the transfer of the last output from GPU to CPU is visible in the benchmarks (which together are still significant). This mitigation strategy doesn't work well with YOLOv4 Tiny. OpenCV computes the smaller output blobs first and then the largest output blob. Hence, the gains from this mitigation strategy aren't as high as YOLOv4. It's possible to hack around with the order of the layers in yolov4-tiny.cfg and somehow trick the importer to schedule the layers such that the largest output blob is computed first.

If you're interested, here are the extra data transfer costs that were incurred. To calculate the inference only FPS, you have to deduct this from the mean inference time reported and then calculate the FPS.

GTX 1080 Ti excess for YOLOv4 608x608 (batch = 1): 0.4ms (0.36ms input + 0.04ms last output) GTX 1080 Ti excess for YOLOv4Tiny 416x416 (batch = 1): 0.22ms (0.17ms input + 0.055ms last output)

RTX 2080 Ti excess for YOLOv4 608x608 (batch = 1): 0.4ms (0.36ms input + 0.04ms last output) RTX 2080 Ti excess for YOLOv4 608x608 (batch = 4): 1.65ms (1.52ms input + 0.12ms last output) RTX 2080 Ti excess for YOLOv4Tiny 416x416 (batch = 1): 0.23ms (0.17ms input + 0.05ms last output) RTX 2080 Ti excess for YOLOv4Tiny 416x416 (batch = 4): 0.93ms (0.72ms input + 0.2ms last output)

For example, for YOLOv4 Tiny on RTX 2080 Ti: 2.957ms - 0.93ms = 2.027ms which is 0.507ms per batch item (1972 FPS).

It's not very meaningful to do this procedure. The data transfer cost becomes negligible as you use low-end devices (the computation takes most of the time).

mmaaz60 commented 4 years ago

This is what you can expect in the upcoming OpenCV 4.4 release.

CUDA version: 10.2 cuDNN version: 7.6.5

Benchmark Code: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf

YOLOv4

Performance

Device: RTX 2080 Ti

Input Size OCV CUDA FP32 (batch = 1) OCV CUDA FP32 (batch = 4) OCV CUDA FP16 (batch = 1) OCV CUDA FP16 (batch = 4) 320 x 320 137 208 183 430 416 x 416 106 148 159 294 512 x 512 95 121 138 216 608 x 608 60 72 115 149 Device: GTX 1080 Ti

Input Size OCV CUDA FP32 (batch = 1) OCV CUDA FP32 (batch = 4) 320 x 320 87 177 416 x 416 75 116 512 x 512 64 87 608 x 608 48 58 stats for batch = 1 on RTX 2080 Ti stats for batch = 4 on RTX 2080 Ti stats for batch = 1 on GTX 1080 Ti stats for batch = 4 on GTX 1080 Ti

Accuracy

Calculated using the dataset and list from How to evaluate accuracy and speed of YOLOv4

Code: https://gist.github.com/YashasSamaga/077a1d69c48e4cdb9957d167b7000b98

Note: thresh=0.001 was added to all [yolo] blocks in yolov4.cfg

Device: RTX 2080 Ti

Darknet FP32 OCV CUDA FP32 OCV CUDA FP16

YOLOv4 Tiny

Device: RTX 2080 Ti

Input Size OCV CUDA FP32 (batch = 1) OCV CUDA FP32 (batch = 4) OCV CUDA FP16 (batch = 1) OCV CUDA FP16 (batch = 4) 416 x 416 754 973 773 1353 Device: GTX 1080 Ti

Input Size OCV CUDA FP32 (batch = 1) OCV CUDA FP32 (batch = 4) 416 x 416 557 792 stats for batch = 1 on RTX 2080 Ti stats for batch = 4 on RTX 2080 Ti stats for batch = 1 on GTX 1080 Ti stats for batch = 4 on GTX 1080 Ti

GPU-CPU data transfer and comparing with tkDNN

Timings reported by tkDNN do not include the time spent in transferring data between CPU and GPU. If I have understood correctly, tkDNN allows you to manually manage the transfer process. You can overlap the data transfer process with inference and hide the data transfer costs completely. Therefore, you can achieve the performance that tkDNN reports even if you have/need data on CPU.

Timings reported in this post include the transfer time which makes a significant chunk of the inference time. OpenCV doesn't allow you to control the data transfer process nor does it allow you to provide cv::cuda::GpuMat as input. Hence, you won't be able to easily hide the data transfer time. You can mitigate it partially or fully by using multiple cv::dnn::Net objects. Using multiple cv::dnn::Net objects also has the additional benefit of keeping the GPU busy always (i.e. you will reduce GPU idle time between two inference workloads).

OpenCV partially mitigates the data transfer cost if your network has multiple outputs. The GPU to CPU transfer of output begins immediately when the output becomes available. YOLOv4 608x608 has three output blobs and they take 0.5ms, 0.3ms and 0.04ms for data transfer on RTX 2080 Ti (input transfer takes 0.36ms more). OpenCV begins transferring the 0.5ms output while the GPU is busy computing the 0.4ms and 0.04ms outputs. This way OpenCV completely hides the transfer cost of 0.5ms output and 0.36ms output. Overall, only the transfer of input from CPU to GPU and the transfer of the last output from GPU to CPU is visible in the benchmarks (which together are still significant). This mitigation strategy doesn't work well with YOLOv4 Tiny. OpenCV computes the smaller output blobs first and then the largest output blob. Hence, the gains from this mitigation strategy aren't as high as YOLOv4. It's possible to hack around with the order of the layers in yolov4-tiny.cfg and somehow trick the importer to schedule the layers such that the largest output blob is computed first.

If you're interested, here are the extra data transfer costs that were incurred. To calculate the inference only FPS, you have to deduct this from the mean inference time reported and then calculate the FPS.

GTX 1080 Ti excess for YOLOv4 608x608 (batch = 1): 0.4ms (0.36ms input + 0.04ms last output) GTX 1080 Ti excess for YOLOv4Tiny 416x416 (batch = 1): 0.22ms (0.17ms input + 0.055ms last output)

RTX 1080 Ti excess for YOLOv4 608x608 (batch = 1): 0.4ms (0.36ms input + 0.04ms last output) RTX 1080 Ti excess for YOLOv4 608x608 (batch = 4): 1.65ms (1.52ms input + 0.12ms last output) RTX 1080 Ti excess for YOLOv4Tiny 416x416 (batch = 1): 0.23ms (0.17ms input + 0.05ms last output) RTX 1080 Ti excess for YOLOv4Tiny 416x416 (batch = 4): 0.93ms (0.72ms input + 0.2ms last output)

For example, for YOLOv4 Tiny on RTX 2080 Ti: 2.957ms - 0.93ms = 2.027ms which is 0.507ms per batch item (1972 FPS).

It's not very meaningful to do this procedure. The data transfer cost becomes negligible as you use low-end devices (the computation takes most of the time).

Thank you for the great post. Any comments or benchmarks while running inference on CPU with IE Backend? Thanks

AlexeyAB commented 4 years ago

@YashasSamaga Geat!

It seems that OpenCV is now faster than tkDNN-TensorRT for yolov4.cfg in the most cases even with the cost of data transmission!

Can you test, what AVG_FPS do you get for YOLOv4-tiny 416x416 on RTX 2080Ti with flag -benchmark and wait 10 seconds?

YashasSamaga commented 4 years ago

@mmaaz60

7700HQ, GTX 1050 Mobile CUDA 10.2, cuDNN 7.6.5 (CUDA timings in detailed stats collapsible) MKL 2020.1.217 OpenVINO 2020.3.194

OCV CPU (batch = 1)	OCV CPU (batch = 4)	OCV IE CPU (batch = 1)	OCV IE CPU (batch = 4)
28	26	42	39

detailed stats for batch = 1

``` YOLO v4 [OCV CPU] init >> 991.24ms inference >> min = 644.96ms, max = 729.546ms, mean = 681.779ms, stddev = 14.5892ms OpenCV(ocl4dnn): consider to specify kernel configuration cache directory via OPENCV_OCL4DNN_CONFIG_PATH parameter. [OCV OpenCL] init >> 4756.8ms inference >> min = 717.424ms, max = 729.811ms, mean = 722.888ms, stddev = 2.72144ms [OCV OpenCL FP16] init >> 4399.15ms inference >> min = 642.341ms, max = 650.567ms, mean = 645.951ms, stddev = 1.47902ms [IE CPU] init >> 3188.47ms inference >> min = 764.854ms, max = 801.476ms, mean = 774.116ms, stddev = 6.2149ms permute_161 is CPU permute_150 is CPU permute_139 is CPU [IE OpenCL] init >> 81394.1ms inference >> min = 840.67ms, max = 863.344ms, mean = 850.657ms, stddev = 4.33734ms permute_161 is CPU permute_150 is CPU permute_139 is CPU [IE OpenCL FP16] init >> 90492ms inference >> min = 584.408ms, max = 606.073ms, mean = 599.357ms, stddev = 3.60988ms [CUDA FP32] init >> 318.422ms inference >> min = 93.828ms, max = 94.906ms, mean = 94.3823ms, stddev = 0.235932ms YOLO v4 Tiny [OCV CPU] init >> 49.777ms inference >> min = 33.964ms, max = 41.39ms, mean = 35.2225ms, stddev = 1.12457ms [OCV OpenCL] init >> 333.55ms inference >> min = 31.188ms, max = 33.465ms, mean = 31.8098ms, stddev = 0.269842ms [OCV OpenCL FP16] init >> 359.553ms inference >> min = 33.403ms, max = 34.931ms, mean = 33.9486ms, stddev = 0.373696ms [IE CPU] init >> 194.137ms inference >> min = 22.768ms, max = 33.13ms, mean = 23.7671ms, stddev = 0.99034ms permute_30 is CPU permute_37 is CPU [IE OpenCL] init >> 9606.9ms inference >> min = 56.085ms, max = 58.207ms, mean = 56.9763ms, stddev = 0.489639ms permute_30 is CPU permute_37 is CPU [IE OpenCL FP16] init >> 10028.5ms inference >> min = 47.272ms, max = 49.925ms, mean = 48.2162ms, stddev = 0.517044ms [CUDA FP32] init >> 32.134ms inference >> min = 6.317ms, max = 6.486ms, mean = 6.38243ms, stddev = 0.0328566ms ```

detailed stats for batch = 4

``` YOLO v4 Tiny [OCV CPU] init >> 154.099ms inference >> min = 144.885ms, max = 162.957ms, mean = 150.885ms, stddev = 2.49257ms [IE CPU] init >> 366.507ms inference >> min = 101.042ms, max = 113.067ms, mean = 102.515ms, stddev = 1.63638ms [CUDA FP32] init >> 54.44ms inference >> min = 20.667ms, max = 22.001ms, mean = 20.8692ms, stddev = 0.271758ms ```

@AlexeyAB

./darknet detector demo cfg/coco.data cfg/yolov4-tiny.cfg yolov4-tiny.weights test.mp4 -benchmark

GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1

gives 443 AVG FPS on RTX 2080 Ti after ~20s

(including: nms, pre/post-processing, transfering CPU->GPU and GPU->CPU)

NMS/pre/post-processing is not included in the timings I reported. CPU->GPU, inference and GPU->CPU is included.

YashasSamaga commented 4 years ago

@AlexeyAB I think the tkDNN timings in darknet readme is outdated. The new tkDNN timings beats OpenCV in most cases but OpenCV seems to outperform tkDNN with the data transfer correction.

If someone has their input already on GPU and require outputs on GPU, tkDNN will beat OpenCV. If the input is on CPU and outputs are required on CPU, OpenCV will likely beat tkDNN.

AlexeyAB commented 4 years ago

@YashasSamaga

Thanks, I fixed it.

Did somebody try to implement NMS and/or pre-processing (resizing/converting RGB->float) on GPU for OpenCV / tkDNN-TRT?
What will be faster?
1. inference for all 3 x [yolo] layer, NMS on GPU, data-transfer GPU->CPU
2. inference [yolo]_1, GPU->CPU, inference [yolo]_2, GPU->CPU, inference [yolo]_3, GPU->CPU, NMS on CPU
Why do you get lower FPS for batch=4 (39 FPS) than for batch=1 (42 FPS) ? Or is it actually 39 x 4 = 156 FPS?

OCV CPU (batch = 1) OCV CPU (batch = 4) OCV IE CPU (batch = 1) OCV IE CPU (batch = 4)

28 26 42 39

What FPS do you get for on Darknet and your CPU for yolov4.cfg and yolov4-tiny.cfg? ./darknet detector demo cfg/coco.data cfg/yolov4-tiny.cfg yolov4-tiny.weights test.mp4 -benchmark

GPU=0
CUDNN=0
CUDNN_HALF=0
OPENCV=1
OPENMP=1
AVX=1

YashasSamaga commented 4 years ago

7700HQ

608 x 608 input for YOLOv4 416 x 416 input for YOLOv4 Tiny

Numbers in the table indicate FPS.

Model	Darknet	OCV CPU (batch = 1)	OCV IE (batch = 1)	OCV CPU (batch = 4)	OCV IE (batch = 4)
YOLOv4	0.2	1.4	1.0	1.36	1.16
YOLOv4Tiny	3.4	28	43	25.6	40.4

So batch = 4 is again slower. Or batch = 1 is wrongly fast. Or the way I benchmark is not correct for CPU workloads.

Darknet configuration

``` GPU=0 CUDNN=0 CUDNN_HALF=0 OPENCV=1 AVX=1 OPENMP=1 ``` ``` GPU isn't used Used AVX Used FMA & AVX2 OpenCV version: 4.4.0 ```

stats for batch = 1

``` YOLO v4 [OCV CPU] init >> 1157.03ms inference >> min = 692.25ms, max = 721.211ms, mean = 706.168ms, stddev = 7.08872ms [IE CPU] init >> 3536.66ms inference >> min = 890.889ms, max = 986.282ms, mean = 965.575ms, stddev = 18.8663ms [CUDA FP32] init >> 400.976ms inference >> min = 93.714ms, max = 94.648ms, mean = 94.2087ms, stddev = 0.214239ms YOLO v4 Tiny [OCV CPU] init >> 52.138ms inference >> min = 34.405ms, max = 51.909ms, mean = 35.7253ms, stddev = 3.06929ms [IE CPU] init >> 197.124ms inference >> min = 22.969ms, max = 24.368ms, mean = 23.3028ms, stddev = 0.275993ms [CUDA FP32] init >> 30.454ms inference >> min = 6.32ms, max = 6.43ms, mean = 6.36987ms, stddev = 0.0295561ms ```

stats for batch = 4

``` YOLO v4 [OCV CPU] init >> 3150.34ms inference >> min = 2909.84ms, max = 2975.12ms, mean = 2933.99ms, stddev = 17.1756ms [IE CPU] init >> 6583.96ms inference >> min = 3405.3ms, max = 3482.47ms, mean = 3452.57ms, stddev = 15.9687ms [CUDA FP32] init >> 833.885ms inference >> min = 352.486ms, max = 357.206ms, mean = 354.321ms, stddev = 1.34339ms YOLO v4 Tiny [OCV CPU] init >> 154.536ms inference >> min = 140.077ms, max = 203.018ms, mean = 156.076ms, stddev = 18.0619ms [IE CPU] init >> 362.521ms inference >> min = 98.315ms, max = 105.527ms, mean = 99.7235ms, stddev = 1.27514ms [CUDA FP32] init >> 52.426ms inference >> min = 20.757ms, max = 23.366ms, mean = 21.356ms, stddev = 0.679238ms ```

AlexeyAB commented 4 years ago

@YashasSamaga Thanks!

YOLOv4Tiny | 3.4 FPS (Darknet)

So you get 2x slower FPS

on Core i7 7700HQ (2.8/3.8 GHz turbo boost, 8HT cores) Mobile - 3.4 FPS https://ark.intel.com/content/www/ru/ru/ark/products/97185/intel-core-i7-7700hq-processor-6m-cache-up-to-3-80-ghz.html
than I on Core i7 6700K (4.0/4.2 GHz turbo boost, 8HT cores) Desktop - 6.1 FPS https://ark.intel.com/content/www/ru/ru/ark/products/88195/intel-core-i7-6700k-processor-8m-cache-up-to-4-20-ghz.html

Did you close all other applications (web-browser, ...) when test it with -benchmark flag?

YashasSamaga commented 4 years ago

Did you close all other applications (web-browser, ...) when test it with -benchmark flag?

Yes. Darknet is using ~780% of the CPU (read it off the top command).

Benchmarking on CPUs is always confusing. I have seen FPS double by using a different BLAS library in my projects. I have also seen FPS triple moving from 7700HQ to 7700K!

Anafeyka commented 4 years ago

Comparison of Yolo4-tiny, Yolo4-tiny3l, Yolo4-tiny3l-spp. Custom dataset https://www.youtube.com/watch?v=RtEogGr3aW8&feature https://www.youtube.com/watch?v=hpjh_SEXtm0

Anafeyka commented 4 years ago

@Anafeyka Could you share your Yolo4-tiny3l-spp cfg? Your results are very good.

yolov4-tiny_3lSPP.txt

mmaaz60 commented 4 years ago

@Anafeyka Could you share your Yolo4-tiny3l-spp cfg? Your results are very good.

yolov4-tiny_3lSPP.txt

Thanks @Anafeyka,

What is the speed comparison of Yolov4-tiny vs Yolov4-tiny3l & Yolov4-tiny3l-spp?

Anafeyka commented 4 years ago

@Anafeyka Could you share your Yolo4-tiny3l-spp cfg? Your results are very good.

yolov4-tiny_3lSPP.txt

Thanks @Anafeyka,

What is the speed comparison of Yolov4-tiny vs Yolov4-tiny3l & Yolov4-tiny3l-spp?

I can't check the speed on a lot of equipment right now But I can give you all the weights. https://drive.google.com/file/d/1aSFz5X9OkK8ZeDJoeTc-NrXw4ojf53t6/view?usp=sharing

vinorth-v commented 4 years ago

@Anafeyka Could you share your Yolo4-tiny3l-spp cfg? Your results are very good.

yolov4-tiny_3lSPP.txt

Could you please also share the one of yolo4 tiny 3l please?

Anafeyka commented 4 years ago

@Anafeyka Не могли бы вы поделиться своим Yolo4-tiny3l-sppCFG? Ваши результаты очень хорошие.

yolov4-tiny_3lSPP.txt

Не могли бы вы также поделиться с Yolo4 Tiny 3l, пожалуйста?

1 https://github.com/AlexeyAB/darknet/tree/master/cfg 2 All cfg + weights https://drive.google.com/file/d/1aSFz5X9OkK8ZeDJoeTc-NrXw4ojf53t6/view?usp=sharing

AlexeyAB commented 4 years ago

Discussion: https://www.reddit.com/r/MachineLearning/comments/hu7lyt/p_yolov4tiny_speed_1770_fps_tensorrtbatch4/

AlexeyAB commented 4 years ago

OpenCV 4.4.0 is realeasd, it supports YOLOv4 and YOLOv4-tiny: https://github.com/opencv/opencv/wiki/ChangeLog#version440

All OpenCV releases: https://opencv.org/releases/

Discussion: https://github.com/AlexeyAB/darknet/issues/6284

LukeAI commented 4 years ago

was the conclusion that opencv is faster than tkdnn for yolov4 / yolov4-tiny?

YashasSamaga commented 4 years ago

was the conclusion that opencv is faster than tkdnn for yolov4 / yolov4-tiny?

There is no boolean answer to this. It depends on the device. tkDNN Is likely faster on all low-end devices. You might have to test both frameworks on high-end devices. The location (CPU or GPU) of your input and output might also make a difference.

choochtech commented 4 years ago

@Anafeyka did you use the https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-tiny.conv.29 weights when you trained Yolo4-tiny3l and Yolo4-tiny3l-spp for faces ? great work !

Anafeyka commented 4 years ago

@Anafeyka did you use the https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-tiny.conv.29 weights when you trained Yolo4-tiny3l and Yolo4-tiny3l-spp for faces ? great work !

No darknet.exe detector train custom/hf_obj.data custom/yolov4-tiny_3l.cfg -map

marvision-ai commented 4 years ago

was the conclusion that opencv is faster than tkdnn for yolov4 / yolov4-tiny?

There is no boolean answer to this. It depends on the device. tkDNN Is likely faster on all low-end devices. You might have to test both frameworks on high-end devices. The location (CPU or GPU) of your input and output might also make a difference.

@LukeAI @YashasSamaga , I have been testing OpenCV vs. TKDNN on Jetson Xavier.

It seems like TKDNN is a bit faster than OpenCV...But it's licensing causes minor problems if you want to use it for commercial purposes.

As Yashas mentioned, I do not know if TKDNN includes the preprocessing/NMS/postprocessing in their overall timing.

choochtech commented 4 years ago

@Anafeyka did you use the https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-tiny.conv.29 weights when you trained Yolo4-tiny3l and Yolo4-tiny3l-spp for faces ? great work !

No darknet.exe detector train custom/hf_obj.data custom/yolov4-tiny_3l.cfg -map

Thanks @Anafeyka

LukeAI commented 4 years ago

was the conclusion that opencv is faster than tkdnn for yolov4 / yolov4-tiny?

There is no boolean answer to this. It depends on the device. tkDNN Is likely faster on all low-end devices. You might have to test both frameworks on high-end devices. The location (CPU or GPU) of your input and output might also make a difference.

@LukeAI @YashasSamaga , I have been testing OpenCV vs. TKDNN on Jetson Xavier.

It seems like TKDNN is a bit faster than OpenCV...But it's licensing causes minor problems if you want to use it for commercial purposes.

you mean the GPL?

marvision-ai commented 4 years ago

was the conclusion that opencv is faster than tkdnn for yolov4 / yolov4-tiny?

There is no boolean answer to this. It depends on the device. tkDNN Is likely faster on all low-end devices. You might have to test both frameworks on high-end devices. The location (CPU or GPU) of your input and output might also make a difference.

@LukeAI @YashasSamaga , I have been testing OpenCV vs. TKDNN on Jetson Xavier. It seems like TKDNN is a bit faster than OpenCV...But it's licensing causes minor problems if you want to use it for commercial purposes.

you mean the GPL?

Yes, I have read a few comments in different threads where people had to go the opencv route to avoid that license. Again, that's not my case but I guess some industries/companies do not accept GPL in their products?

alexanderfrey commented 4 years ago

@Anafeyka Did you try to convert your model to tensorrt ?

Anafeyka commented 4 years ago

@Anafeyka Did you try to convert your model to tensorrt ?

Nope!

xaerincl commented 4 years ago

@Anafeyka Do you know if its possible to start training my own tiny-spp by starting with your trained weights instead of the tiny-yolov4 ones?

Anafeyka commented 4 years ago

@xaerincl Bad idea. I used a specially corrupted set of training data with incorrect labels.

wwzh2015 commented 4 years ago

How is Top -1 accuracy the yolo4-tiny backbone in ImageNet and the AP more than pelee-ssd in coco?

sctrueew commented 4 years ago

@AlexeyAB Hi,

Can we use random=1 in y-v4-tiny? and should we use num_of_clusters = 6 for calc_anchors?

Thanks

Anafeyka commented 4 years ago

@zpmmehrdad 1 Yes. 2 Yes.

sharoseali commented 4 years ago

@AlexeyAB thanks for v4_tiny. I want to ask about the model architecture of tiny v4. As yolov4 has backbone of cspdraknet53, Yolov3 has backbone of darknet53 and yolov3_tiny has backbone of darknet53_tiny. Similarly what is the backbone of tiny_v4. I am going to convert tiny yolov4 weights to keras h5 format, so I want to clear this thing before.. Thank U so much.

AlexeyAB commented 4 years ago

yolov3 - darknet53.cfg
yolov3-tiny - darknet.cfg
yolov4 - csdarknet53-omega.cfg
yolov4 - darkv4.cfg

sharoseali commented 4 years ago

yolov3 - darknet53.cfg

yolov3-tiny - darknet.cfg

yolov4 - csdarknet53-omega.cfg

yolov4 - darkv4.cfg

Thanks @AlexeyAB but you mentioned darkv4.cfg .. its not in cfg folder of this repo. I know we have yolov4_tiny.cfg. I was just looking for backbone code for tiny_yolov4, like in this repository. I have to add backbone code. For example for darknet53_tiny the author write backbone in python as follows:

def darknet53_tiny(input_data):
    input_data = common.convolutional(input_data, (3, 3, 3, 16))
    input_data = tf.keras.layers.MaxPool2D(2, 2, 'same')(input_data)
    input_data = common.convolutional(input_data, (3, 3, 16, 32))
    input_data = tf.keras.layers.MaxPool2D(2, 2, 'same')(input_data)
    input_data = common.convolutional(input_data, (3, 3, 32, 64))
    input_data = tf.keras.layers.MaxPool2D(2, 2, 'same')(input_data)
    input_data = common.convolutional(input_data, (3, 3, 64, 128))
    input_data = tf.keras.layers.MaxPool2D(2, 2, 'same')(input_data)
    input_data = common.convolutional(input_data, (3, 3, 128, 256))
    route_1 = input_data
    input_data = tf.keras.layers.MaxPool2D(2, 2, 'same')(input_data)
    input_data = common.convolutional(input_data, (3, 3, 256, 512))
    input_data = tf.keras.layers.MaxPool2D(2, 1, 'same')(input_data)
    input_data = common.convolutional(input_data, (3, 3, 512, 1024))
    return route_1, input_data

and model defination like:

def YOLOv3_tiny(input_layer, NUM_CLASS):
    route_1, conv = backbone.darknet53_tiny(input_layer)

    conv = common.convolutional(conv, (1, 1, 1024, 256))

    conv_lobj_branch = common.convolutional(conv, (3, 3, 256, 512))
    conv_lbbox = common.convolutional(conv_lobj_branch, (1, 1, 512, 3 * (NUM_CLASS + 5)), activate=False, bn=False)

    conv = common.convolutional(conv, (1, 1, 256, 128))
    conv = common.upsample(conv)
    conv = tf.concat([conv, route_1], axis=-1)

    conv_mobj_branch = common.convolutional(conv, (3, 3, 128, 256))
    conv_mbbox = common.convolutional(conv_mobj_branch, (1, 1, 256, 3 * (NUM_CLASS + 5)), activate=False, bn=False)

    return [conv_mbbox, conv_lbbox]

I want to write same code for tiny YOLO_v4. Any help or suggestion......

AlexeyAB commented 4 years ago

Look at this: https://github.com/hunglc007/tensorflow-yolov4-tflite/blob/master/core/backbone.py#L107-L147

wuzhenxin1989 commented 4 years ago

@AlexeyAB 我用官方的yolov4-tiny.weights 和yolov4-tiny.cfg、模型参数416*416，在RTX2080TI测试dog.jpg图片的时间约20ms，测试时间很慢，怎么和您测试时间差别这么大。求解答

AlexeyAB / darknet

YOLOv4-tiny released: 40.2% AP50, 371 FPS (GTX 1080 Ti), 1770 FPS tkDNN/TensorRT #6067

YOLOv4

Performance

Accuracy

YOLOv4 Tiny

GPU-CPU data transfer and comparing with tkDNN

YOLOv4

Performance

Accuracy

YOLOv4 Tiny

GPU-CPU data transfer and comparing with tkDNN