AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

opencv dnn & yolov4, yolov4-tiny Performance #6245

LanguageFlowCho opened this issue 3 years ago

LanguageFlowCho commented 3 years ago

Hi guys

I wanted to check the detection performance of yolov4 and yolov4-tiny with OpenCV DNN, but I was surprised that my results were so different from the results posted in other issues.

So I'd like to ask AlexeyAB how the existing benchmark numbers were obtained. I ran detection for yolov4 and yolov4-tiny with OpenCV DNN using the setup below:

Ubuntu 18.04, CUDA 10.2, cuDNN 7.6.5, OpenCV 4.3.0

darknet:
  • yolov4.cfg & weights, coco.data: 80 FPS
  • yolov4-tiny.cfg & weights, coco.data: 500 FPS

cudnn:
  • yolov4.cfg & weights, coco.data: 80 FPS
  • yolov4-tiny.cfg & weights, coco.data: 400 FPS

Compared to the benchmark performance in existing issues, my performance is too low. Can you tell me what I am missing? For reference, the GPU is an RTX 2080 Ti.

YashasSamaga commented 3 years ago

Latest OpenCV benchmark: https://github.com/AlexeyAB/darknet/issues/6067#issuecomment-656604015

Please share the code you used to measure the FPS.

Can you try using this to measure FPS?

EDIT: Sorry, I think I misunderstood. Are you complaining about OpenCV's performance or having difficulties reproducing the reported performance or complaining about Darknet's performance?

LanguageFlowCho commented 3 years ago

Thank you for the answer, YashasSamaga. I am not disputing the OpenCV or Darknet benchmarks themselves. I ran the OpenCV DNN benchmark with Python code.

I wrote the source code by referring to Adrian Rosebrock's blog. https://www.pyimagesearch.com/2020/02/10/opencv-dnn-with-nvidia-gpus-1549-faster-yolo-ssd-and-mask-r-cnn/

When I ran the above code in Python, it showed much worse performance than I expected. I saw the benchmark you posted and used the same settings. I wanted to check whether there are any problems caused by using OpenCV with cuDNN from Python.

If you know, I would like to understand the cause, and how large a performance drop to expect, when using OpenCV with cuDNN from Python.

YashasSamaga commented 3 years ago

@LanguageFlowCho The code in that article isn't very good for computing the FPS. Here is my comment in the same post:

I just ran your code on my PC. It turns out that the OpenCV DNN on GPU is so fast that the non-DNN part of the code takes up 85% of the time. Hence, the benchmark code isn’t really measuring the DNN performance; rather, it is measuring the pre/postprocessing and the IO part.

Here is some estimate of how much impact the non-DNN part is having on your device:

This post reports an FPS of 12 => ~83ms per frame.

On RTX 2080 Ti, it takes 10ms for the inference. It should be even faster on V100. Let’s be conservative and take it to be 10ms.

Approximately 73ms of the time is spent in IO and non-DNN stuff.

For a fair comparison of DNN performance (the non-DNN time is significant for CPU inference too), I’d recommend measuring the time taken to execute net.forward(outLayerNames).

Almost all of the 85% can be reduced to a few milliseconds or less if the pre/postprocessing is done in C++. It's just that python is very slow.

OpenCV added a high-level API for the DNN module which does the preprocessing and postprocessing for you. This is much faster as it's done in C++. You can call it from Python too. I have a Python version here which uses it. The script reports the FPS for inference + preprocessing + postprocessing.

Note that the FPS reported by the script still won't exactly match with the benchmarks I reported. The benchmarks I reported include inference and GPU-CPU data transfer time.

If you are writing a performance-critical application and need the best performance, you have to use C++ instead of python.

EDIT: fixed the link to the python script

ryumansang commented 3 years ago

@YashasSamaga So if we compile the .cpp file as a .so file and use it from Python via Cython, can we expect roughly the detection speed of C++?

YashasSamaga commented 3 years ago

@ryumansang Yes. But you could try with dnn_DetectionModel first. It does preprocessing and postprocessing (including NMS) and returns a small list of boxes. In most cases, this should give you decent performance without python slowing things down considerably.
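Roughly, using it from Python looks like the minimal sketch below (the file names, image and thresholds are placeholders, not anything from this thread):

import cv2

# placeholder model files; point these at your own weights/cfg
net = cv2.dnn.readNet("yolov4-tiny.weights", "yolov4-tiny.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/255.0, swapRB=True)

frame = cv2.imread("dog.jpg")  # placeholder image
# preprocessing, inference, NMS and box decoding all happen inside detect()
classes, scores, boxes = model.detect(frame, 0.25, 0.4)
for class_id, score, box in zip(classes, scores, boxes):
    x, y, w, h = box
    cv2.rectangle(frame, (int(x), int(y)), (int(x + w), int(y + h)), (0, 255, 0), 2)

The input frame can be any size; detect() resizes it to the network size given in setInputParams().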

PallHaraldsson commented 3 years ago

If you are writing a performance-critical application and need the best performance, you have to use C++ instead of python.

No, you just need to avoid Python. C++ is most often used for that, but you can use Julia and it's as fast and compiles to GPUs too. And works well from Python too if you need that (pyjulia), and Julia can call Python code (e.g. tensorflow, keras), or use state-of-the-art Julia ML libraries.

LanguageFlowCho commented 3 years ago

Thank you for the answer, YashasSamaga.

If I have to use Python, I wonder whether performance still decreases when using it through something like Cython. If you have experimented with this, please let me know.

Thanks for the answer, PallHaraldsson.

But I have a question: if you use Python + Cython instead of plain Python, what results would you expect?

YashasSamaga commented 3 years ago

If I have to use Python, I wonder whether performance still decreases when using it through something like Cython. If you have experimented with this, please let me know.

The idea is that the code is written in C/C++, where you do pretty much everything; you call a function from Python which returns the frame or whatever final result you need. The bulk of the calculation is done in C++ and Python just makes a function call. I haven't tried it though. My C++ version gives 19 FPS on a GTX 1050 while the Python version with dnn_DetectionModel gives 13 FPS (a lot of time is taken to label and draw boxes on the frame).

What results would you expect?

If you move all the heavy code out of python to C++, it should be as fast as the C++ version.

LanguageFlowCho commented 3 years ago


Thanks for the answer, YashasSamaga. Your answer was very helpful.

marcusbrito commented 3 years ago

I've built a Flask API that takes an image via a POST request and runs yolov4 with dnn_DetectionModel to return the detected objects.

I'm currently passing one image at each request. Is it faster to pass multiple images at the same time? If so, how much faster?

I'm using it on a CPU, but maybe I will use it on a GPU.

YashasSamaga commented 3 years ago

I'm currently passing one image at each request. Is it faster to pass multiple images at the same time? If so, how much faster?

dnn_DetectionModel does not support a batch of images. You can write a version in C++ which can process a batch of images.

It's difficult to say whether using a batch of images will be faster on CPU. Batch inference was slower on my CPU but it probably depends on the CPU specs. It's best to test on your device.

You can test performance using this, and you can try different batch sizes by setting default_batch_size.
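If you want a rough batched sanity check from Python without the C++ version, cv2.dnn.blobFromImages can push several frames through the raw net in a single forward pass. A minimal sketch (placeholder file names; decoding the batched YOLO outputs is left out):

import cv2
import numpy as np

# placeholder model files
net = cv2.dnn.readNetFromDarknet("yolov4-tiny.cfg", "yolov4-tiny.weights")

batch_size = 4  # plays the role of default_batch_size in the benchmark code
frames = [np.random.randint(0, 255, (416, 416, 3), dtype=np.uint8)
          for _ in range(batch_size)]

# blobFromImages stacks all frames into one NCHW blob of shape (batch_size, 3, 416, 416)
blob = cv2.dnn.blobFromImages(frames, 1/255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outs = net.forward(net.getUnconnectedOutLayersNames())
# the outputs now contain predictions for every frame in the batch

Timing that forward call for different batch_size values gives a rough idea of whether batching helps on your device.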

hlacikd commented 3 years ago

A question: I understand that DNN_TARGET_CUDA_FP16 is slower on a GTX 1080 (since it has no half-precision cores); the difference is literally 20 FPS vs 250 FPS. But why does the same apply to the Jetson Nano? DNN_TARGET_CUDA gives 17 FPS, while DNN_TARGET_CUDA_FP16 gives 1 FPS.

I am using a custom yolov4-tiny model on both.

YashasSamaga commented 3 years ago

why does the same apply to the Jetson Nano?

That's unusual. How did you measure the FPS? Note that the first few forward passes will be slow due to lazy initialization.

Based on this comment https://github.com/ceccocats/tkDNN/issues/59#issuecomment-652157334, tkDNN is faster than OpenCV in FP32. tkDNN will beat OpenCV in weaker devices. OpenCV has better-tuned kernels but it relies on cuDNN for convolution. TensorRT convolutions are faster than cuDNN. Hence, if convolutions are the bottleneck (often the case in low-end devices), a TensorRT solution will outcompete a cuDNN based solution no matter how tuned non-convolution operations are.

cuDNN 8 has exposed a lot of tuning options so it might be possible to achieve (I hope so) TensorRT performance with cuDNN. We have to wait for the official release.

@AlexeyAB I think there should be a disclaimer next to the benchmark in the README. OpenCV will lose to tkDNN in low-end devices as convolutions will vastly dominate the inference time. OpenCV doesn't support TensorRT yet so convolutions are slow. OpenCV will probably match or outperform in high-end GPUs only (where gains from non-conv ops might match or outweigh loss in conv performance by not using TensorRT).

hlacikd commented 3 years ago

why does the same apply to the Jetson Nano?

That's unusual. How did you measure the FPS? Note that the first few forward passes will be slow due to lazy initialization.

Your unmodified yolov4.py Python script, with OpenCV compiled yesterday from git (master branch) with CUDA and cuDNN, on Jetson JetPack 4.4 (CUDA 10.2, cuDNN 8).

Could it be an issue with cuDNN 8? Should I try the older cuDNN 7?

net = cv2.dnn.readNet("yolov4-tiny_lp_final.weights", "yolov4-tiny_lp.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

result:

FPS: 0.22 (excluding drawing time of 0.01ms) on the first pass, then steady at FPS: 0.90–0.91 (excluding drawing time of 0.01ms) for the remaining ~40 passes

and

net = cv2.dnn.readNet("yolov4-tiny_lp_final.weights", "yolov4-tiny_lp.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

result:

FPS: 0.31 (excluding drawing time of 0.02ms) on the first pass, then FPS: 16.2–18.0 (excluding drawing time of 0.01ms) for the remaining ~25 passes

YashasSamaga commented 3 years ago

Could it be an issue with cuDNN 8? Should I try the older cuDNN 7?

It could be. cuDNN 8 does have performance regressions. Can you call net.enableFusion(false) and measure the FPS again?

The release notes say:

The performance of cudnnConvolutionBiasActivationForward() is slower than v7.6 in most cases. This is being actively worked on and performance optimizations will be available in the upcoming releases.

OpenCV once had a massive performance regression on high-end GPUs because of using cudnnConvolutionBiasActivationForward() for depthwise convolutions. MobileNet inference time went from 2ms to 35ms.
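For reference, disabling fusion is a one-line change on the already configured net (a sketch with placeholder model files):

import cv2

net = cv2.dnn.readNet("yolov4-tiny.weights", "yolov4-tiny.cfg")  # placeholder files
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
net.enableFusion(False)  # skip fused layers (e.g. conv + bias + activation) for this run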

hlacikd commented 3 years ago

@hlacik replied here https://gist.github.com/YashasSamaga/e2b19a6807a13046e399f4bc3cca3a49#gistcomment-3381161

Could it be an issue with cuDNN 8? Should I try the older cuDNN 7?

It could be. cuDNN 8 does have performance regressions. Can you call net.enableFusion(false) and measure the FPS again?

I am currently rebuilding OpenCV under JetPack 4.3 (CUDA 10.0, cuDNN 7) and will get back to you. Thank you for the prompt responses!

net.enableFusion(False) is not helping

hlacikd commented 3 years ago

I am currently rebuilding OpenCV under JetPack 4.3 (CUDA 10.0, cuDNN 7) and will get back to you. Thank you for the prompt responses!

net.enableFusion(False) is not helping

So I am running JetPack 4.3 with CUDA 10.0 and cuDNN 7.6.3, and it performs better now:

net = cv2.dnn.readNet("/home/david/Downloads/yolov4-tiny_lp_final.weights", "/home/david/Downloads/yolov4-tiny_lp.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

FPS: 0.29 (excluding drawing time of 0.01ms) FPS: 18.36 (excluding drawing time of 0.01ms) FPS: 18.69 (excluding drawing time of 0.01ms) FPS: 19.27 (excluding drawing time of 0.01ms) FPS: 19.25 (excluding drawing time of 0.01ms) FPS: 19.31 (excluding drawing time of 0.01ms) FPS: 19.26 (excluding drawing time of 0.01ms) FPS: 19.14 (excluding drawing time of 0.01ms) FPS: 19.32 (excluding drawing time of 0.01ms) FPS: 19.23 (excluding drawing time of 0.01ms)

and

net = cv2.dnn.readNet("/home/david/Downloads/yolov4-tiny_lp_final.weights", "/home/david/Downloads/yolov4-tiny_lp.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

FPS: 20.68 (excluding drawing time of 0.01ms) FPS: 20.20 (excluding drawing time of 0.01ms) FPS: 20.78 (excluding drawing time of 0.01ms) FPS: 20.84 (excluding drawing time of 0.01ms) FPS: 20.86 (excluding drawing time of 0.01ms) FPS: 20.41 (excluding drawing time of 0.01ms)

So the problem is caused by cuDNN 8 / JetPack 4.4. Also notice that performance with FP16 is only slightly (~2 FPS) better than FP32. Why is that?

YashasSamaga commented 3 years ago

Also notice that performance with FP16 is only slightly (~2 FPS) better than FP32. Why is that?

I don't have any Jetson device to test on. OpenCV makes some redundant copies on Jetson and it is at the mercy of cuDNN for good convolution performance.

tkDNN is faster on Jetson Nano. It achieves around 40FPS in FP16 (https://github.com/ceccocats/tkDNN/issues/59#issuecomment-652157334) but this does not include NMS, preprocessing and postprocessing.

The CUDA backend in OpenCV DNN is just around 6-7 months old. TensorRT isn't supported yet. Most of OpenCV is optimized on GTX 1050 and occasionally on 1080 Ti and 2080 Ti.

hlacikd commented 3 years ago

Thanks, I understand. I tried TensorFlow-to-TensorRT model conversion and it does not make much difference on the Jetson Nano either; I suppose it is because of the Tegra architecture. The OpenCV implementation is slightly faster than using this repo on the Jetson Nano, so I will stick with it for now. I am aware of tkDNN, but I am unable to use it due to its licensing for commercial purposes. Thank you for your help.

marvision-ai commented 3 years ago

@hlacik Do you mind sharing how you are building the latest 4.4 on the jetson? Just want to make sure I am doing everything correctly :)

hlacikd commented 3 years ago

@hlacik Do you mind sharing how you are building the latest 4.4 on the jetson? Just want to make sure I am doing everything correctly :)

nothing extra special, this is enough

export ARCH_BIN=5.3

cmake -D WITH_CUDA=ON -D CUDA_ARCH_BIN=${ARCH_BIN} \
      -D CUDA_FAST_MATH=ON -D OPENCV_DNN_CUDA=ON \
      -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib/modules \
      -D CPACK_BINARY_DEB=ON ../

make -j$(nproc)
make install -j$(nproc)
make package -j$(nproc)

CPACK_BINARY_DEB enables the option to create .deb packages when you call make package -j$(nproc) (but remember you still have to run make install first).

marvision-ai commented 3 years ago

@hlacik Perfect, I have the same. Good to know. Thank you!

marvision-ai commented 3 years ago

Hello @YashasSamaga @hlacik

I have compiled OpenCV 4.4.0 on a Jetson Xavier with JetPack 4.3. If I grab the built cv2.so Python module and use it with the Python script provided, here are my results:

yolov4-tiny models trained on COCO (from darknet repo).

288x288 =

FPS: 194.09 (excluding drawing time of 1.62ms)
FPS: 181.78 (excluding drawing time of 1.69ms)
FPS: 157.21 (excluding drawing time of 1.51ms)
FPS: 194.96 (excluding drawing time of 1.71ms)
FPS: 199.45 (excluding drawing time of 1.62ms)
FPS: 178.28 (excluding drawing time of 1.40ms)
FPS: 197.23 (excluding drawing time of 1.95ms)
FPS: 196.20 (excluding drawing time of 1.85ms)
FPS: 182.11 (excluding drawing time of 1.82ms)
FPS: 194.94 (excluding drawing time of 2.04ms)
FPS: 188.17 (excluding drawing time of 1.98ms)
FPS: 191.82 (excluding drawing time of 1.84ms)
FPS: 157.15 (excluding drawing time of 2.41ms)
FPS: 187.06 (excluding drawing time of 1.69ms)
FPS: 186.38 (excluding drawing time of 1.69ms)
FPS: 160.41 (excluding drawing time of 1.94ms)
FPS: 168.09 (excluding drawing time of 1.86ms)
FPS: 191.66 (excluding drawing time of 2.43ms)
FPS: 189.03 (excluding drawing time of 1.67ms)
FPS: 176.98 (excluding drawing time of 1.80ms)
FPS: 154.79 (excluding drawing time of 1.57ms)
FPS: 186.41 (excluding drawing time of 1.89ms)
FPS: 191.90 (excluding drawing time of 1.54ms)
FPS: 172.85 (excluding drawing time of 1.77ms)
FPS: 153.40 (excluding drawing time of 1.72ms)
FPS: 162.99 (excluding drawing time of 1.64ms)
FPS: 199.21 (excluding drawing time of 1.60ms)
FPS: 199.83 (excluding drawing time of 1.57ms)
FPS: 171.77 (excluding drawing time of 1.54ms)
FPS: 202.16 (excluding drawing time of 1.60ms)
FPS: 184.72 (excluding drawing time of 1.69ms)
FPS: 181.63 (excluding drawing time of 1.66ms)
FPS: 202.10 (excluding drawing time of 1.58ms)
FPS: 182.42 (excluding drawing time of 1.79ms)
FPS: 175.27 (excluding drawing time of 1.97ms)
FPS: 200.15 (excluding drawing time of 2.01ms)

416x416 =

FPS: 126.70 (excluding drawing time of 2.82ms)
FPS: 142.04 (excluding drawing time of 2.03ms)
FPS: 131.38 (excluding drawing time of 2.09ms)
FPS: 144.74 (excluding drawing time of 2.03ms)
FPS: 131.04 (excluding drawing time of 2.37ms)
FPS: 135.24 (excluding drawing time of 2.10ms)
FPS: 132.27 (excluding drawing time of 2.02ms)
FPS: 129.18 (excluding drawing time of 2.16ms)
FPS: 143.98 (excluding drawing time of 2.02ms)
FPS: 131.75 (excluding drawing time of 2.07ms)
FPS: 131.77 (excluding drawing time of 1.98ms)
FPS: 119.94 (excluding drawing time of 1.94ms)
FPS: 145.68 (excluding drawing time of 1.96ms)
FPS: 124.45 (excluding drawing time of 2.13ms)
FPS: 134.19 (excluding drawing time of 2.04ms)
FPS: 147.19 (excluding drawing time of 1.98ms)
FPS: 146.16 (excluding drawing time of 2.01ms)
FPS: 133.64 (excluding drawing time of 1.98ms)
FPS: 146.54 (excluding drawing time of 1.78ms)
FPS: 132.53 (excluding drawing time of 1.92ms)
FPS: 129.77 (excluding drawing time of 2.02ms)
FPS: 147.27 (excluding drawing time of 2.10ms)
FPS: 122.59 (excluding drawing time of 2.09ms)
FPS: 137.13 (excluding drawing time of 2.42ms)
FPS: 122.48 (excluding drawing time of 2.34ms)
FPS: 144.37 (excluding drawing time of 2.36ms)
FPS: 134.38 (excluding drawing time of 2.20ms)
FPS: 133.62 (excluding drawing time of 1.92ms)
FPS: 146.47 (excluding drawing time of 2.00ms)
FPS: 147.12 (excluding drawing time of 2.17ms)
FPS: 145.66 (excluding drawing time of 2.02ms)
FPS: 142.45 (excluding drawing time of 2.03ms)
FPS: 146.36 (excluding drawing time of 1.95ms)
FPS: 117.36 (excluding drawing time of 1.77ms)
FPS: 133.15 (excluding drawing time of 1.90ms)
FPS: 122.33 (excluding drawing time of 1.79ms)
FPS: 146.69 (excluding drawing time of 1.81ms)
FPS: 130.04 (excluding drawing time of 2.07ms)
FPS: 142.76 (excluding drawing time of 1.84ms)
FPS: 147.70 (excluding drawing time of 1.75ms)
FPS: 133.32 (excluding drawing time of 2.19ms)
FPS: 147.00 (excluding drawing time of 2.15ms)
FPS: 143.01 (excluding drawing time of 2.22ms)
FPS: 132.86 (excluding drawing time of 2.14ms)
FPS: 116.58 (excluding drawing time of 2.19ms)
FPS: 141.35 (excluding drawing time of 2.18ms)
FPS: 123.85 (excluding drawing time of 2.24ms)
FPS: 133.17 (excluding drawing time of 2.12ms)
FPS: 115.26 (excluding drawing time of 2.18ms)
FPS: 132.74 (excluding drawing time of 2.16ms)
FPS: 136.18 (excluding drawing time of 2.38ms)
FPS: 144.22 (excluding drawing time of 2.40ms)
FPS: 131.69 (excluding drawing time of 2.45ms)
FPS: 142.46 (excluding drawing time of 2.36ms)
FPS: 144.13 (excluding drawing time of 2.23ms)
FPS: 130.47 (excluding drawing time of 2.14ms)
FPS: 132.78 (excluding drawing time of 2.21ms)
FPS: 135.87 (excluding drawing time of 2.40ms)
FPS: 146.19 (excluding drawing time of 2.38ms)
FPS: 118.22 (excluding drawing time of 2.35ms)
FPS: 142.91 (excluding drawing time of 2.35ms)
FPS: 127.80 (excluding drawing time of 2.19ms)
FPS: 142.63 (excluding drawing time of 2.27ms)
FPS: 125.38 (excluding drawing time of 2.23ms)
FPS: 136.97 (excluding drawing time of 2.34ms)
FPS: 146.16 (excluding drawing time of 2.34ms)
FPS: 144.86 (excluding drawing time of 2.52ms)
FPS: 130.15 (excluding drawing time of 2.54ms)
FPS: 140.57 (excluding drawing time of 2.25ms)
FPS: 145.93 (excluding drawing time of 2.62ms)
FPS: 134.17 (excluding drawing time of 2.27ms)
FPS: 130.54 (excluding drawing time of 2.54ms)
FPS: 126.88 (excluding drawing time of 2.55ms)
FPS: 145.32 (excluding drawing time of 2.67ms)

608x608 =

FPS: 74.64 (excluding drawing time of 2.17ms)
FPS: 83.63 (excluding drawing time of 2.43ms)
FPS: 80.10 (excluding drawing time of 2.49ms)
FPS: 73.91 (excluding drawing time of 2.55ms)
FPS: 84.07 (excluding drawing time of 2.33ms)
FPS: 78.86 (excluding drawing time of 2.27ms)
FPS: 78.27 (excluding drawing time of 2.13ms)
FPS: 84.89 (excluding drawing time of 2.17ms)
FPS: 81.33 (excluding drawing time of 2.18ms)
FPS: 82.33 (excluding drawing time of 2.19ms)
FPS: 79.05 (excluding drawing time of 2.21ms)
FPS: 74.64 (excluding drawing time of 2.18ms)
FPS: 85.26 (excluding drawing time of 2.41ms)
FPS: 86.33 (excluding drawing time of 2.11ms)
FPS: 83.71 (excluding drawing time of 2.28ms)
FPS: 81.44 (excluding drawing time of 2.14ms)
FPS: 75.36 (excluding drawing time of 2.24ms)
FPS: 82.90 (excluding drawing time of 2.02ms)
FPS: 81.88 (excluding drawing time of 2.19ms)
FPS: 78.14 (excluding drawing time of 2.19ms)
FPS: 82.59 (excluding drawing time of 2.12ms)
FPS: 81.04 (excluding drawing time of 2.22ms)
FPS: 83.82 (excluding drawing time of 2.25ms)
FPS: 81.31 (excluding drawing time of 2.24ms)
FPS: 80.48 (excluding drawing time of 2.20ms)
FPS: 85.17 (excluding drawing time of 2.14ms)
FPS: 85.52 (excluding drawing time of 2.06ms)
FPS: 79.44 (excluding drawing time of 2.10ms)
FPS: 81.84 (excluding drawing time of 2.23ms)
FPS: 79.77 (excluding drawing time of 2.08ms)
FPS: 81.78 (excluding drawing time of 2.22ms)
FPS: 85.55 (excluding drawing time of 2.28ms)
FPS: 77.70 (excluding drawing time of 2.06ms)
FPS: 80.88 (excluding drawing time of 2.13ms)
FPS: 81.80 (excluding drawing time of 2.41ms)
FPS: 73.83 (excluding drawing time of 2.17ms)
FPS: 81.54 (excluding drawing time of 2.31ms)
FPS: 83.62 (excluding drawing time of 2.44ms)
FPS: 82.83 (excluding drawing time of 2.35ms)

@YashasSamaga Do you know why it varies so much (makes it hard to calculate reliable FPS)? Is that due to some NMS calculation based on the amount of objects on the particular frame?

I have confirmed that net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16) speeds up inference by a considerable amount.

The numbers I get with tkDNN are considerably faster, but as you said above, it does not include preprocessing/NMS/postprocessing. Also, the licensing of the project is a conflict as well.

thank you for the great implementation! I have an Xavier at my disposal, so please let me know if there are any other benchmarks you want.

YashasSamaga commented 3 years ago

The FPS you reported are for DNN_TARGET_CUDA or DNN_TARGET_CUDA_FP16?

@YashasSamaga Do you know why it varies so much (makes it hard to calculate reliable FPS)? Is that due to some NMS calculation based on the amount of objects on the particular frame?

GPU performance is often very stable compared to CPU. Assuming that you have no other program using the GPU, I can think of a couple of reasons for the high variance:

  • CPU-GPU transfer involves additional synchronization
  • NMS performance depends on the input image as you said
  • includes preprocessing and postprocessing

The synchronization and CPU workload make factors like the operating system's scheduling decisions play a role in the FPS measurement. These effects are highly variable (they depend on other processes and whatnot). It's difficult to say for sure without profiling, but I strongly believe that this is the cause based on previous experience.

but as you said above, it does not include preprocessing/NMS/postprocessing

I have a C++ version here which can measure FPS excluding preprocessing, NMS and postprocessing. This can provide a more stable FPS measurement that is more comparable to tkDNN's FPS. It would still count CPU-GPU transfer time though.

marvision-ai commented 3 years ago

The FPS you reported are for DNN_TARGET_CUDA or DNN_TARGET_CUDA_FP16?

@YashasSamaga Do you know why it varies so much (makes it hard to calculate reliable FPS)? Is that due to some NMS calculation based on the amount of objects on the particular frame?

GPU performance is often very stable compared to CPU. Assuming that you have no other program using the GPU, I can think of a couple of reasons for the high variance:

  • CPU-GPU transfer involves additional synchronization
  • NMS performance depends on the input image as you said
  • includes preprocessing and postprocessing

The synchronization and CPU workload make factors like the operating system's scheduling decisions play a role in the FPS measurement. These effects are highly variable (they depend on other processes and whatnot). It's difficult to say for sure without profiling, but I strongly believe that this is the cause based on previous experience.

but as you said above, it does not include preprocessing/NMS/postprocessing

I have a C++ version here which can measure FPS excluding preprocessing, NMS and postprocessing. This can provide a more stable FPS measurement that is more comparable to tkDNN's FPS. It would still count CPU-GPU transfer time though.

@YashasSamaga The FPS is for when net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16) is used.

Ah, that makes sense. I forgot about the whole CPU synchronization aspect.

Another question for you: Does model.detect automatically resize the frame based on the model.setInputParams(size=(416, 416), scale=1/256) ?

YashasSamaga commented 3 years ago

Another question for you: Does model.detect automatically resize the frame based on the model.setInputParams(size=(416, 416), scale=1/256) ?

Yes. You don't have to resize beforehand to the network's size. It will be done automatically.

YashasSamaga commented 3 years ago

@marvision-ai Can you try running this on your device 5-6 times if possible:

import cv2
import numpy as np
import time

net = cv2.dnn.readNet("yolov4-tiny.cfg", "yolov4-tiny.weights")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

frame = np.random.randint(255, size=(416, 416, 3), dtype=np.uint8)
blob = cv2.dnn.blobFromImage(frame, 0.00392, (416, 416), [0, 0, 0], True, False)

# warmup
for i in range(10):
    net.setInput(blob)
    detections = net.forward(net.getUnconnectedOutLayersNames())

# benchmark
start = time.time()
for i in range(100):
    net.setInput(blob)
    detections = net.forward(net.getUnconnectedOutLayersNames())
end = time.time()

ms_per_image = (end - start) * 1000 / 100

print("Time per inference: %f ms" % (ms_per_image))
print("FPS: ", 1000.0 / ms_per_image)

I get 150 FPS with this script and ~100 FPS with preprocessing, postprocessing and NMS included on GTX 1050 in FP32 target. That's a big difference.

The FPS reported by this script should be more comparable to tkDNN's FPS. It still counts the CPU-GPU transfer cost though. There is no way to avoid this in the current OpenCV version (there is a feature request issue at OpenCV for direct GpuMat input and outputs).

marvision-ai commented 3 years ago

@YashasSamaga Here is the output:

nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py 
Time per inference: 4.948280 ms
FPS:  202.090429153262
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py 
Time per inference: 4.924431 ms
FPS:  203.0691527682023
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py 
Time per inference: 5.097697 ms
FPS:  196.16702266327243
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py 
Time per inference: 5.081832 ms
FPS:  196.77941331706916
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py 
Time per inference: 4.829888 ms
FPS:  207.04412375938023
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py 
Time per inference: 5.090086 ms
FPS:  196.460317096008
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py 
Time per inference: 5.069916 ms
FPS:  197.24191702990439
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py 
Time per inference: 4.932575 ms
FPS:  202.73385690366538
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py 
Time per inference: 5.111401 ms
FPS:  195.6410745579154

As a reference, TKDNN at 416x416 gives around 250FPS using FP16 on the Xavier.

hlacikd commented 3 years ago

@marvision-ai

thank you for the great implementation! I have an Xavier at my disposal, so please let me know if there are any other benchmarks you want.

could you please also do a benchmark with AlexeyAB's darknet? I am really curious whether it will be slower or faster than OpenCV. I did some heavy testing on a Jetson Nano and found that they both detect at the same speed. At first I thought the darknet version was faster, but then I realized that for darknet I had forgotten to include the CPU time for cv2.resize to 416x416 (which, in the case of model.detect for OpenCV DNN detection, is hidden inside). After that, both frameworks ran at the same speed within a margin of error (in FP32, of course, since darknet has no FP16).

marvision-ai commented 3 years ago

@hlacik @YashasSamaga Sure thing. These will be on the standard 416x416 tiny-yolo

Here are 2 ways I tested it:

  1. Using pure C darknet code:

./darknet detector demo cfg/coco.data backup/yolov4-tiny.cfg backup/yolov4-tiny.weights -ext_output /home/nvidia/darknet/match5-c1.avi

Results: Avg FPS = 39.5

person: 98%     (left_x:   10   top_y:   49   width:   18   height:   50)
person: 95%     (left_x:  217   top_y:   45   width:   18   height:   59)
person: 94%     (left_x:  124   top_y:   51   width:   24   height:   88)
person: 84%     (left_x:  194   top_y:   45   width:   16   height:   54)
person: 82%     (left_x:  173   top_y:   42   width:   24   height:   80)
person: 68%     (left_x:  325   top_y:   50   width:   30   height:   65)
person: 66%     (left_x:  299   top_y:   51   width:   51   height:   80)
person: 65%     (left_x:  240   top_y:   49   width:   27   height:   70)
person: 59%     (left_x:  166   top_y:   41   width:   21   height:   78)
person: 54%     (left_x:  231   top_y:   42   width:   16   height:   56)
person: 28%     (left_x:  148   top_y:   40   width:   12   height:   34)

FPS:102.5    AVG_FPS:39.5
  2. Using darknet with Python: python3 darknet_video.py
    
    from ctypes import *
    import math
    import random
    import os
    import cv2
    import numpy as np
    import time
    import darknet

def convertBack(x, y, w, h):
    xmin = int(round(x - (w / 2)))
    xmax = int(round(x + (w / 2)))
    ymin = int(round(y - (h / 2)))
    ymax = int(round(y + (h / 2)))
    return xmin, ymin, xmax, ymax

def cvDrawBoxes(detections, img):
    for detection in detections:
        x, y, w, h = detection[2][0], detection[2][1], detection[2][2], detection[2][3]
        xmin, ymin, xmax, ymax = convertBack(
            float(x), float(y), float(w), float(h))
        pt1 = (xmin, ymin)
        pt2 = (xmax, ymax)
        cv2.rectangle(img, pt1, pt2, (0, 255, 0), 1)
        cv2.putText(img,
                    detection[0].decode() + " [" + str(round(detection[1] * 100, 2)) + "]",
                    (pt1[0], pt1[1] - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5,
                    [0, 255, 0], 2)
    return img

netMain = None
metaMain = None
altNames = None

def YOLO():

global metaMain, netMain, altNames
configPath = "/home/nvidia/darknet/backup/yolov4-tiny.cfg"
weightPath = "/home/nvidia/darknet/backup/yolov4-tiny.weights"
metaPath = "/home/nvidia/darknet/cfg/coco.data"
if not os.path.exists(configPath):
    raise ValueError("Invalid config path `" +
                     os.path.abspath(configPath)+"`")
if not os.path.exists(weightPath):
    raise ValueError("Invalid weight path `" +
                     os.path.abspath(weightPath)+"`")
if not os.path.exists(metaPath):
    raise ValueError("Invalid data file path `" +
                     os.path.abspath(metaPath)+"`")
if netMain is None:
    netMain = darknet.load_net_custom(configPath.encode(
        "ascii"), weightPath.encode("ascii"), 0, 1)  # batch size = 1
if metaMain is None:
    metaMain = darknet.load_meta(metaPath.encode("ascii"))
if altNames is None:
    try:
        with open(metaPath) as metaFH:
            metaContents = metaFH.read()
            import re
            match = re.search("names *= *(.*)$", metaContents,
                              re.IGNORECASE | re.MULTILINE)
            if match:
                result = match.group(1)
            else:
                result = None
            try:
                if os.path.exists(result):
                    with open(result) as namesFH:
                        namesList = namesFH.read().strip().split("\n")
                        altNames = [x.strip() for x in namesList]
            except TypeError:
                pass
    except Exception:
        pass
#cap = cv2.VideoCapture(0)
cap = cv2.VideoCapture("/home/nvidia/darknet/match5-c1.avi")
cap.set(3, 416)
cap.set(4, 416)
out = cv2.VideoWriter(
    "output.avi", cv2.VideoWriter_fourcc(*"MJPG"), 10.0,
    (darknet.network_width(netMain), darknet.network_height(netMain)))
print("Starting the YOLO loop...")

# Create an image we reuse for each detect
darknet_image = darknet.make_image(darknet.network_width(netMain),
                                darknet.network_height(netMain),3)
while True:
    prev_time = time.time()
    ret, frame_read = cap.read()
    frame_rgb = cv2.cvtColor(frame_read, cv2.COLOR_BGR2RGB)
    frame_resized = cv2.resize(frame_rgb,
                               (darknet.network_width(netMain),
                                darknet.network_height(netMain)),
                               interpolation=cv2.INTER_LINEAR)

    darknet.copy_image_from_bytes(darknet_image,frame_resized.tobytes())

    detections = darknet.detect_image(netMain, metaMain, darknet_image, thresh=0.25)
    image = cvDrawBoxes(detections, frame_resized)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    print(1/(time.time()-prev_time))
    cv2.imshow('Demo', image)
    cv2.waitKey(3)
cap.release()
out.release()

if name == "main": YOLO()


Results: Avg FPS = ~76 

77.39853481205367 77.07000845246408 76.6166885868771 71.84487838300788 75.80935167278183 72.04355966265308 77.52012715780135 70.36950540232199 70.17054522945142 71.75760893740056 71.93852908891328 76.39897996357013 76.65029239766082 77.73995885307582 71.17675807765409 76.92724171450581 75.70262611677646 76.28087660270982 73.66699452016299 77.5172617727508 77.28444288846714 77.25881854519332


3. For the sake of completing the experiment, here are the results of using OpenCV with FP32 (not FP16) to make it a fair comparison.
To do this I have changed `net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)` --> `net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)` (please correct me if this is not the proper way to do it.)

Results = Avg FPS ~105 FPS

FPS: 104.95 (excluding drawing time of 2.19ms) FPS: 108.71 (excluding drawing time of 1.93ms) FPS: 108.57 (excluding drawing time of 2.12ms) FPS: 109.00 (excluding drawing time of 2.36ms) FPS: 106.71 (excluding drawing time of 2.10ms) FPS: 107.48 (excluding drawing time of 1.97ms) FPS: 103.74 (excluding drawing time of 2.16ms) FPS: 103.69 (excluding drawing time of 2.10ms) FPS: 97.80 (excluding drawing time of 2.03ms) FPS: 107.07 (excluding drawing time of 1.94ms) FPS: 89.49 (excluding drawing time of 2.20ms) FPS: 110.30 (excluding drawing time of 2.13ms) FPS: 89.32 (excluding drawing time of 2.08ms) FPS: 111.05 (excluding drawing time of 2.39ms) FPS: 109.29 (excluding drawing time of 2.01ms) FPS: 109.68 (excluding drawing time of 1.95ms) FPS: 100.33 (excluding drawing time of 2.20ms) FPS: 109.78 (excluding drawing time of 1.90ms) FPS: 107.80 (excluding drawing time of 1.83ms) FPS: 103.31 (excluding drawing time of 1.88ms) FPS: 102.54 (excluding drawing time of 1.89ms) FPS: 103.53 (excluding drawing time of 1.77ms) FPS: 103.60 (excluding drawing time of 2.13ms) FPS: 102.80 (excluding drawing time of 2.18ms) FPS: 104.25 (excluding drawing time of 2.58ms)



In summary: even with FP32, opencv is still faster than darknet. 
Opencv at FP16 is **much faster** @ ~200FPS. 

hlacikd commented 3 years ago

In summary: even with FP32, opencv is still faster than darknet. Opencv at FP16 is much faster @ ~200FPS.

@marvision-ai awesome, you're fast lol!

It's amazing to see how much difference the GPU architecture makes (Maxwell on the Nano vs Volta on the NX). After your summary, I am definitely convinced to switch to OpenCV in production!

marvision-ai commented 3 years ago

@hlacik Please note, I am using the Xavier AGX not NX. :smile:

YashasSamaga commented 3 years ago

@marvision-ai

Can you measure the FPS using ./darknet detector demo cfg/coco.data cfg/yolov4-tiny.cfg yolov4-tiny.weights test.mp4 -benchmark?

Can you measure the OpenCV FP32 FPS with the new script (https://github.com/AlexeyAB/darknet/issues/6245#issuecomment-662592328) I shared? (which does not include preprocessing, postprocessing and NMS)

@hlacik

I had forgotten to include the CPU time for cv2.resize to 416x416 (which, in the case of model.detect for OpenCV DNN detection, is hidden inside)

model.detect() includes preprocessing (which will automatically resize any input to 416x416), NMS and postprocessing. To compare Darknet and OpenCV, you need to use https://github.com/AlexeyAB/darknet/issues/6245#issuecomment-662592328 and ./darknet detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights test.mp4 -benchmark.

It's amazing to see how much difference the GPU architecture makes (Maxwell on the Nano vs Volta on the NX).

Both Darknet and OpenCV use cuDNN. Therefore, the convolution performance will be exactly the same. OpenCV does a whole lot of optimizations but you will only notice these optimizations if the convolutions are fast; otherwise, the slow convolutions will blur all the improvements from OpenCV. So you will see OpenCV get increasingly faster than Darknet as your device gets better and better.

There are some open PRs and yet-to-open PRs for autotuning, layout optimization, etc. for OpenCV. So be sure to check again in future!

marvision-ai commented 3 years ago

@marvision-ai

Can you measure the FPS using ./darknet detector demo cfg/coco.data cfg/yolov4-tiny.cfg yolov4-tiny.weights test.mp4 -benchmark?

Can you measure the OpenCV FP32 FPS with the new script (#6245 (comment)) I shared? (which does not include preprocessing, postprocessing and NMS)

@YashasSamaga Sure thing! ./darknet detector demo cfg/coco.data cfg/yolov4-tiny.cfg yolov4-tiny.weights test.mp4 -benchmark

CUDA-version: 10000 (10000), cuDNN: 7.6.3, CUDNN_HALF=1, GPU count: 1  
 CUDNN_HALF=1 
 OpenCV version: 4.1.1

Result: FPS:146.5 AVG_FPS:143.2

Therefore, even with darknet at FP16 (I assume, since it has CUDNN_HALF=1), OpenCV is still 60 FPS faster with its Python version. I assume the C++ OpenCV version will be even faster.

Let me know if you want any other testing.

marvision-ai commented 3 years ago

There are some open PRs and yet-to-open PRs for autotuning, layout optimization, etc. for OpenCV. So be sure to check again in future!

@YashasSamaga I am very excited for this. I feel like that is what TensorRT currently has over the OpenCV version, since it creates an inference engine beforehand with layer optimization and kernel autotuning. If OpenCV can mimic this in some way, I think we may have a new leader.

Question: I built my opencv with the provided 4.4.0.zip ... To incorporate new PRs and potentially test new features performance in the future, do you recommend just cloning the Master branch from the repo and building directly from there?

YashasSamaga commented 3 years ago

I built my opencv with the provided 4.4.0.zip ... To incorporate new PRs and potentially test new features performance in the future, do you recommend just cloning the Master branch from the repo and building directly from there?

Yes, the newly merged PRs will be available in the master branch.

ryumansang commented 3 years ago

@YashasSamaga Thank you for your reply. Unfortunately, I'm having one problem.

Using the dnn_DetectionModel approach you have created gives a significant slowdown.

My current benchmark is as follows:

  • Darknet: 110 FPS
  • darknet python: 50 FPS
  • dnn_DetectionModel: 19 FPS
  • net.forward: 360 FPS

The above results suggest that post-processing (NMS, bbox decoding, etc.) may take considerable time. I'm having a hard time trying to solve this problem with Cython. But I don't understand why the difference between darknet and OpenCV (50 FPS -> 19 FPS) occurs when the same Python is used.

YashasSamaga commented 3 years ago

My current benchmark is as follows:

  • Darknet: 110 FPS
  • darknet python: 50 FPS
  • dnn_DetectionModel: 19 FPS
  • net.forward: 360 FPS

Can you share the command/code you used to get these numbers?

Something doesn't look right. Preprocessing, postprocessing and NMS cannot be a bottleneck to an extent that it caps the FPS to ~19.
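One way to narrow this down is to time the full model.detect() pipeline and the bare net.forward() separately on the same image. A rough sketch (file names are placeholders and the 1/255 scale is an assumption; substitute your own model, image and thresholds):

import time
import cv2

# placeholder files; substitute your own model and image
net = cv2.dnn.readNet("yolov4-tiny.weights", "yolov4-tiny.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/255.0)
img = cv2.imread("dog.jpg")

# warm-up: the first few forward passes are slow due to lazy initialization
for _ in range(10):
    model.detect(img, 0.2, 0.4)

# full pipeline: preprocessing + inference + NMS + box decoding
start = time.time()
for _ in range(100):
    model.detect(img, 0.2, 0.4)
detect_ms = (time.time() - start) * 1000 / 100

# inference only, on an equivalent preprocessed input
blob = cv2.dnn.blobFromImage(img, 1/255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
net.forward(net.getUnconnectedOutLayersNames())  # warm up this path as well
start = time.time()
for _ in range(100):
    net.forward(net.getUnconnectedOutLayersNames())
forward_ms = (time.time() - start) * 1000 / 100

print("model.detect: %.2f ms/frame, net.forward: %.2f ms/frame" % (detect_ms, forward_ms))

If detect() is far slower than forward() here, the time is going into pre/postprocessing rather than the DNN itself.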

ryumansang commented 3 years ago

@YashasSamaga

import cv2
import time

CONFIDENCE_THRESHOLD = 0.2
NMS_THRESHOLD = 0.4
COLORS = [(0, 255, 255), (255, 255, 0), (0, 255, 0), (255, 0, 0)]

class_names = []
with open("cfg/coco.names", "r") as f:
    class_names = [cname.strip() for cname in f.readlines()]

img = cv2.imread('data/dog.jpg')

net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/256)

while True:
    start = time.time()
    classes, scores, boxes = model.detect(img, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
    print(1 / (time.time() - start))

That is the code I used.

I tested both yolov4 and custom models; the benchmark above is based on the custom model.

However, the same thing happened with yolov4. One thing I could see was that net.forward on its own reached a high FPS, but calculating the bboxes took a tremendous amount of time, so I ended up at the same ~19 FPS.

We tried various environments in case it was a hardware problem, but the results were the same...

My settings are as follows.

  1. opencv 4.4.0
  2. cuda 10.2 & cudnn 7.6.5
  3. Python 3.6

For Hardware

  1. CPU: i9 9850x
  2. GPU: RTX2080 Ti
  3. RAM: 32GB
  4. SSD Usage

For Custom Models

  1. yolov4-tiny model

marvision-ai commented 3 years ago

@YashasSamaga Do you mind letting us know when new features like auto-tuning and layout optimization are released? I would love to be able to benchmark them for you but have no idea when they are due to be finished.

YashasSamaga commented 3 years ago

@marvision-ai I just closed the current autotuning PR (opencv/16900) since cuDNN 8 has a new API. I will have to add complete support for cuDNN 8 and then add autotuning and layout optimizer. I don't really know when it will be available but it's going to be at least a month.

You can find some stats of YOLOv4 with autotuning at https://github.com/opencv/opencv/pull/17748 but the numbers are outdated (it should be faster than the timings reported there).

marvision-ai commented 3 years ago

@YashasSamaga Okay great. Thanks for the update. Was it confirmed that there are performance regressions with the cuDNN 8 when using cv2 dnn support?

vtyw commented 3 years ago

Hi @YashasSamaga, I've been trying to replicate your published FPS performance with OpenCV DNN on an RTX 2080 Ti to no avail; I'm getting half the speed in some cases.

OS: Ubuntu 18.04 LTS
CPU: i9-9900X
GPUs: RTX 2080, RTX 2080 Ti
CUDA version: 10.0.130
cuDNN version: 7.6.5

The way I'm building OpenCV master:

cmake \
      -D BUILD_opencv_core=ON \
      -D BUILD_opencv_cudev=ON \
      -D BUILD_opencv_dnn=ON \
      -D BUILD_opencv_highgui=ON \
      -D BUILD_opencv_imgcodecs=ON \
      -D BUILD_opencv_imgproc=ON \
      -D BUILD_opencv_python3=ON \
      -D BUILD_opencv_python_bindings_generator=ON \
      -D BUILD_opencv_videoio=ON \
      -D CMAKE_BUILD_TYPE=Release \
      -D CMAKE_CONFIGURATION_TYPES=Release \
      -D CMAKE_INSTALL_PREFIX=/home/victor/opencv-master/install \
      -D CUDA_FAST_MATH=ON \
      -D CUDA_GENERATION=Turing \
      -D ENABLE_FAST_MATH=ON \
      -D OPENCV_DNN_CUDA=ON \
      -D OPENCV_ENABLE_NONFREE=ON \
      -D OPENCV_EXTRA_MODULES_PATH=/home/victor/opencv-master/opencv_contrib/modules \
      -D WITH_CUBLAS=ON \
      -D WITH_CUDA=ON \
      -D WITH_CUDNN=ON \
      -D WITH_OPENGL=ON \
      -D WITH_QT=ON \
      -D WITH_TBB=ON \
      -D WITH_V4L=ON \
      -D OpenGL_GL_PREFERENCE=GLVND \
      ..

The cmake output looks like this:

Detected processor: x86_64
Looking for ccache - found (/usr/bin/ccache)
Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found suitable version "1.2.11", minimum required is "1.2.3") 
Could NOT find OpenJPEG (minimal suitable version: 2.0, recommended version >= 2.3.1)
Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11") 
Found OpenEXR: /usr/lib/x86_64-linux-gnu/libIlmImf.so
Found TBB (env): /usr/lib/x86_64-linux-gnu/libtbb.so
found Intel IPP (ICV version): 2020.0.0 [2020.0.0 Gold]
at: /home/victor/opencv-master/build/3rdparty/ippicv/ippicv_lnx/icv
found Intel IPP Integration Wrappers sources: 2020.0.0
at: /home/victor/opencv-master/build/3rdparty/ippicv/ippicv_lnx/iw
CUDA detected: 10.0
CUDA NVCC target flags: -gencode;arch=compute_75,code=sm_75;-D_FORCE_INLINES
Could not find OpenBLAS include. Turning OpenBLAS_FOUND off
Could not find OpenBLAS lib. Turning OpenBLAS_FOUND off
Could NOT find Atlas (missing: Atlas_CLAPACK_INCLUDE_DIR) 
A library with LAPACK API found.
Found apache ant: /usr/bin/ant (1.10.5)
Could NOT find JNI (missing: JAVA_INCLUDE_PATH JAVA_INCLUDE_PATH2 JAVA_AWT_INCLUDE_PATH) 
VTK is not found. Please set -DVTK_DIR in CMake to VTK build directory, or to VTK install subdirectory with VTKConfig.cmake file
OpenCV Python: during development append to PYTHONPATH: /home/victor/opencv-master/build/python_loader
Caffe:   NO
Protobuf:   NO
Glog:   YES
freetype2:   YES (ver 21.0.15)
harfbuzz:    YES (ver 1.7.2)
HDF5: Using hdf5 compiler wrapper to determine C configuration
Julia not found. Not compiling Julia Bindings. 
Module opencv_ovis disabled because OGRE3D was not found
No preference for use of exported gflags CMake configuration set, and no hints for include/library directories provided. Defaulting to preferring an installed/exported gflags CMake configuration if available.
Found installed version of gflags: /usr/lib/x86_64-linux-gnu/cmake/gflags
Detected gflags version: 2.2.1
Checking SFM deps... TRUE
CERES support is disabled. Ceres Solver for reconstruction API is required.
Tesseract:   YES (ver 4.0.0-beta.1)
Allocator metrics storage type: 'long long'
Registering hook 'INIT_MODULE_SOURCES_opencv_dnn': /home/victor/opencv-master/modules/dnn/cmake/hooks/INIT_MODULE_SOURCES_opencv_dnn.cmake

General configuration for OpenCV 4.4.0-dev =====================================
  Version control:               4.4.0-9-gb698d0a6ee

  Extra modules:
    Location (extra):            /home/victor/opencv-master/opencv_contrib/modules
    Version control (extra):     4.4.0-2-gbdc01011

  Platform:
    Timestamp:                   2020-07-28T05:31:00Z
    Host:                        Linux 4.15.0-109-generic x86_64
    CMake:                       3.15.2
    CMake generator:             Unix Makefiles
    CMake build tool:            /usr/bin/make
    Configuration:               Release

  CPU/HW features:
    Baseline:                    SSE SSE2 SSE3
      requested:                 SSE3
    Dispatched code generation:  SSE4_1 SSE4_2 FP16 AVX AVX2 AVX512_SKX
      requested:                 SSE4_1 SSE4_2 AVX FP16 AVX2 AVX512_SKX
      SSE4_1 (13 files):         + SSSE3 SSE4_1
      SSE4_2 (1 files):          + SSSE3 SSE4_1 POPCNT SSE4_2
      FP16 (0 files):            + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 AVX
      AVX (4 files):             + SSSE3 SSE4_1 POPCNT SSE4_2 AVX
      AVX2 (25 files):           + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2
      AVX512_SKX (3 files):      + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2 AVX_512F AVX512_COMMON AVX512_SKX

  C/C++:
    Built as dynamic libs?:      YES
    C++ standard:                11
    C++ Compiler:                /usr/bin/c++  (ver 7.5.0)
    C++ flags (Release):         -fsigned-char -ffast-math -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Winit-self -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -O3 -DNDEBUG  -DNDEBUG
    C++ flags (Debug):           -fsigned-char -ffast-math -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Winit-self -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -g  -O0 -DDEBUG -D_DEBUG
    C Compiler:                  /usr/bin/cc
    C flags (Release):           -fsigned-char -ffast-math -W -Wall -Werror=return-type -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Winit-self -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -O3 -DNDEBUG  -DNDEBUG
    C flags (Debug):             -fsigned-char -ffast-math -W -Wall -Werror=return-type -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Winit-self -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -g  -O0 -DDEBUG -D_DEBUG
    Linker flags (Release):      -Wl,--exclude-libs,libippicv.a -Wl,--exclude-libs,libippiw.a   -Wl,--gc-sections -Wl,--as-needed  
    Linker flags (Debug):        -Wl,--exclude-libs,libippicv.a -Wl,--exclude-libs,libippiw.a   -Wl,--gc-sections -Wl,--as-needed  
    ccache:                      YES
    Precompiled headers:         NO
    Extra dependencies:          m pthread /usr/lib/x86_64-linux-gnu/libOpenGL.so /usr/lib/x86_64-linux-gnu/libGLX.so /usr/lib/x86_64-linux-gnu/libGLU.so cudart_static -lpthread dl rt nppc nppial nppicc nppicom nppidei nppif nppig nppim nppist nppisu nppitc npps cublas cudnn cufft -L/usr/local/cuda-10.0/lib64 -L/usr/lib/x86_64-linux-gnu
    3rdparty dependencies:

  OpenCV modules:
    To be built:                 core cudev dnn highgui imgcodecs imgproc python3 videoio
    Disabled:                    alphamat aruco bgsegm bioinspired calib3d ccalib cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cvv datasets dnn_objdetect dnn_superres dpm face features2d flann freetype fuzzy gapi hdf hfs img_hash intensity_transform java_bindings_generator line_descriptor ml objdetect optflow phase_unwrapping photo plot python2 python_tests quality rapid reg rgbd saliency sfm shape stereo stitching structured_light superres surface_matching text tracking ts video videostab world xfeatures2d ximgproc xobjdetect xphoto
    Disabled by dependency:      -
    Unavailable:                 cnn_3dobj java js julia matlab ovis viz
    Applications:                -
    Documentation:               NO
    Non-free algorithms:         YES

  GUI: 
    QT:                          YES (ver 5.9.5)
      QT OpenGL support:         YES (Qt5::OpenGL 5.9.5)
    GTK+:                        NO
    OpenGL support:              YES (/usr/lib/x86_64-linux-gnu/libOpenGL.so /usr/lib/x86_64-linux-gnu/libGLX.so /usr/lib/x86_64-linux-gnu/libGLU.so)
    VTK support:                 NO

  Media I/O: 
    ZLib:                        /usr/lib/x86_64-linux-gnu/libz.so (ver 1.2.11)
    JPEG:                        /usr/lib/x86_64-linux-gnu/libjpeg.so (ver 80)
    WEBP:                        /usr/lib/x86_64-linux-gnu/libwebp.so (ver encoder: 0x020e)
    PNG:                         /usr/lib/x86_64-linux-gnu/libpng.so (ver 1.6.34)
    TIFF:                        /usr/lib/x86_64-linux-gnu/libtiff.so (ver 42 / 4.0.9)
    JPEG 2000:                   /usr/lib/x86_64-linux-gnu/libjasper.so (ver 1.900.1)
    OpenEXR:                     /usr/lib/x86_64-linux-gnu/libImath.so /usr/lib/x86_64-linux-gnu/libIlmImf.so /usr/lib/x86_64-linux-gnu/libIex.so /usr/lib/x86_64-linux-gnu/libHalf.so /usr/lib/x86_64-linux-gnu/libIlmThread.so (ver 2_2)
    HDR:                         YES
    SUNRASTER:                   YES
    PXM:                         YES
    PFM:                         YES

  Video I/O:
    DC1394:                      YES (2.2.5)
    FFMPEG:                      YES
      avcodec:                   YES (57.107.100)
      avformat:                  YES (57.83.100)
      avutil:                    YES (55.78.100)
      swscale:                   YES (4.8.100)
      avresample:                YES (3.7.0)
    GStreamer:                   YES (1.14.5)
    v4l/v4l2:                    YES (linux/videodev2.h)

  Parallel framework:            TBB (ver 2017.0 interface 9107)

  Trace:                         YES (with Intel ITT)

  Other third-party libraries:
    Intel IPP:                   2020.0.0 Gold [2020.0.0]
           at:                   /home/victor/opencv-master/build/3rdparty/ippicv/ippicv_lnx/icv
    Intel IPP IW:                sources (2020.0.0)
              at:                /home/victor/opencv-master/build/3rdparty/ippicv/ippicv_lnx/iw
    Lapack:                      NO
    Eigen:                       YES (ver 3.3.4)
    Custom HAL:                  NO
    Protobuf:                    build (3.5.1)

  NVIDIA CUDA:                   YES (ver 10.0, CUFFT CUBLAS FAST_MATH)
    NVIDIA GPU arch:             75
    NVIDIA PTX archs:

  cuDNN:                         YES (ver 7.6.5)

  OpenCL:                        YES (no extra features)
    Include path:                /home/victor/opencv-master/3rdparty/include/opencl/1.2
    Link libraries:              Dynamic load

  Python 3:
    Interpreter:                 /usr/bin/python3 (ver 3.6.9)
    Libraries:                   /usr/lib/x86_64-linux-gnu/libpython3.6m.so (ver 3.6.9)
    numpy:                       /home/victor/.local/lib/python3.6/site-packages/numpy/core/include (ver 1.16.1)
    install path:                lib/python3.6/dist-packages/cv2/python-3.6

  Python (for build):            /usr/bin/python2.7

  Java:                          
    ant:                         /usr/bin/ant (ver 1.10.5)
    JNI:                         NO
    Java wrappers:               NO
    Java tests:                  NO

  Install to:                    /home/victor/opencv-master/install
-----------------------------------------------------------------

Configuring done

Then I'm compiling and running your benchmark code https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf using:

# LD_LIBRARY_PATH is /usr/local/cuda:/home/victor/opencv-master/install/lib/
g++ -I /home/victor/opencv-master/install/include/opencv4/ -I /home/victor/opencv-master/install/include/opencv4/opencv2 -L/home/victor/opencv-master/install/lib/ benchmark.cpp -lopencv_core -lopencv_imgproc -lopencv_dnn -lopencv_imgcodecs -O3 -std=c++17
CUDA_VISIBLE_DEVICES=1 ./a.out

As an example, the raw output for FP16 with batch size 1 looks like this:

YOLO v4
[CUDA FP16]
    init >> 1703.52ms
    inference >> min = 8.659ms, max = 11.054ms, mean = 8.92965ms, stddev = 0.663459ms

The results I'm getting for the mean times are:

Batch    Mean FP16 (ms)    Mean FP32 (ms)    FPS (FP16)    FPS (FP32)
1        8.93              17.20             112           58
2        14.64             30.37             137           66
4        27.29             57.77             147           69
8        52.43             105.59            153           76
16       103.63            216.45            154           74
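(For clarity, the FPS columns are just the batch size divided by the mean per-batch latency. A trivial sketch of that arithmetic, assuming the mean times above are milliseconds per batch:)

// Not part of the benchmark script: just shows how an FPS column is derived from a mean latency.
#include <cstdio>

int main()
{
    const double mean_ms = 27.29; // FP16, batch = 4, from the table above
    const int batch = 4;
    std::printf("%.0f FPS\n", batch * 1000.0 / mean_ms); // prints 147
    return 0;
}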

My FPS with 416x416 (default size in benchmark script), batch size 4 and FP16 is 147, compared to your 294 FPS. Batching doesn't improve the performance much for me, and all my base times with batch = 1 are noticeably slower than yours too.

Any ideas on what I'm missing that would explain this performance gap? The only difference I'm aware of is that you're using CUDA 10.2 and I'm using 10.0, but I didn't think that would be enough to explain it.

YashasSamaga commented 3 years ago

@marvision-ai

Was it confirmed that there are performance regressions with the cuDNN 8 when using cv2 dnn support?

Yes.

@vtyw

My FPS with 416x416 (default size in benchmark script), batch size 4 and FP16 is 147, compared to your 294 FPS. Batching doesn't improve the performance much for me, and all my base times with batch = 1 are noticeably slower than yours too.

The default size in the benchmark script is 608 x 608. Note that the image size is set for each model in my code.

void bench_yolo_v4()
{
    std::cout << "YOLO v4\n";
    bench_network("data/yolov4/yolov4.cfg", "data/yolov4/yolov4.weights", cv::Size(608, 608));
    std::cout << std::endl;
}
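(For what it's worth, if you want the 416 x 416 numbers from the same script, a minimal tweak, sketched against the bench_network helper shown above and not part of the gist, would be:)

void bench_yolo_v4_416()
{
    std::cout << "YOLO v4 (416 x 416)\n";
    bench_network("data/yolov4/yolov4.cfg", "data/yolov4/yolov4.weights", cv::Size(416, 416));
    std::cout << std::endl;
}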

If I compare your timings with mine for 608 x 608, they match really well. They differ by at most 3 FPS. So I strongly believe that you measured 608 x 608 timings.

All values are FPS (mine / yours):

Input Size    FP32 (batch = 1)    FP32 (batch = 4)    FP16 (batch = 1)    FP16 (batch = 4)
608 x 608     60 / 58             72 / 69             115 / 112           149 / 147

Also the standard deviation is high in yours. Try running the program a few times and check that no other application is using your GPU.

    inference >> min = 8.659ms, max = 11.054ms, mean = 8.92965ms, stddev = 0.663459ms

while mine is

    inference >> min = 8.651ms, max = 8.683ms, mean = 8.66774ms, stddev = 0.00478416ms

And I did not set the RTX 2080 Ti's power mode to maximum performance for the benchmarks. You can probably surpass the numbers I have reported if you force the GPU to run at P0 (this is not even overclocking). I intentionally chose not to tweak the performance mode because most people wouldn't do it.

Also, you might get a slightly higher FPS if you wait for your GPU to cool down, to prevent it from throttling during the benchmark. People can easily cheat in benchmarks, so I have underreported on purpose. That's why it feels odd that you are actually 2-3 FPS slower; I would expect a regular user's benchmark to surpass my numbers.

Any ideas on what I'm missing that explains this performance gap?

If what I just said explains your problem, let me know. You will be the first to confirm my benchmarks.

vtyw commented 3 years ago

@YashasSamaga

The default size in the benchmark script is 608 x 608. Note that the image size is set for each model in my code. If what I just said explains your problem, let me know. You will be the first to confirm my benchmarks.

You're quite right, my benchmarks were for 608x608. When looking through the script I made a careless assumption that the settings were the same as for YOLOv3.

I have redone the benchmarks for 416x416 and 608x608, making sure to close processes taking up CPU and to leave zero other processes running on the RTX 2080 Ti (since it's my secondary card). I also set the GPU to the Prefer Maximum Performance power mode, though I have not been able to lock the performance state to P0 (it usually sits at P2 under high load or P8 under mild load).

My FPS numbers still tend to be within -2 to +1 FPS of your results and of the published numbers for pure Darknet. I notice that the standard deviations of my results are lower now, but still clearly higher than yours.

LionelLeee commented 3 years ago

@YashasSamaga @hlacik I have the same problem. With FP16, whether in C++ or Python, the FPS is only 0.2-0.4, and I am very confused. I have tried CUDA 10 + cuDNN 7.6.0 and CUDA 10 + cuDNN 7.6.3, and there is no change. But when using dnn.DNN_TARGET_CUDA, Python inference can reach 13-14 FPS and C++ can reach 18-19 FPS. My GPU is a GTX 1060. Why is this happening?

hlacikd commented 3 years ago

@YashasSamaga @hlacik I have the same problem. With FP16, whether in C++ or Python, the FPS is only 0.2-0.4, and I am very confused. I have tried CUDA 10 + cuDNN 7.6.0 and CUDA 10 + cuDNN 7.6.3, and there is no change. But when using dnn.DNN_TARGET_CUDA, Python inference can reach 13-14 FPS and C++ can reach 18-19 FPS. My GPU is a GTX 1060. Why is this happening?

I believe that GTX 10XX cards have no native FP16 (i.e. half-precision) support.
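If that's the case, the practical workaround on Pascal cards is to keep the CUDA backend but stay on the FP32 target. A minimal C++ sketch (file paths are placeholders; the equivalent constants exist in the Python bindings too):

#include <opencv2/dnn.hpp>

int main()
{
    auto net = cv::dnn::readNetFromDarknet("yolov4.cfg", "yolov4.weights");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);

    // Consumer Pascal GPUs (GTX 10XX) execute FP16 at a small fraction of their FP32 rate,
    // so the full-precision CUDA target is the faster choice there.
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

    // DNN_TARGET_CUDA_FP16 only pays off on GPUs with fast FP16 (Volta, Turing and newer).
    return 0;
}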

Shame-fight commented 3 years ago

So which method (OpenCV / Darknet / TensorRT) gives the fastest speed on the Jetson Nano / TX1 / TX2? Can anyone tell me?

yancccc commented 3 years ago

@YashasSamaga For the detection effect, which is better: YOLOv4 with Darknet or YOLOv4 with OpenCV?

YashasSamaga commented 3 years ago

@yancccc I did not understand what you meant by "detection effect". OpenCV is faster than Darknet, but both produce identical detections, so the accuracy is the same.

You can find more information here.
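For reference, a minimal C++ sketch of running YOLOv4 through OpenCV's DetectionModel wrapper (paths, input size and thresholds here are placeholders, not recommendations):

#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>

#include <vector>

int main()
{
    auto net = cv::dnn::readNetFromDarknet("yolov4.cfg", "yolov4.weights");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA_FP16);

    // DetectionModel handles the pre/postprocessing in C++:
    // scale pixels to [0, 1], resize to the network input size, swap BGR -> RGB.
    cv::dnn::DetectionModel model(net);
    model.setInputParams(1.0 / 255.0, cv::Size(608, 608), cv::Scalar(), /*swapRB=*/true);

    cv::Mat frame = cv::imread("dog.jpg"); // placeholder image

    std::vector<int> classIds;
    std::vector<float> confidences;
    std::vector<cv::Rect> boxes;
    model.detect(frame, classIds, confidences, boxes, /*confThreshold=*/0.25f, /*nmsThreshold=*/0.45f);

    return 0;
}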