Best way to use YOLO on Jetson Xavier with max FPS

Kmarconi commented 4 years ago

Hi ! First thanks for the continuous update you are making on your repo, it's amazing. I'm working on a project on which i would like to be able to detect only one class of object but at a high speed ( at least 60 FPS). I just tested your yolov4 files and the yolov3 pruned weight and I'm blocked at 5 FPS on my Xavier whereas if I remember well, i was around 20 FPS with yolov3. I know that yolov4 is heavier than yolov3 but I was hopping that the pruned version of yolov3 would rise in terms of FPS but it did not and I think I did something wrong.

To compile darknet, I've put the flags GPU,CUDNN,CUDNN_HALF and OPENCV to 1. I also uncommented the ARCH version for the Xavier. Do I need to do something else ?

For now, i'm able to run some object detection algorithm at a speed of 150 FPS (ssd-inception) on the xavier but I really would like to use yolo because of it's accuracy. I know that I need to use TensorRT, the quantization so that the weight use FP16 or INT8 and not FP32 and I know how to do it with Tensorflow, but with darknet i'm kinda lost. Can you give me some help ?

Ps : I know that deepstream supports YOLO natively but I would like to do a python or C++ object-detection app and I'm not sure that It is possible to "import" the deepstream pipeline in a Python app and get the detected object from it.

Best Regards, SOrry for this long message but I'm passionate about YOLO ^^

DocF commented 4 years ago

In my view, for detection in one class, as long as it is not a dense small object, yolov3-tiny is enough.

AlexeyAB commented 4 years ago

know that deepstream supports YOLO natively but I would like to do a python or C++ object-detection app

Deepstream isn't C++ app? https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps and https://github.com/NVIDIA-AI-IOT/deepstream_4.x_apps

I know that I need to use TensorRT, the quantization so that the weight use FP16 or INT8 and not FP32

Yes, you can try to do INT8 quantization with TensorRT + Deepstream.

i would like to be able to detect only one class of object but at a high speed ( at least 60 FPS).

Very approximately for yolov4.cfg

width=416 height=416 in cfg - 9 FPS
width=320 height=320 in cfg - 13 FPS
width=320 height=320 in cfg INT8-TensorRT - 25 FPS
width=320 height=320 in cfg INT8-TensorRT batch=32 - 50 FPS

I made repo with INT8 quantization for Yolov2/v3 but it doesn't support Yolov4 https://github.com/AlexeyAB/yolo2_light

So may be better for you to use yolov3-tiny-prn.cfg

modern_gpus

marcoslucianops commented 4 years ago

Ps : I know that deepstream supports YOLO natively but I would like to do a python or C++ object-detection app and I'm not sure that It is possible to "import" the deepstream pipeline in a Python app and get the detected object from it.

You can get metadata from deepstream in Python and C. For C, you need edit deepstream-app or deepstream-test code. For Python your need install and edit this.

You need manipulate NvDsObjectMeta, NvDsFrameMeta and NvOSD_RectParams to get label, position, etc. of bboxs.

In C deepstream-app aplication, your code need be in analytics_done_buf_prob function. In C/Python deepstream-test application, your code need be in tiler_src_pad_buffer_probe function.

Example using C: https://www.youtube.com/watch?v=eFv4P1oj9pA Example using Python: https://www.youtube.com/watch?v=n3uYS550PDo

Python is slightly slower than C (on Jetson Nano, ~2FPS).

Kmarconi commented 4 years ago

Hi, first thanks for your 3 quick replies ! Since I would like my model to detect objects which could be big at foreground but also could be small at the very background of the image, I'm not sure Yolov3-tiny is valuable option for me ? Correct me if i'm wrong but I know that YOlov3 is analyzing the image at three different scale which is a good feature for my purpose. But it is done with 106 convolutions layers and I don't know if the few layers from yolov3-tiny could be enough to detect one object at a large and a small scale. Will take a look to your links @marcoslucianops thanks ! :) And thanks for your answer too @AlexeyAB :)

AlexeyAB commented 4 years ago

@Kmarconi @marcoslucianops You can use Yolov4 on tensorRT using tkDNN with 32 FPS(FP16) / 17 FPS(FP32) with batch=1 on AGX Xavier: https://github.com/AlexeyAB/darknet/issues/5354#issuecomment-621115435

With batch=4 FPS will be higher.

Kmarconi commented 4 years ago

Thanks ! WIll give it a try !

marcoslucianops commented 4 years ago

@AlexeyAB, I will compare tkDNN and DeepStream. Thanks!

Kmarconi commented 4 years ago

To keep you updated, I'm actually around 34 FPS with yolov4 on the Xavier with tkDNN.

AlexeyAB commented 4 years ago

@Kmarconi What batch-size, network resolution, and Float-point-precision (32/16) do you use?

Kmarconi commented 4 years ago

I'm using batch_size of 4, fp16 mode and didn't touch to the network resolution for the moment so the default yolov4 one

AlexeyAB commented 4 years ago

So do you get 34 FPS on Jetson Xavier by using yolov4.cfg width=608 height=608 batch_size=4 fp16 by using tkDNN+TensorRT?

Kmarconi commented 4 years ago

Sorry for the late response, I'm working in France so I'm not awaken in the same hours as you are ^^ I'm using width=416 height=416 batch=4 and fp16 with tkDNN+TRT to get 34 FPS on the Xavier yes ! :) I know that it is probably something too hard or too time consuming to do but it would be amazing to see one day an easy integration of TensorRT in the darknet project for every gpu architecture which support it. Will continue to test tkDNN today, will keep posting results

marcoslucianops commented 4 years ago

@AlexeyAB, DeepStream is faster than tkDNN. tkDNN shows 45.381ms inference time buts diplay video seems like 10-15 fps on Jeston Nano. I think it's due to OpenCV.

AlexeyAB commented 4 years ago

@mive93 Hi, Can you comment on this?

mive93 commented 4 years ago

Hi @marcoslucianops, how are you using tkDNN? Have you enabled FP16 inference? Have you enabled preprocessing on GPU? We have never tested tkDNN on a Jetson Nano, so I do not have data on that. However, yes, you are right, OpenCV could be a problem for performances.

Hi @Kmarconi

To keep you updated, I'm actually around 34 FPS with yolov4 on the Xavier with tkDNN.

How did you obtain this number? I think you are doing something wrong, those are the FPS with batch = 1.

marcoslucianops commented 4 years ago

Have you enabled FP16 inference?

I compared DeepStream FP32 vs tkDNN FP32

Have you enabled preprocessing on GPU?

Yes

I think that's problem (delay) in OpenCV when write bbox and imshow.

AlexeyAB commented 4 years ago

@mive93

I think, for tkDNN

it shouldn't show all frames on the screen. So CPU-thread that shows detections on the screen should work asynchronously and shows only the last frame
if it is implemented, it should write all frames to the output.avi file

Kmarconi commented 4 years ago

Hi @mive93 ,

Yeah I just saw that I was mistaken about the batch_size. Haven't seen

The test will still run with a batch of 1, but the created tensorRT can manage the desidered batch size.

So even if I export the batch_size variable to 4 for example, I will do my inference with only a batch_size of 1 ? Then how can I use the full potential of my trt engine ?

PS : 160 FPS with mobilenet on the Xavier,woaw. ^^

mive93 commented 4 years ago

@AlexeyAB @marcoslucianops yeah, it's due to OpenCV. And @AlexeyAB you are right, we should insert some flag to disable the graphics. However it is thought to be a library, so the demo is just an example, it's not how you use it. Ofc when I use it in other projects, the graphic part is handled by other tasks. But I could add a demo like that maybe.

@Kmarconi thanks :) Right now the batch can be only tested to check the FPS (using the rt_interence test). But this week I'm planning to allow using it in a demo, so that anyone can test it for real with more batches. It was a WIP.

AlexeyAB commented 4 years ago

@mive93 You can just add such part of code for bbox_drawing(), wait_key_cv() and show() functions, so these functions will be used no more than a 100 times per 1 second in the Demo: https://github.com/AlexeyAB/darknet/blob/0c7305cf638bd0e692c6195e412aff93200000e4/src/demo.c#L295-L296

harsco-jfernandez commented 4 years ago

How are all getting xavier to work at 34 FPS? I'm only able to get 24FPS!

I've set the following and my model is 320x320, not 416x416 like yours is.

TKDNN_BATCHSIZE=4 TKDNN_MODE=FP16

What else do I need?

yolo4_fp16.rt New NetworkRT (TensorRT v6.01) Float16 support: 1 Int8 support: 1 DLAs: 2 create execution context Input/outputs numbers: 4 input idex = 0 -> output index = 3 Data dim: 1 3 320 320 1 Data dim: 1 33 10 10 1 RtBuffer 0 dim: Data dim: 1 3 320 320 1 RtBuffer 1 dim: Data dim: 1 33 40 40 1 RtBuffer 2 dim: Data dim: 1 33 20 20 1 RtBuffer 3 dim: Data dim: 1 33 10 10 1 ===== TENSORRT detection ==== Time: 0.725123 ms Data dim: 1 3 320 320 1 Time: 19.7376 ms Data dim: 1 33 10 10 1 Time: 0.585052 ms

===== TENSORRT detection ==== Time: 0.71021 ms Data dim: 1 3 320 320 1 Time: 19.7166 ms Data dim: 1 33 10 10 1 Time: 0.396787 ms

===== TENSORRT detection ==== Time: 0.676224 ms Data dim: 1 3 320 320 1 Time: 19.7656 ms Data dim: 1 33 10 10 1 Time: 0.360881 ms

===== TENSORRT detection ==== Time: 0.758276 ms Data dim: 1 3 320 320 1 Time: 19.7501 ms Data dim: 1 33 10 10 1 Time: 0.458837 ms

Kmarconi commented 4 years ago

Are you in MAXN mode and did you used sudo /usr/bin/jetson_clocks?

mive93 commented 4 years ago

Hi @harsco-jfernandez ,

first, @Kmarconi is right. How did you create the rt file? Which command did you run to print those results? How did you use the batches? Right now batches are supported in the demo only in the branch eval.

harsco-jfernandez commented 4 years ago

Thank you, fellows!

Your questions are as good as answers. I made some assumptions. It is running at 40 FPS now.

AlexeyAB commented 4 years ago

@harsco-jfernandez 40 FPS is a good speed for Yolov4 on Jetson Xavier AGX.

harsco-jfernandez commented 4 years ago

@AlexeyAB It is excellent! I love it!

I'm now trying int8 inference. My camera is capable of 100 fps.

rafcy commented 4 years ago

Has anyone tested the performance on jetson xavier nx instead of AGX? (it's almost half the price of AGX)

mive93 commented 4 years ago

Hi @rafcy, Not yet, I'm waiting for the board to be shipped. But soonish I will do some tests on the nano.

AlexeyAB / darknet

Best way to use YOLO on Jetson Xavier with max FPS #5386