NVIDIA-AI-IOT / yolov5_gpu_optimization

This repository provides a YOLOv5 GPU optimization sample.
GNU General Public License v3.0

Benefits of using yolov5_decode.so #7

Closed YoungjaeDev closed 1 year ago

YoungjaeDev commented 1 year ago

Hello. I have been looking through the programs in tensorrt_example. At first I assumed it would be fast simply because it is written in C++, but the TensorRT engine forward time seems to show no difference.

  1. If I understand correctly, can I say that yolov5_decode.so is used to parse the output from the end of the engine so that the NMS can be made faster?
  2. Regarding G_YOLOV5_DECODE_LIB = ctypes.cdll.LoadLibrary('../deepstream-sample/yolov5_decode.so'): isn't this part of the TensorRT sample?
Tyler-D commented 1 year ago
  1. Yes. yolov5_decode.so parses the raw output of the YOLOv5 backbone into the bboxes that can be consumed by the NMS algorithm.
  2. No, yolov5_decode.so is not used in the TensorRT sample for inference or evaluation. The reason we load it in the TensorRT sample script is to leverage the int8 calibration generation pipeline there, in case users want to deploy int8 in DeepStream.

In the TensorRT sample, the YOLOv5 decoder is built from multiple separate ONNX operations that run on the GPU through TensorRT, and the NMS also runs on the GPU through a TensorRT plugin --- everything runs on the GPU and the results are almost the same as PyTorch.
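
For concreteness, the decode step (whether it is expressed as separate ONNX operations or fused into one kernel) computes roughly the standard YOLOv5 head decoding. The sketch below is a hedged numpy illustration of that math, not the actual code in this repository; the function name and tensor layout are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_yolov5_layer(raw, anchors, stride):
    """Decode one YOLOv5 detection layer into boxes in input-image pixels.

    raw:     (bs, na, ny, nx, 5 + nc) raw head output for this layer
    anchors: (na, 2) anchor width/height in pixels for this layer
    stride:  downsample factor of this layer (8, 16 or 32)
    returns: (bs, na * ny * nx, 5 + nc) as [cx, cy, w, h, obj, cls...]
    """
    bs, na, ny, nx, no = raw.shape
    # Cell-index grid, shape (1, 1, ny, nx, 2), ordered (x, y)
    gy, gx = np.meshgrid(np.arange(ny), np.arange(nx), indexing="ij")
    grid = np.stack((gx, gy), axis=-1).reshape(1, 1, ny, nx, 2)

    p = sigmoid(raw)
    xy = (p[..., 0:2] * 2.0 - 0.5 + grid) * stride                    # box centers
    wh = (p[..., 2:4] * 2.0) ** 2 * anchors.reshape(1, na, 1, 1, 2)   # box sizes
    out = np.concatenate((xy, wh, p[..., 4:]), axis=-1)               # obj + class scores
    return out.reshape(bs, -1, no)
```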

In the DeepStream sample, the YOLOv5 decoder is built as one single kernel, which is a little faster than the separate ONNX operations, and the NMS runs on the CPU with the DeepStream NMS module --- an optimized CPU NMS, but not the same one as in PyTorch. This version of NMS will lead to some mAP loss when you evaluate on COCO, but it is well suited to deployment: it is faster, and you do not get the large number of false positives that come with pursuing high mAP at a low confidence threshold.
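
The DeepStream CPU NMS module is its own optimized implementation; the numpy sketch below only illustrates what greedy NMS does conceptually (and why the number of candidate boxes, i.e. the confidence threshold, matters for its cost), it is not the DeepStream code, and the function name is made up for illustration.

```python
import numpy as np

def nms_xyxy(boxes, scores, iou_thres=0.45):
    """Greedy NMS on (N, 4) xyxy boxes with (N,) scores; returns kept indices."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # process highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the kept box with all remaining candidates
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop candidates that overlap the kept box too much
        order = order[1:][iou <= iou_thres]
    return keep
```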

YoungjaeDev commented 1 year ago

Oh, thank you. My goal is to find the faster option from the perspective of the overall inference pipeline.

  1. I will test the TensorRT sample, but based on what you are saying, there is not much difference in the result compared to torch. Can you say it brings a speed advantage from gpu-nms or nms-scorebit compared to torch?
  2. I'm looking for a fast inference pipeline based on the fp16 model; do you have any recommendations? I think it would be good to handle the preprocessing (letterbox) and scale_coords in C++.
Tyler-D commented 1 year ago

> Oh, thank you. My goal is to find the faster option from the perspective of the overall inference pipeline.
>
>   1. I will test the TensorRT sample, but based on what you are saying, there is not much difference in the result compared to torch. Can you say it brings a speed advantage from gpu-nms or nms-scorebit compared to torch?
>   2. I'm looking for a fast inference pipeline based on the fp16 model; do you have any recommendations? I think it would be good to handle the preprocessing (letterbox) and scale_coords in C++.

Hi,

  1. The result here means the mAP on COCO, not the performance or FPS. The FPS is higher because of gpu-nms. According to the official repo, the NMS takes ~1 ms per image; in my test on a V100 with bs = 32, the NMS takes 10 ms for 32 images (see the short calculation after this list).

  2. For deployment, I highly recommend you follow the deepstream sample to leverage NVIDIA's multimedia pipeline software, DeepStream. It covers all the things you care about in deployment.
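
Putting the numbers from item 1 together (the figures quoted above are the only inputs; nothing new is measured here):

```python
# Figures quoted above, used as assumptions
official_nms_ms_per_image = 1.0   # ~1 ms per image in the official YOLOv5 repo
v100_batched_nms_ms = 10.0        # 10 ms for the whole batch on V100
batch_size = 32

per_image_ms = v100_batched_nms_ms / batch_size
print(f"GPU batched NMS: ~{per_image_ms:.2f} ms/image "
      f"vs ~{official_nms_ms_per_image:.0f} ms/image single-image")
# -> GPU batched NMS: ~0.31 ms/image vs ~1 ms/image single-image
```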

YoungjaeDev commented 1 year ago

@Tyler-D Thank you. Lastly, in the tables below, how do I interpret "1 stream bs=1" and "FPS bs=32"? I understand that a higher value means better performance, but I don't understand why the FPS at bs=32 is greater than at bs=1!

https://github.com/NVIDIA-AI-IOT/yolov5_gpu_optimization#performance-summary https://github.com/NVIDIA-AI-IOT/yolov5_gpu_optimization#performancemap-summary

Tyler-D commented 1 year ago
  1. In the DeepStream sample section, 1 stream means 1 video stream, which leads to batch_size=1 inference in the DeepStream pipeline (the pipeline covers video decode -> preprocess -> model inference -> output decode -> NMS -> final output coordinates). We record the FPS of the DeepStream pipeline in this table: https://github.com/NVIDIA-AI-IOT/yolov5_gpu_optimization#performance-summary
  2. In the TensorRT sample section, bs = 32 means inference with batch size = 32. We record the FPS in this table: https://github.com/NVIDIA-AI-IOT/yolov5_gpu_optimization#performancemap-summary. It is normal that bs=32 gets higher FPS than bs=1, because bs=1 cannot fully occupy a powerful GPU (like a V100) while bs=32 can make full use of it. For example, suppose bs=1 takes 1 s per batch while bs=32 takes 20 s per batch thanks to the parallelism on the GPU. Then the FPS at bs=32 is 32/20 = 1.6, greater than the FPS at bs=1, which is 1 (see the short calculation below).
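
The FPS arithmetic in item 2, spelled out with the hypothetical latencies from the example above (these are illustrative numbers, not benchmark results):

```python
# Hypothetical per-batch latencies from the example above
latency_bs1 = 1.0     # seconds per batch at batch size 1
latency_bs32 = 20.0   # seconds per batch at batch size 32

fps_bs1 = 1 / latency_bs1       # 1.0 images/s
fps_bs32 = 32 / latency_bs32    # 1.6 images/s: higher throughput despite longer batch latency
print(fps_bs1, fps_bs32)
```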