NVIDIA-AI-IOT / yolo_deepstream

YOLO model QAT and deployment with DeepStream & TensorRT
Apache License 2.0

About converting a YOLOv7 QAT model to a TensorRT engine (failed with dynamic-batch setting) #46

Open YunghuiHsu opened 1 year ago

YunghuiHsu commented 1 year ago

When I follow yolo_deepstream/tree/main/tensorrt_yolov7 and use a "yolov7QAT" engine for a batched detection task, the following error occurs:

./build/detect --engine=yolov7QAT.engine --img=./imgs/horses.jpg,./imgs/zidane.jpg

Error Message

input 2 images, paths: ./imgs/horses.jpg, ./imgs/zidane.jpg, 
--------------------------------------------------------
Yolov7 initialized from: /opt/nvidia/deepstream/deepstream/samples/models/tao_pretrained_models/yolov7/yolov7QAT.engine
input : images , shape : [ 1,3,640,640,]
output : outputs , shape : [ 1,25200,85,]
--------------------------------------------------------
preprocess start
error cv_img.size() in preProcess
 error: mImgPushed = 1 numImg = 1 mMaxBatchSize= 1, mImgPushed + numImg > mMaxBatchSize 
inference start
postprocessing start
detectec image written to: ./imgs/horses.jpgdetect0.jpg
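
The `mImgPushed + numImg > mMaxBatchSize` line suggests the sample rejects any image beyond the engine's batch capacity: an engine built with a static [1,3,640,640] input has a capacity of 1, so the second image is dropped. A minimal sketch of that kind of guard (hypothetical names mirroring the log; the actual tensorrt_yolov7 code may differ):

```cpp
// Hypothetical sketch of the guard implied by the log above (names
// mirror the log; the actual tensorrt_yolov7 code may differ).
// An engine built with a static [1,3,640,640] input reports a max
// batch of 1, so pushing the second image is rejected.
#include <cstddef>
#include <iostream>

struct BatchGuard {
    std::size_t mImgPushed = 0;
    std::size_t mMaxBatchSize = 1;  // from the engine's static input shape

    bool push(std::size_t numImg) {
        if (mImgPushed + numImg > mMaxBatchSize) {
            std::cerr << "error: mImgPushed = " << mImgPushed
                      << " numImg = " << numImg
                      << " mMaxBatchSize= " << mMaxBatchSize
                      << ", mImgPushed + numImg > mMaxBatchSize\n";
            return false;
        }
        mImgPushed += numImg;
        return true;
    }
};

int main() {
    BatchGuard g;
    g.push(1);  // horses.jpg: accepted, fills the batch
    g.push(1);  // zidane.jpg: rejected, reproducing the error above
}
```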


Environment

    CUDA: 11.4.315
    cuDNN: 8.6.0.166
    TensorRT: 8.5.2
    Python: 3.8.10
    PyTorch: 1.12.0a0+2c916ef.nv22.3

Hardware

    Model: Jetson AGX
    Module: NVIDIA Jetson AGX Xavier (32 GB RAM)
    L4T: 35.2.1
    JetPack: 5.1
YunghuiHsu commented 1 year ago

In https://github.com/NVIDIA-AI-IOT/yolo_deepstream/tree/main/tensorrt_yolov7#prepare-tensorrt-engines, I suggest explicitly specifying the dynamic batch shapes, which solved the problem!

Replace

# int8 QAT model, the onnx model with Q&DQ nodes
/usr/src/tensorrt/bin/trtexec --onnx=yolov7qat.onnx --saveEngine=yolov7QAT.engine --fp16 --int8

with

# int8 QAT model, the onnx model with Q&DQ nodes and dynamic-batch
/usr/src/tensorrt/bin/trtexec --onnx=yolov7qat.onnx \
        --minShapes=images:1x3x640x640 \
        --optShapes=images:12x3x640x640 \
        --maxShapes=images:16x3x640x640 \
        --saveEngine=yolov7QAT.engine --fp16 --int8
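
For reference, here is a sketch of what those trtexec flags do through the TensorRT C++ API (written against TensorRT 8.x; the input name "images" and the min/opt/max shapes mirror the command above, and error handling is omitted for brevity):

```cpp
// Sketch (TensorRT 8.x API): the C++ equivalent of the trtexec command
// above. The input name "images" and the min/opt/max shapes mirror the
// flags; error handling is omitted for brevity.
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdio>
#include <fstream>
#include <memory>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
} gLogger;

int main() {
    using namespace nvinfer1;
    auto builder = std::unique_ptr<IBuilder>(createInferBuilder(gLogger));
    auto network = std::unique_ptr<INetworkDefinition>(builder->createNetworkV2(
        1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(
        nvonnxparser::createParser(*network, gLogger));
    parser->parseFromFile("yolov7qat.onnx",
                          static_cast<int>(ILogger::Severity::kWARNING));

    auto config = std::unique_ptr<IBuilderConfig>(builder->createBuilderConfig());
    config->setFlag(BuilderFlag::kINT8);  // --int8: the Q&DQ nodes supply the scales
    config->setFlag(BuilderFlag::kFP16);  // --fp16: FP16 for non-quantized layers

    // One optimization profile == --minShapes/--optShapes/--maxShapes.
    IOptimizationProfile* profile = builder->createOptimizationProfile();
    profile->setDimensions("images", OptProfileSelector::kMIN, Dims4{1, 3, 640, 640});
    profile->setDimensions("images", OptProfileSelector::kOPT, Dims4{12, 3, 640, 640});
    profile->setDimensions("images", OptProfileSelector::kMAX, Dims4{16, 3, 640, 640});
    config->addOptimizationProfile(profile);

    auto serialized = std::unique_ptr<IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));
    std::ofstream out("yolov7QAT.engine", std::ios::binary);
    out.write(static_cast<const char*>(serialized->data()), serialized->size());
}
```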

However, when testing performance with /usr/src/tensorrt/bin/trtexec --loadEngine=yourmodel.engine, the engine built with explicit dynamic batch shapes looks much slower. (Caveat: on a dynamic-shape engine, trtexec run without --shapes appears to benchmark at the profile's opt shape, batch 12 here, so the two results below are probably not measured at the same batch size; passing --shapes=images:1x3x640x640 would make the comparison fair.)

yolov7QAT.engine

=== Performance summary ===
[I] Throughput: 57.8406 qps
[I] Latency: mean = 17.8946 ms

yolov7QAT.engine with dynamic batch (max=16)

=== Performance summary ===
[I] Throughput: 23.8396 qps
[I] Latency: mean = 42.046 ms
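
Relatedly, when deploying a dynamic-batch engine, the input shape has to be set on the execution context before each inference. A minimal sketch of running the dynamic engine at batch 1 (TensorRT 8.x API; buffer sizes follow the [N,3,640,640] input and [N,25200,85] output shapes printed by the detect tool above, and error handling is omitted):

```cpp
// Sketch (TensorRT 8.x API): running a dynamic-batch engine at an
// explicit batch size. Buffer sizes follow the shapes printed by the
// detect tool above ([N,3,640,640] in, [N,25200,85] out).
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iterator>
#include <memory>
#include <vector>

class Logger : public nvinfer1::ILogger {
    void log(Severity, const char*) noexcept override {}
} gLogger;

int main() {
    using namespace nvinfer1;
    std::ifstream f("yolov7QAT.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(f)),
                           std::istreambuf_iterator<char>());
    auto runtime = std::unique_ptr<IRuntime>(createInferRuntime(gLogger));
    auto engine = std::unique_ptr<ICudaEngine>(
        runtime->deserializeCudaEngine(blob.data(), blob.size()));
    auto context = std::unique_ptr<IExecutionContext>(engine->createExecutionContext());

    // With dynamic shapes the batch dimension is -1 until it is set on
    // the context; enqueueV2 fails without this call.
    const int batch = 1;
    context->setBindingDimensions(0, Dims4{batch, 3, 640, 640});

    void* buffers[2];
    cudaMalloc(&buffers[0], batch * 3 * 640 * 640 * sizeof(float));  // images
    cudaMalloc(&buffers[1], batch * 25200 * 85 * sizeof(float));     // outputs
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    context->enqueueV2(buffers, stream, nullptr);
    cudaStreamSynchronize(stream);
}
```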