Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

TensorRT performance is slower than that reported on Deci-AI website #1346

Closed Gaozhongpai closed 1 year ago

Gaozhongpai commented 1 year ago

💡 Your Question

I followed the Colab notebook https://colab.research.google.com/drive/1yHrHkUR1X2u2FjjvNMfUbSXTkUul6o1P?usp=sharing for Quantization-Aware Finetuning of YoloNAS on a custom dataset. After that, I converted the QAT ONNX model to TensorRT with `trtexec --fp16 --int8 --avgRuns=100 --onnx=yolonas-hand-quan_16x3x640x640_qat.onnx`.

The result is slower than what is reported here:

*(screenshot: trtexec benchmark results)*

Versions

No response

BloodAxe commented 1 year ago

It looks like you passed the `--fp16` and `--int8` flags to TRT at the same time. So which mode was it benchmarking? Can you please also clarify which variant of YoloNAS you exported?

BloodAxe commented 1 year ago

OK, it seems you can indeed pass the fp16 and int8 flags simultaneously. Let us check internally what is happening; I will get back to you once we have more info on this.

Gaozhongpai commented 1 year ago

Thank you for the speedy response. I am using yolo_nas_l. I followed the instructions here: https://docs.deci.ai/super-gradients/documentation/source/BenchmarkingYoloNAS.html#step-1-export-yolonas-to-onnx, which pass the `--fp16` and `--int8` flags at the same time.

Gaozhongpai commented 1 year ago

For comparison, here is YOLOv7-QAT using the same conversion, `trtexec --fp16 --int8 --avgRuns=100 --onnx=yolov7_qat_640.onnx`:

*(screenshot: trtexec benchmark results for YOLOv7-QAT)*

BloodAxe commented 1 year ago

Based on the command `trtexec --fp16 --int8 --avgRuns=100 --onnx=yolonas-hand-quan_16x3x640x640_qat.onnx`, it looks like you exported and benchmarked the model with batch size 16. The throughput of 11.6 qps is therefore per batch of 16 images, so the effective throughput is 11.6 × 16 ≈ 186 FPS.

Attaching the docs just in case: https://docs.deci.ai/super-gradients/documentation/source/BenchmarkingYoloNAS.html
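As a quick sanity check on the arithmetic above (`effective_fps` is a hypothetical helper, not a trtexec option — trtexec's qps counts batches, not individual images):

```python
def effective_fps(qps: float, batch_size: int) -> float:
    """trtexec reports throughput in queries (batches) per second;
    each query processes batch_size images, so per-image throughput
    is qps * batch_size."""
    return qps * batch_size

print(effective_fps(11.6, 16))  # 185.6 images per second
```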

BloodAxe commented 1 year ago

Related issues:

Gaozhongpai commented 1 year ago

> Based on the command `trtexec --fp16 --int8 --avgRuns=100 --onnx=yolonas-hand-quan_16x3x640x640_qat.onnx`, it looks like you exported and benchmarked the model with batch size 16. The throughput of 11.6 qps is therefore per batch of 16 images, so the effective throughput is 11.6 × 16 ≈ 186 FPS.
>
> Attaching the docs just in case: https://docs.deci.ai/super-gradients/documentation/source/BenchmarkingYoloNAS.html

Thank you very much.

On this page: https://docs.deci.ai/super-gradients/documentation/source/BenchmarkingYoloNAS.html#step-1-export-yolonas-to-onnx, the example uses batch_size = 32. How can the result be 242.751 qps?
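For illustration only (the `effective_fps` helper is hypothetical, and the two runs may be on different hardware): if the documented 242.751 qps is read the same way, as a per-batch figure, the implied per-image throughput is very large, which is the apparent discrepancy behind this question:

```python
# Hypothetical helper: trtexec's qps counts batches per second,
# so per-image throughput is qps * batch_size.
def effective_fps(qps: float, batch_size: int) -> float:
    return qps * batch_size

# This thread's run: batch 16 at 11.6 qps.
print(effective_fps(11.6, 16))     # 185.6 images/s

# Docs figure, read the same way: batch 32 at 242.751 qps.
print(effective_fps(242.751, 32))  # 7768.032 images/s
```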

BloodAxe commented 1 year ago

I'm not ready to comment on this at the moment; I will clarify with the team members who ran the benchmarks and get back to you once I have more information.