NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Why is YOLOv8 INT8 quantization using pytorch_quant slower than trt `--fp16`? #3762

Open luoshiyong opened 4 months ago

luoshiyong commented 4 months ago

Device: NVIDIA NX

1. Using trt `--fp16`: `/usr/src/tensorrt/bin/trtexec --onnx=best.onnx --workspace=4096 --saveEngine=best.engine --fp16`, the inference speed is 36.8 ms.
2. Using pytorch_quant INT8: `/usr/src/tensorrt/bin/trtexec --onnx=best.onnx --saveEngine=v8s_ptq.engine --int8 --workspace=4096`, the inference speed is 39.5 ms.
lix19937 commented 4 months ago

`--int8` means enable INT8 precision, in addition to FP32.
`--fp16` means enable FP16 precision, in addition to FP32.

Maybe your Q/DQ setting is bad. You can compare the per-layer profiling details between the two build logs.
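The per-layer comparison suggested here can be produced with trtexec's profiling flags. A minimal sketch using the commands from the issue (the profile file names are placeholders):

```shell
# Build and profile the FP16 engine, exporting per-layer timings to JSON.
/usr/src/tensorrt/bin/trtexec --onnx=best.onnx --workspace=4096 \
    --saveEngine=best.engine --fp16 \
    --profilingVerbosity=detailed --dumpProfile --separateProfileRun \
    --exportProfile=fp16_profile.json

# Same for the explicitly quantized INT8 engine.
/usr/src/tensorrt/bin/trtexec --onnx=best.onnx --workspace=4096 \
    --saveEngine=v8s_ptq.engine --int8 \
    --profilingVerbosity=detailed --dumpProfile --separateProfileRun \
    --exportProfile=int8_profile.json

# Layers whose time grew in the INT8 profile point at Q/DQ placements that
# broke layer fusion or forced extra reformat (FP32 <-> INT8) nodes.
```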

luoshiyong commented 4 months ago

> `--int8` means enable INT8 precision, in addition to FP32. `--fp16` means enable FP16 precision, in addition to FP32. Maybe your Q/DQ setting is bad. You can compare the per-layer profiling details between the two build logs.

I have tried various commands such as `--best` or `--int8 --fp16 --noTF32`, but none of them helped. I fully understand the options `--int8`, `--fp16`, and `--best`. What does a bad Q/DQ setting mean? What I got is that some layers are not applicable to INT8 quantization.

lix19937 commented 4 months ago

If you use TRT PTQ quantization (implicit quantization), you can ignore the Q/DQ settings.
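For contrast, an implicit-quantization PTQ build needs no Q/DQ nodes in the ONNX graph; TensorRT picks the INT8 scales from a calibration cache instead. A sketch assuming a calibration cache file (here hypothetically named `calib.cache`) has already been generated:

```shell
# Implicitly quantized build: plain ONNX (no Q/DQ nodes) plus a
# calibration cache that supplies the per-tensor INT8 scales.
/usr/src/tensorrt/bin/trtexec --onnx=best.onnx --workspace=4096 \
    --saveEngine=v8s_implicit_ptq.engine --int8 \
    --calib=calib.cache

# With implicit quantization, TensorRT is free to choose which layers run
# in INT8, so its fusion result is a useful baseline to compare against.
```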

luoshiyong commented 2 months ago

But I use TensorRT explicit quantization and insert Q/DQ nodes. I want to know how to insert proper Q/DQ nodes for better performance. Can you give me some advice?
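For intuition about what each inserted Q/DQ pair does: it makes the network compute a fake-quantized value, i.e. quantize to INT8 and immediately dequantize back to float. A stdlib-only sketch of the symmetric INT8 round-trip (the scale values are illustrative, not from the model):

```python
def qdq(x: float, scale: float, qmin: int = -128, qmax: int = 127) -> float:
    """Simulate a QuantizeLinear/DequantizeLinear pair (symmetric INT8).

    Quantize: round x to the nearest integer multiple of `scale`,
    clamped to the int8 range. Dequantize: map back to float.
    The gap between x and the result is the quantization error that
    a badly chosen scale (or Q/DQ placement) inflates.
    """
    q = min(qmax, max(qmin, round(x / scale)))  # QuantizeLinear
    return q * scale                            # DequantizeLinear

# In-range value: only a small rounding error remains.
print(qdq(3.3, 0.5))     # 3.5
# Out-of-range value saturates at qmax * scale.
print(qdq(1000.0, 0.5))  # 63.5
```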

lix19937 commented 2 months ago

@luoshiyong Refer to the PTQ fusion result, i.e. compare the layer fusions of your explicit-Q/DQ engine against those of the implicitly quantized PTQ engine.