NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

TensorRT QAT model is slower than PTQ model !!! #3038

Closed tp111222 closed 10 months ago

tp111222 commented 1 year ago

Description

Yolov8m TensorRT QAT model is slower than PTQ model

Environment

TensorRT Version: 8.4.1.5
NVIDIA GPU: RTX 2080
NVIDIA Driver Version:

CUDA Version: 11.1
CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

PTQ infer time: (screenshot attachment)

QAT infer time: (screenshot attachment)

lix19937 commented 1 year ago

It is quite common for a TensorRT QAT model to be slower than a PTQ model. The Q/DQ nodes may not be placed correctly, which leads to poor layer fusion.

zerollzeng commented 1 year ago

Yes.

tp111222 commented 1 year ago

this is yolov8m quant model:

https://pan.baidu.com/s/1MC7BYxs71wzLdEqXBHmF-Q?pwd=ysha

How to set Q-DQ correctly?

zerollzeng commented 1 year ago

We have some guidance about this topic, see https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#qdq-placement-recs

This is not a trivial task and requires some trial and error.

tp111222 commented 1 year ago

@zerollzeng Here are the YOLOv8 quantized model's SVG graph and the yolov8m_qat2_layer JSON (attachments). Can you give us some guidance on which nodes to optimize?

tp111222 commented 1 year ago

Can you give me some guidance on which nodes to optimize?

zerollzeng commented 1 year ago

Well, it's hard to analyze from the image alone...

One way to check is through the logs: build a PTQ engine (e.g. with trtexec --int8 --fp16) and compare its final engine (in the verbose log) with the QAT engine. Check the differences between them, e.g. layer precision (some layers may still run in FP16 in the QAT model) and layer fusions (redundant Q/DQ nodes may break layer fusion).
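The log comparison described above can be partly automated. As a rough sketch (not from this thread), the snippet below pulls the `Layer(<impl>): <name>, ...` lines that trtexec prints under "Engine Layer Information" in a `--verbose` log and diffs the fused-layer names of two builds; the exact line shape is assumed from TRT 8.x logs and may vary between versions:

```python
import re

# Assumed line shape from a TRT 8.x trtexec --verbose log:
#   Layer(CaskConvolution): Conv_0 + Relu_1, Tactic: 0x..., ...
LAYER_RE = re.compile(r"Layer\((\w+)\): ([^,]+)")

def engine_layers(log_text: str) -> dict:
    """Map fused-layer name -> implementation kind, parsed from the
    'Engine Layer Information' lines of a trtexec verbose log."""
    return {m.group(2).strip(): m.group(1)
            for m in LAYER_RE.finditer(log_text)}

def diff_engines(ptq_log: str, qat_log: str):
    """Layers present in one engine but not the other -- a quick way
    to spot broken fusions or extra reformat layers in the QAT build."""
    ptq, qat = engine_layers(ptq_log), engine_layers(qat_log)
    return sorted(set(ptq) - set(qat)), sorted(set(qat) - set(ptq))
```

A fused name like `Conv_0 + Relu_1` appearing only in the PTQ engine, while the QAT engine shows `Conv_0` plus a separate reformat node, is exactly the broken-fusion symptom to look for.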

tp111222 commented 1 year ago

@zerollzeng What should I do after I find out that the PTQ engine and the QAT engine are different?

zerollzeng commented 1 year ago

You can check it in the verbose log; there is a section called "Engine Layer Information".

tp111222 commented 1 year ago

@zerollzeng Let's just communicate in Chinese. What I mean is: after comparing the differences between the PTQ engine and the QAT engine, how do I then go about optimizing the QAT engine so that it reaches the PTQ engine's performance?

zerollzeng commented 1 year ago

The end goal is to make the engine built from your QAT model have the same structure as the PTQ engine (this may cost some accuracy, so you need to weigh the trade-off yourself). With that goal in mind, adjust and tune the placement of the Q/DQ nodes (pytorch-quantization).

yuanjiechen commented 1 year ago

Based on my results, I guess PTQ gives a fully INT8 engine but worse performance on YOLO. The QAT engine has many reformat layers (INT8 to FP32/FP16) that consume too much time; you can use the engine profiler to confirm this. Replace SiLU with ReLU and retrain the model, then add Q/DQ nodes and fine-tune the model again. That model has a better chance of running fully in INT8; otherwise, running everything in FP16 will give the best inference time.
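One way to quantify the reformat overhead mentioned above is the per-layer profile that `trtexec --exportProfile=profile.json` writes. The sketch below sums the time attributed to reformat layers; the JSON field names ("name", "averageMs") and the "Reformatting CopyNode for ..." naming are assumptions based on TRT 8.x output and may differ in other versions:

```python
import json

def reformat_cost_ms(profile_json: str) -> float:
    """Sum the average per-iteration time spent in reformat layers,
    given the JSON list written by `trtexec --exportProfile`.
    Field names assumed from TRT 8.x; the leading {"count": N}
    entry has no "name" and is skipped automatically."""
    total = 0.0
    for entry in json.loads(profile_json):
        name = entry.get("name", "")
        if "Reformat" in name or "reformat" in name:
            total += entry.get("averageMs", 0.0)
    return total
```

If this number is a large fraction of the total latency, the QAT engine is paying for INT8/FP16 precision transitions that the PTQ engine avoids.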

WeixiangXu commented 1 year ago

Piggybacking on this thread: does TRT have any suggestions for INT8 quantization deployment of transformer architectures (considering both speed and precision, with PTQ via calibration, PyTorch QAT, or anything else)? @zerollzeng Thanks!

zerollzeng commented 1 year ago

There has been some progress and effort on this; I'll let @ttyio answer the question.

ttyio commented 1 year ago

For transformers, we will release some new demos with FP8 and INT8 using Q/DQ nodes in the coming months. Thanks!

WeixiangXu commented 1 year ago

Thanks! Where will it be released? @ttyio

ttyio commented 1 year ago

@WeixiangXu Sorry, we need to wait for the official announcement. Thanks!

ttyio commented 10 months ago

TensorRT-LLM has now been released at https://github.com/NVIDIA/TensorRT-LLM. Closing, and thanks all!

J-xinyu commented 4 months ago

> @zerollzeng this is yolov8 quant svg file!

Hey bro, how did you get this SVG image?

lix19937 commented 4 months ago

> Hey bro, how did you get this svg image?

trex (TensorRT Engine Explorer):
https://github.com/NVIDIA/TensorRT/tree/release/8.6/tools/experimental/trt-engine-explorer
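trex works from JSON metadata exported for an engine (the linked tool's `utils/process_engine.py` drives trtexec to produce it). As a minimal illustration of inspecting that graph JSON directly, the sketch below builds a histogram of layer types; the "Layers"/"LayerType" field names are assumptions based on TRT 8.x `--exportLayerInfo` output:

```python
import json
from collections import Counter

def layer_type_histogram(layer_info_json: str) -> Counter:
    """Count layers by type in the engine graph JSON that trex
    consumes (field names "Layers"/"LayerType" assumed from the
    TRT 8.x trtexec --exportLayerInfo format)."""
    doc = json.loads(layer_info_json)
    return Counter(layer.get("LayerType", "unknown")
                   for layer in doc.get("Layers", []))
```

A high `Reformat` count relative to compute layers is the same symptom discussed earlier in the thread, visible without rendering the full SVG.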