NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt

TensorRT QAT does not support resize op in INT8? #2976

Closed · lix19937 closed this 1 year ago

lix19937 commented 1 year ago

Description

How can a resize (upsample) op be run in INT8 with QAT (tools/pytorch-quantization), other than by replacing it with ConvTranspose?

Environment

TensorRT Version: 8.4

zerollzeng commented 1 year ago

@ttyio ^ ^

ttyio commented 1 year ago

@lix19937, could you elaborate on your issue? There is no need to insert Q/DQ before resize; TensorRT should run resize in INT8 automatically for patterns like resize -> Q/DQ -> conv. Thanks!
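
To illustrate that pattern, a minimal sketch using tools/pytorch-quantization (the module, file names, and the fixed amax values are placeholders for this example, not taken from the thread): nn.Upsample itself is left unquantized, and the Q/DQ pair that QuantConv2d places on its input lands between the Resize and the Conv in the exported ONNX.

import torch
import torch.nn as nn
from pytorch_quantization import nn as quant_nn

class UpsampleThenConv(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")   # exports as ONNX Resize
        self.conv = quant_nn.QuantConv2d(16, 16, 3, padding=1)  # adds input/weight quantizers

    def forward(self, x):
        return self.conv(self.up(x))

model = UpsampleThenConv().eval()
# Placeholder calibration scales; a real flow obtains these from calibration.
model.conv._input_quantizer.amax = 1.0
model.conv._weight_quantizer.amax = 1.0

# Export fake quantization as QuantizeLinear/DequantizeLinear ONNX ops.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model, torch.randn(1, 16, 32, 32), "resize_qdq_conv.onnx", opset_version=13)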

lix19937 commented 1 year ago

@ttyio

[screenshots: quant.onnx graph and its TREx engine-plan view]

It seems that the first resize op is not quantized to INT8, even though resize commutes with DQ and with Q.

Monoclinic commented 1 year ago

I have a similar problem. It seems that the input tensor of the resize layer (nn.Upsample) is automatically rescaled to FP16/FP32, which takes some time, so the network can end up even slower than pure FP16. In my experiments deconv (ConvTranspose) hits the same problem: the deconv itself runs in INT8, while BN and ReLU run in FP32.

lix19937 commented 1 year ago

@Monoclinic @ttyio If you remove all the scales (Q/DQ nodes) from quant.onnx (https://github.com/NVIDIA/TensorRT/issues/2976#issuecomment-1550587913, the ONNX described above), save it as unquant.onnx, and then run

trtexec --best \
        --profilingVerbosity=detailed \
        --separateProfileRun \
        --exportProfile=profile.json \
        --exportLayerInfo=layerinfo.json \
        --onnx=unquant.onnx

the first resize op will run in INT8.

Monoclinic commented 1 year ago

@lix19937 Hello, thanks for your reply. May I ask how to remove the scale layers? I have tried exporting the ONNX model without Q/DQ (just the original PyTorch model), and that way the resize runs in INT8. However, if you apply Q/DQ in PyTorch and export to ONNX, TRT dequantizes the tensor to FP32 and then does the resize.

ttyio commented 1 year ago

I tried 8.6 GA with a toy resize + Q/DQ + conv model, and the resize runs in INT8 precision. Not sure what corner case you hit for your 1st resize. Are you using 8.6 GA? Could you share the ONNX file for debugging? @lix19937 thank you!

[screenshot: toy resize + Q/DQ + conv engine with the resize running in INT8]

lix19937 commented 1 year ago

@Monoclinic Hi, you can remove the Q/DQ nodes of quant.onnx with onnx-graphsurgeon or the ONNX APIs, and save the scales at the same time.

This just shows that the unquantized ONNX gets better fusion from TRT PTQ, e.g. the resize runs in INT8; with QAT, the resize op runs in FP32/FP16.
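
A minimal sketch of that removal with onnx-graphsurgeon (my illustration, not lix19937's actual script; per-channel scale bookkeeping and Q/DQ nodes that feed graph outputs are not handled here):

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("quant.onnx"))
scales = {}

for node in graph.nodes:
    if node.op in ("QuantizeLinear", "DequantizeLinear"):
        if isinstance(node.inputs[1], gs.Constant):
            scales[node.name] = node.inputs[1].values      # keep the scale for later use
        data, out = node.inputs[0], node.outputs[0]
        for consumer in list(out.outputs):                 # rewire each consumer around the node
            for i, t in enumerate(consumer.inputs):
                if t is out:
                    consumer.inputs[i] = data
        node.outputs.clear()                               # disconnect so cleanup() drops it

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "unquant.onnx")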

lix19937 commented 1 year ago

@ttyio My TRT version is v8.4.10; quant.onnx is in the attached quant.zip, you can unzip it, thanks.
quant.zip

ttyio commented 1 year ago

@lix19937, we added an INT8 resize kernel in 8.4, but the QAT fusion rule was not updated; could you upgrade to 8.6 GA?

lix19937 commented 1 year ago

Hi @ttyio, thanks, it works with v8.5.10 on Orin (DRIVE OS 6.0.6.0).
[screenshot: engine plan showing the resize running in INT8]

Monoclinic commented 1 year ago

@lix19937 Hello, sorry for disturbing you. I also use a Jetson Orin, and I wonder how you installed/upgraded your TRT. I tried to install TRT 8.6/8.5 on my Orin (my current version is 8.4.0.1), but the released packages are built for x86_64 or ARM SBSA, which are not suitable for Jetson devices. Is it necessary to reflash the whole JetPack to get a higher version of TRT?

// (The same question in Chinese, translated:) I am also on an Orin; my current TRT version is 8.4. This morning I tried installing 8.6 (x86) and 8.5 (ARM SBSA). Installation itself went fine and the demos run, but converting ONNX to TRT hits a straight segmentation fault, and I cannot figure out why. The x86 package is presumably just a platform mismatch and unusable (oddly, compilation still passes), and for ARM SBSA I saw people on the NVIDIA forum say it does not necessarily fit Jetson boards, which matches what I found after installing it. May I ask how you got onto 8.5: did you reflash the machine with JetPack, or upgrade from 8.4 some other way?

lix19937 commented 1 year ago

@Monoclinic The segmentation fault may be due to version-compatibility issues in the CUDA-X stack.

In Orin, TRT v8510 means TRT v860 (v8.5.10 corresponds to v8.6.0); I just switched to another Orin devkit that has DRIVE OS 6.0.6.0 installed (which maps to TRT v8.5.10).

You can install nv-driveos-repo-sdk-linux-6.0.6.0-32441545_6.0.6.0_amd64.deb as follows:

1. Clean up any previous installation;
2. Install the host components on P3710;
3. Flash DRIVE OS Linux;
4. Install CUDA/cuDNN/TensorRT: nv-tensorrt-repo-ubuntu2004-cuda11.4-trt8.5.10.4-d6l-target-ga-20221229_1-1_arm64.deb
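
After flashing, a quick sanity check of the installed version (assuming the TensorRT Python bindings were installed along with the .deb packages above):

import tensorrt as trt

print(trt.__version__)  # expect something like "8.5.10.4" on DRIVE OS 6.0.6.0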

Monoclinic commented 1 year ago

@lix19937 Thanks for your kind advice. I will give it a try.

wenqibiao commented 1 year ago

@lix19937

Hi, what does this sentence mean: "In Orin, TRT v8510 means TRT v860"? And where did you get this information? Thank you very much!

wenqibiao commented 1 year ago

@ttyio Hi, is that right: "In Orin, TRT v8510 means TRT v860"? I tried to use Ampere 4:2 sparsity, but I saw a large metric drop in my DL model with TensorRT 8.5.3.1. With TensorRT 8.6.1.6 the problem disappeared.

lix19937 commented 1 year ago

> Hi, what does this sentence mean: "In Orin, TRT v8510 means TRT v860"? And where did you get this information? Thank you very much!

You can refer to NVIDIA-TensorRT-8.5.10-API-Reference-for-DRIVE-OS.pdf; if you want to find the exact version, check NvInferVersion.h. Btw, DRIVE OS 6.0.6.0 updates frequently.
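
For completeness, a small sketch (my illustration, assuming TensorRT 8.x and that libnvinfer.so is on the loader path) of reading the linked version at runtime instead of opening NvInferVersion.h by hand: getInferLibVersion() returns major*1000 + minor*100 + patch, which is exactly where the shorthand "v8510" for 8.5.10 comes from.

import ctypes

# getInferLibVersion() is an extern "C" symbol exported by libnvinfer.
ver = ctypes.CDLL("libnvinfer.so").getInferLibVersion()
print(ver, "->", f"{ver // 1000}.{(ver % 1000) // 100}.{ver % 100}")  # e.g. 8510 -> 8.5.10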

lix19937 commented 1 year ago

> I tried to use Ampere 4:2 sparsity, but I saw a large metric drop in my DL model with TensorRT 8.5.3.1. With TensorRT 8.6.1.6 the problem disappeared.

You can compare the builder tactics between v8.5.3.1 and v8.6.1.6 on your model, and check the fusion state and which sparse layers were chosen. @wenqibiao
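
One way to do that comparison (my sketch, not a tool from this thread): build the same model with each TensorRT version using trtexec --onnx=model.onnx --best --profilingVerbosity=detailed --exportLayerInfo=layerinfo_<ver>.json, then diff the per-layer records. The JSON schema varies across TensorRT versions; this assumes each entry under "Layers" carries "Name" and "Precision" fields, as the detailed layer info of TRT 8.x does.

import json

def layer_precisions(path):
    # Map each engine layer name to its chosen precision.
    with open(path) as f:
        layers = json.load(f)["Layers"]
    return {l["Name"]: l.get("Precision") for l in layers}

a = layer_precisions("layerinfo_8531.json")
b = layer_precisions("layerinfo_8616.json")
for name in sorted(set(a) | set(b)):
    if a.get(name) != b.get(name):  # layer fused differently or precision changed
        print(f"{name}: {a.get(name)} -> {b.get(name)}")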

wenqibiao commented 1 year ago

@lix19937 many thanks!