NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

How to reduce reformat layers in QAT? #4039

Open yuanjiechen opened 3 months ago

yuanjiechen commented 3 months ago

Description

I want to fine-tune a quantized YOLO model and export it to TRT. I carefully read the Q/DQ documentation and some existing issues in order to place Q/DQ nodes and remove unused ones. The model has 92% of its layers running in INT8 precision, but it still has 70 reformat layers.

Environment

TensorRT Version: 8.5.0.2

NVIDIA GPU: AGX Orin 16GB

NVIDIA Driver Version:

CUDA Version: 11.4

CUDNN Version:

Operating System: Ubuntu 20.04

Python Version (if applicable): 3.8.10

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

trex

Model link:

model.zip. Use a normal trtexec command to convert the ONNX model to an engine.
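
For context, a typical invocation along those lines might look like the following. This is only a sketch: the file names and the extra layer-info flags are assumptions, not taken from the issue.

```shell
# Build an INT8 engine from the QAT ONNX model and dump per-layer information
# so reformat layers can be inspected afterwards (e.g. with trex).
trtexec --onnx=model.onnx \
        --int8 --fp16 \
        --saveEngine=model.engine \
        --profilingVerbosity=detailed \
        --exportLayerInfo=layer_info.json
```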

lix19937 commented 3 months ago

Reformat layers are usually caused by data type switching or data memory layout adjustment.
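
For reference, a minimal sketch of one way to count those layers from a built engine with the engine inspector, assuming the engine was built with detailed profiling verbosity; the engine file name is just a placeholder.

```python
import json
import tensorrt as trt

def count_reformat_layers(engine_path):
    # Deserialize the engine and query per-layer information as JSON.
    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(engine_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    inspector = engine.create_engine_inspector()
    info = json.loads(
        inspector.get_engine_information(trt.LayerInformationFormat.JSON))
    # With detailed verbosity each entry describes one layer; reformat layers
    # show up with "Reformat" in their type or name.
    layers = info.get("Layers", [])
    return sum("Reformat" in json.dumps(layer) for layer in layers)

print(count_reformat_layers("model.engine"))
```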

yuanjiechen commented 3 months ago

> Reformat layers are usually caused by data type switching or data memory layout adjustment.

Thanks for your reply. I understand the purpose of the reformat layer. But in the figure, most inputs and outputs of the reformat layers are the same INT8 NCHW32 format, and I don't understand why it transforms NCHW32 to NCHW4 and then transforms back. Compared with the PTQ model's trex results, where there are only 20+ reformat layers, is this my fault in Q/DQ placement?

Another question: does TRT 8.5 support channel-wise activation quantization on conv layers? For example, setting axis=(1) and giving one amax per input channel in pytorch-quantization. The model can be exported to ONNX, but an error occurs when converting it to TRT. (Actually, I put the channel-wise activation Q/DQ before a split layer.) Screenshot from 2024-08-01 09-17-50 attached.
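
For illustration, a minimal sketch of the per-channel activation setup described above, using pytorch-quantization. The channel counts and layer parameters are made up for the example, and this only shows the descriptor wiring; calibration is still needed before export.

```python
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# Activation (input) quantizer over the input-channel axis of NCHW, i.e. one
# amax per input channel, as attempted in the question above.
act_desc = QuantDescriptor(num_bits=8, axis=(1,))
# Weight quantizer over the output-channel axis, the usual per-channel setting.
wgt_desc = QuantDescriptor(num_bits=8, axis=(0,))

conv = quant_nn.QuantConv2d(
    in_channels=64, out_channels=128, kernel_size=3, padding=1,
    quant_desc_input=act_desc, quant_desc_weight=wgt_desc,
)
# After calibration, conv._input_quantizer.amax would hold 64 values
# (one per input channel) instead of a single per-tensor scale.
```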

lix19937 commented 3 months ago

> most inputs and outputs of the reformat layers are the same INT8 NCHW32 format, and I don't understand why it transforms NCHW32 to NCHW4 and then transforms back

INT8 NCHW32 --> NCHW4 --> NCHW32, like the following (see the attached image).

TensorRT simply chooses the best layout for the concat and conv ops to get the best performance.

> Compared with the PTQ model's trex results, where there are only 20+ reformat layers, is this my fault in Q/DQ placement?

Yes. Can you upload the SVG file of the PTQ engine?

yuanjiechen commented 3 months ago

@lix19937 Thanks for your reply. Here is the PTQ engine SVG figure (attachment: ptq).

ttyio commented 3 months ago

@yuanjiechen could you upgrade your TRT version to the latest 10.x and retry? Thanks!

yuanjiechen commented 3 months ago

@ttyio Thanks for your reply! The latest TRT version I can find for Jetson Orin is 8.6.x?

ttyio commented 3 months ago

> @ttyio Thanks for your reply! The latest TRT version I can find for Jetson Orin is 8.6.x?

yes, let's see if 8.6 helps here. thanks!

yuanjiechen commented 3 months ago

@ttyio Thanks for your reply! My Orin NX died after updating to 8.6 via apt. I reflashed the disk and am now successfully running JetPack 6.0. But when I build the engine from the same ONNX I used with 8.5, it takes much more time with TRT 8.6, almost 3~4x. I also tried another NMS method, INMSLayer, but it hits another bug:

    [08/15/2024-10:29:05] [TRT] [E] (Unnamed Layer* 1057) [Constant]: constant weights has count 1 but 0 was expected
    [08/15/2024-10:29:05] [TRT] [E] [network.cpp::addNMS::1636] Error Code 3: API Usage Error (Parameter check failed at: optimizer/api/network.cpp::addNMS::1636, condition: maxOutputBoxesPerClass.getDimensions().nbDims == 0

Here is my code related to the error. slice_box is a slice layer output with shape (1, 8400, 4) and slice_class is a slice layer output with shape (1, 8400, 20).

    max_det = 300  # an integer
    det_pack = self.network.add_constant([1], np.array(max_det, dtype=np.int32))
    iou_pack = self.network.add_constant([1], np.array([iou_thres], dtype=np.float32))
    conf_pack = self.network.add_constant([1], np.array([conf_thres], dtype=np.float32))

    nms_layer = self.network.add_nms(slice_box.get_output(0), slice_class.get_output(0), det_pack.get_output(0))  # --> error
    nms_layer.set_input(3, iou_pack)
    nms_layer.set_input(4, conf_pack)

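For comparison, here is a minimal sketch of how the documented INMSLayer contract reads: maxOutputBoxesPerClass and the optional threshold inputs are 0-D (scalar) tensors, and set_input takes ITensor objects (layer outputs) rather than the layers themselves. The names below follow the snippet above; whether a 0-D add_constant builds cleanly on this TRT version is exactly what the first error message calls into question, so treat this as an assumption to verify rather than a confirmed fix.

```python
import numpy as np

def build_nms(network, slice_box, slice_class, max_det=300,
              iou_thres=0.45, conf_thres=0.25):
    # 0-D (scalar) constants: INMSLayer requires nbDims == 0 for
    # maxOutputBoxesPerClass, iouThreshold and scoreThreshold.
    det_pack = network.add_constant((), np.array([max_det], dtype=np.int32))
    iou_pack = network.add_constant((), np.array([iou_thres], dtype=np.float32))
    conf_pack = network.add_constant((), np.array([conf_thres], dtype=np.float32))

    nms_layer = network.add_nms(slice_box.get_output(0),
                                slice_class.get_output(0),
                                det_pack.get_output(0))
    # Pass the constants' output tensors, not the ILayer objects.
    nms_layer.set_input(3, iou_pack.get_output(0))   # iouThreshold
    nms_layer.set_input(4, conf_pack.get_output(0))  # scoreThreshold
    return nms_layer
```
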
What are the differences between the three NMS implementation methods in TRT (INMSLayer, batchedNMSPlugin, efficientNMSPlugin)?

yuanjiechen commented 3 months ago

I tested the same ONNX model on another Orin NX 8GB; it runs about 2.5x faster than on the 16GB board (46 ms -> 21 ms). Even though TRT 8.6 produces about 10% fewer reformat layers than 8.5, it is still far away from the PTQ result.

lix19937 commented 1 month ago

Refer to the PTQ fusion result to adjust the Q/DQ locations.
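
One common adjustment when matching the PTQ fusions is to disable or move individual quantizers in pytorch-quantization. A minimal sketch, where the model and module name are placeholders for whatever layer's Q/DQ placement breaks a fusion seen in the PTQ engine:

```python
from pytorch_quantization import nn as quant_nn

# Disable the activation quantizer of one specific module so TensorRT can fuse
# that region the same way the PTQ engine does.
for name, module in model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer) and name.endswith("some_layer._input_quantizer"):
        module.disable()
```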