NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Pointwise layers are not converted to int8 but to fp32/fp16 with TensorRT 8.4.2.4 when running on GPU RTX 3090 #3542

Closed Michelvl92 closed 9 months ago

Michelvl92 commented 11 months ago

Description

I tried to convert a YOLOv5s ONNX int8 model to a TensorRT int8 model, but the pointwise operations are not converted to int8 as I would expect. By default, they are converted to FP32 precision, which is unexpected. This means that the inference latency of the "int8" model is higher than that of an FP16 model.

My goal is to reduce the inference latency of a yolov5s fp16 model by quantizing it to int8; I expect that this can reduce the inference latency by up to 2x. What can I do to still reduce the inference latency of the int8 model?

Environment

I use the following NVIDIA Docker container: nvcr.io/nvidia/pytorch:22.08-py3. All the specs can be found in the framework matrix here

TensorRT Version: 8.4.2.4

NVIDIA GPU: RTX 3090

NVIDIA Driver Version: 535.54.03

CUDA Version: NVIDIA CUDA 11.7 Update

CUDNN Version: 8.5.0.96

Operating System: Ubuntu 20.04

Python Version: python 3.8

PyTorch version: 1.13.0a0+d321be6

Container: nvcr.io/nvidia/pytorch:22.08-py3

Steps To Reproduce

  1. I have quantized a default pre-trained yolov5s6 model to int8 with NVIDIA PyTorch post-training quantization.
  2. I have exported the quantized model with export.py, which converts the model to ONNX (opset 12) and then converts the ONNX model to TensorRT.
  3. By inspecting the TensorRT engine graph, I found that the "int8" pointwise layers are not converted to int8 but to fp32 (see attached graph 1). The "fp16" model is faster than the "int8" model.
  4. I have added the option config.set_flag(trt.BuilderFlag.FP16), and the pointwise layers are now FP16 (see attached graph 2, and the build sketch below). Why are the pointwise layers not converted to int8 precision? This makes the model far slower.
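For reference, the engine build corresponding to step 4 looks roughly like this (a minimal sketch of the TensorRT Python builder API; the file names are placeholders, not the exact export.py code):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the Q/DQ (explicit-quantization) ONNX exported from the quantized model.
with open("yolov5s6_qat.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)   # honor the Q/DQ scales in the graph
config.set_flag(trt.BuilderFlag.FP16)   # let non-quantized layers fall back to FP16

plan = builder.build_serialized_network(network, config)
with open("yolov5s6_qat.engine", "wb") as f:
    f.write(plan)
```

Without BuilderFlag.FP16, layers that cannot run in INT8 fall back to FP32, which matches the behaviour described in step 3.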

Relevant Files

Yolov5s6 "int8" model with int8 conversion options TRT graph

Yolov5s6 "int8" model with fp16/int8 conversion options TRT graph

Michelvl92 commented 11 months ago

I have tested it with a newer TensorRT version, 8.6.1, as installed in the container nvcr.io/nvidia/pytorch:23.10-py3, but I have the same problem. I have not tested TensorRT versions >= 9.x since these are not officially supported by trt-engine-explorer.

zerollzeng commented 11 months ago

How did you place the Q/DQ here? Could you please share the onnx here?

One quick way to confirm is to use the ONNX without Q/DQ and build the engine with trtexec --onnx=model.onnx --int8 --fp16; I think you should see it running in INT8.
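To see which precision each layer was actually assigned, the built engine can also be queried with the engine inspector; a minimal sketch using the TensorRT Python API (the engine path is a placeholder, and for per-layer details the engine should be built with detailed profiling verbosity, e.g. trtexec --profilingVerbosity=detailed --saveEngine=model.engine):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Deserialize an engine saved earlier (placeholder path).
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# EngineInspector (TensorRT >= 8.2) dumps the layer information TRT chose,
# including fused kernels; with detailed profiling verbosity the per-layer
# precisions appear in the JSON as well.
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
```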

Michelvl92 commented 11 months ago

@zerollzeng, the NVIDIA PyTorch Quantization Toolkit is used to add Q/DQ. I followed the NVIDIA example here: tutorials/quant_resnet50, using automatic layer substitution: "Automatic layer substitution is done with quant_modules. This should be called before model creation."

Would the ONNX graph be enough info for you? Here you can see the added Q/DQ nodes: yolov5s6 int8 onnx graph
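For reference, the post-training quantization flow from that tutorial looks roughly like this (a sketch; the model constructor, data loader, and file names are placeholders, not the exact code I used):

```python
import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

# Monkey-patch torch.nn layers with quantized equivalents.
# Must be called before the model is created.
quant_modules.initialize()
model = build_yolov5s6().cuda().eval()  # placeholder model constructor

# Collect calibration statistics on a small calibration set.
with torch.no_grad():
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            module.disable_quant()
            module.enable_calib()
    for images in calibration_loader:  # placeholder data loader
        model(images.cuda())
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            # For histogram calibrators pass a method instead, e.g.
            # module.load_calib_amax("percentile", percentile=99.99)
            module.load_calib_amax()
            module.enable_quant()
            module.disable_calib()

# Export with QuantizeLinear/DequantizeLinear nodes
# (opset >= 13 is needed for per-channel Q/DQ export).
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 1280, 1280, device="cuda")  # placeholder input shape
torch.onnx.export(model, dummy, "yolov5s6_qat.onnx", opset_version=13)
```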

zerollzeng commented 11 months ago

@ttyio should we add Q/DQ before the Mul layer? (see attached image)

nzmora-nvidia commented 11 months ago

@Michelvl92

I think you're missing two pairs of Q/DQ nodes - see the diagrams I've pasted and try adding the two pairs as shown (I've pasted the engine and ONNX graphs for the same subgraph). The Quant Toolkit only handles simple models and didn't add these. (see attached image)

Cheers,
Neta
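A sketch of one way to add the missing pairs with pytorch-quantization (the module and attribute names follow the addop/_input0_quantizer naming seen in the graph, but this is illustrative, not a built-in toolkit API):

```python
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

class QuantAdd(torch.nn.Module):
    """Residual add with fake-quantized inputs, so both Add inputs get Q/DQ
    nodes in the exported ONNX and TensorRT can keep the Add in INT8."""
    def __init__(self):
        super().__init__()
        qdesc = QuantDescriptor(num_bits=8, calib_method="histogram")
        self._input0_quantizer = quant_nn.TensorQuantizer(qdesc)
        self._input1_quantizer = quant_nn.TensorQuantizer(qdesc)

    def forward(self, x, y):
        # These quantizers must also go through the calibration pass.
        return self._input0_quantizer(x) + self._input1_quantizer(y)

# Illustrative patch of the YOLOv5 Bottleneck (not the upstream code):
#   self.addop = QuantAdd()
#   ...
#   return self.addop(x, self.cv2(self.cv1(x))) if self.add else self.cv2(self.cv1(x))
```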

Michelvl92 commented 11 months ago

@nzmora-nvidia Thank you for your answer. I had already found this out myself, and I also added even more Q/DQ nodes, as follows:

  1. Q/DQ on both inputs of the add in the bottleneck (as you showed), plus an extra Q/DQ on the concat in the C3 layer --> but something still seems to fail. The input of each bottleneck (which is the output of the previous conv) is split, and the PWN(sigmoid, mul) of that previous conv is not fused into the conv. The output of the PWN is then split and runs in FP16! Why is this, and how can I fix it? So everywhere a conv feeds a bottleneck block, its PWN is not fused (the result is split) and therefore runs in FP16 instead of int8.
  2. Q/DQ on all inputs of the maxpool layers plus one at the output in the SPPF block --> this worked!
  3. Q/DQ at the input of the upsample layer --> this failed in TRT (correct in ONNX; see the onnx/tensorrt graphs below).
  4. Q/DQ at the other concat layers --> it looks like this worked (see the concat sketch below).
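For the concat inputs, what seems to help is feeding all branches through one shared TensorQuantizer so they carry the same scale into the Concat; a sketch (illustrative module, not the upstream YOLOv5 code):

```python
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

class QuantConcat(torch.nn.Module):
    """Concat whose inputs share a single fake-quantizer, so every branch
    carries the same INT8 scale and TensorRT does not need to requantize."""
    def __init__(self, dim=1):
        super().__init__()
        self.dim = dim
        self._input_quantizer = quant_nn.TensorQuantizer(
            QuantDescriptor(num_bits=8, calib_method="histogram"))

    def forward(self, tensors):
        return torch.cat([self._input_quantizer(t) for t in tensors], dim=self.dim)
```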

Example of a C3 block that includes a bottleneck block

Example of an SPPF layer

Q/DQ on both inputs of the add in the bottleneck, but now the conv with PWN(sigmoid, mul) fails: onnx_quant_bottleneck

tensorrt_quant_bottleneck

Failed to quantize the resize (upsample) layer: it is correct in ONNX but not in TRT. Why? tensorrt_quant_upsample

onnx_quant_upsample

Full ONNX graph with additional Q/DQ nodes

Full TRT graph

nzmora-nvidia commented 11 months ago

Hi @Michelvl92, Please send the SVG images and ONNX files as attachments because it's too hard to follow this thread like this. Are the results here from the latest TRT 8.6.x?

Michelvl92 commented 11 months ago

@nzmora-nvidia, I have removed the large SVG/PNG graphs from the previous posts, and updated the question and problem.

The results are currently with TRT 8.4.2.4.

nzmora-nvidia commented 11 months ago

Thanks @Michelvl92, I think that TRT 8.6 will solve some of the issues. Can you give it a try?

Michelvl92 commented 11 months ago

@nzmora-nvidia, I have tried with TRT 8.6.1.6-1+cuda12.0.

See TRT graph:

  1. resize is now in int8! --> This is fixed now with the new TRT version
  2. For some strange reason conv + (sigmoid + mul) is still not always fused (in bottleneck layer), and the behaviour is very strange sometimes in the graph

See TRT graph

nzmora-nvidia commented 11 months ago

Thanks @Michelvl92, Indeed there are fusions that at face-value should occur and don't. I can't find the attachment of the ONNX file itself so I can't debug - please attach.

I suspect something unexpected is going on with the scales of some of the QDQ nodes. Can you verify that the scales of "/model/model/model.2/m/m.0/cv1/conv/_input_quantizer/QuantizeLinear" and "/model/model/model.2/m/m.0/addop/_input0_quantizer/QuantizeLinear" are the same?
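A quick way to check that with the onnx package (a sketch; it assumes the scales are stored as initializers rather than Constant nodes, and the file name is a placeholder):

```python
import onnx
from onnx import numpy_helper

model = onnx.load("yolov5s6_qat.onnx")  # placeholder path
inits = {init.name: numpy_helper.to_array(init) for init in model.graph.initializer}

names = {
    "/model/model/model.2/m/m.0/cv1/conv/_input_quantizer/QuantizeLinear",
    "/model/model/model.2/m/m.0/addop/_input0_quantizer/QuantizeLinear",
}
for node in model.graph.node:
    if node.op_type == "QuantizeLinear" and node.name in names:
        # QuantizeLinear inputs are (x, y_scale, y_zero_point).
        print(node.name, "scale =", inits.get(node.input[1]))
```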

BTW, TRT graphs in SVG format have more information in them (i.e. pop-up when hovering over an operator).

Michelvl92 commented 11 months ago

@nzmora-nvidia thanks for your reply. With onnx-graphsurgeon I have removed and reconnected the double Q/DQ nodes, and it now looks correct, as you can see. Almost all layers are now int8, except the last ones at the end, which is not a big problem.
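Roughly, the surgery follows this pattern (a sketch with onnx-graphsurgeon; the file names are placeholders and the back-to-back DQ -> Q detection is simplified):

```python
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("yolov5s6_qat.onnx"))  # placeholder path

for q in [n for n in graph.nodes if n.op == "QuantizeLinear"]:
    producers = q.inputs[0].inputs    # nodes producing this Q's input tensor
    consumers = q.outputs[0].outputs  # nodes consuming this Q's output tensor
    # A Q whose input already comes from a DequantizeLinear is part of a
    # redundant (double) Q/DQ pair: DQ -> Q -> DQ can be bypassed.
    if not producers or producers[0].op != "DequantizeLinear":
        continue
    if len(consumers) != 1 or consumers[0].op != "DequantizeLinear":
        continue
    redundant_dq = consumers[0]
    # Re-route every consumer of the redundant DQ to read the tensor feeding
    # the redundant Q instead; cleanup() then drops the dangling Q/DQ pair.
    for consumer in list(redundant_dq.outputs[0].outputs):
        consumer.inputs = [q.inputs[0] if t is redundant_dq.outputs[0] else t
                           for t in consumer.inputs]

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "yolov5s6_qat_fixed.onnx")
```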

Do you see any points to optimize? The total speedup of the int8 model compared to the FP16 model is only 14%; is this what you would expect? I would have expected a speedup of 40%.

Do you see how I can make better speed improvements?

Thanks!

TRT graph

nzmora-nvidia commented 10 months ago

@Michelvl92 w/o the ONNX I cannot debug so I'm letting you know the problems that I see in the output engine, but I cannot be sure of their root-cause.

Michelvl92 commented 9 months ago

Clear, fixed the issue.

Raj-vivid commented 1 month ago

@Michelvl92 w/o the ONNX I cannot debug so I'm letting you know the problems that I see in the output engine, but I cannot be sure of their root-cause.

  • Check if /model/model/model.33/detectquantinput/QuantizeLinear and /model/model/model.24/conv/_input_quantizer/QuantizeLinear have the same scale value. If they don't that would explain why they don't fuse.
  • The extra scale ops are unfused Q/DQ operators (e.g. the two Q operators in the bullet above and /model/model/model.33/detectquantinput/DequantizeLinear)
  • /model/model/model.33/m.0/Conv looks problematic. Check if you quantized it correctly.
  • These 2 issues repeat several times at the lower part of the graph and that's why you see some non-quantized layers.
  • Quantizing MaxPooling and Resize for small shapes (like yours) may not help much (GPU utilization) but you get the benefit of Conv outputs being quantized so there's overall benefit.

Could you give me some general pointers for how you handle explicit fp32 operations present in the graph? I am using netron to visualize the quantized graph and I see a lot of mul, add, pow, transpose, mean_reduce operations in fp32.