I have tested it with a newer TensorRT version, 8.6.1, as installed in the container nvcr.io/nvidia/pytorch:23.10-py3, but I have the same problem. I have not tested it with TensorRT versions >= 9.x, since those are not officially supported by trt-engine-explorer.
How did you place the Q/DQ here? Could you please share the ONNX here?
One quick way to confirm is to use the ONNX without Q/DQ and build the engine with trtexec --onnx=model.onnx --int8 --fp16; I think you should see it running in INT8.
@zerollzeng, the NVIDIA pytorch-quantization toolkit is used to add Q/DQ. I used the NVIDIA example here: tutorials/quant_resnet50, by calling quant_modules as documented: _Automatic layer substitution is done with quant_modules. This should be called before model creation._
Would the ONNX graph be enough info for you? Here you can see the added Q/DQ nodes: yolov5s6 int8 onnx graph
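For reference, a minimal sketch of that substitution step (the model-loading line is just illustrative; it is not the exact YOLOv5 training code used here):

```python
# Minimal sketch, assuming the pytorch-quantization toolkit from the quant_resnet50
# tutorial: quant_modules.initialize() monkey-patches the torch.nn layers with their
# quantized counterparts, so every Conv/Linear created afterwards carries input and
# weight TensorQuantizers that export as Q/DQ nodes in ONNX.
import torch
from pytorch_quantization import quant_modules

quant_modules.initialize()  # must be called BEFORE the model is created

# Illustrative model construction; the actual YOLOv5s6 code differs.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s6')
```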
@ttyio should we add Q/DQ before the Mul layer?
@Michelvl92
I think you're missing two pairs of Q/DQ nodes - see the diagrams I've pasted and try adding the two pairs as shown (I've pasted the engine and ONNX graphs for the same subgraph). The Quant Toolkit only handles simple models and didn't add these. Cheers Neta
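For readers following along, a rough sketch of how such an extra Q/DQ pair on the residual Add can be added in PyTorch (the class and attribute names below are illustrative, not the real YOLOv5 Bottleneck; the other Q/DQ nodes come from the automatic substitution above):

```python
# Hedged sketch: explicitly quantize the skip-connection input of the residual Add so
# that both Add inputs carry Q/DQ after ONNX export. Module layout is illustrative.
import torch.nn as nn
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

class QuantBottleneck(nn.Module):
    def __init__(self, cv1, cv2):
        super().__init__()
        self.cv1, self.cv2 = cv1, cv2  # the two convs of the bottleneck (assumed)
        # Extra quantizer for the skip branch feeding the Add
        self.residual_quantizer = quant_nn.TensorQuantizer(
            QuantDescriptor(num_bits=8, calib_method="histogram"))

    def forward(self, x):
        return self.residual_quantizer(x) + self.cv2(self.cv1(x))
```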
@nzmora-nvidia Thank you for your answer. I had already found this out myself, and I also added even more Q/DQ nodes, as follows:
1. Q/DQ on both inputs of the Add in the bottleneck (as you showed) has been added, plus extra Q/DQ on the Concat in the C3 layer, but it looks like something is still failing. The input of each bottleneck (i.e. the output of the previous Conv) is split, and the PWN(Sigmoid, Mul) of that previous Conv is not fused into the Conv. The output of the PWN is then split and runs in FP16! Why is this, and how can I fix it? So everywhere a Conv output feeds a bottleneck block, the PWN is not fused (the result is split) and thus runs in FP16 instead of INT8.
Example of a C3 block that includes a bottleneck block
Q/DQ on both inputs of the Add in the bottleneck, but now the Conv with PWN(Sigmoid, Mul) fails:
Quantizing the Resize (upsample) layer failed: it is correct in ONNX, but not in TRT. Why?
Hi @Michelvl92, please send the SVG images and ONNX files as attachments, because it's too hard to follow the thread like this. Are the results here from the latest TRT 8.6.x?
@nzmora-nvidia, I have removed the large SVG/PNG graphs from the previous posts, and updated the question and problem.
The results are currently with TRT 8.4.2.4.
Thanks @Michelvl92, I think that TRT 8.6 will solve some of the issues. Can you give it a try?
@nzmora-nvidia, I have tried with TRT 8.6.1.6-1+cuda12.0.
See TRT graph:
Thanks @Michelvl92, Indeed there are fusions that at face-value should occur and don't. I can't find the attachment of the ONNX file itself so I can't debug - please attach.
I suspect something unexpected is going on with the scales of some of the QDQ nodes. Can you verify that the scales of "/model/model/model.2/m/m.0/cv1/conv/_input_quantizer/QuantizeLinear" and "/model/model/model.2/m/m.0/addop/_input0_quantizer/QuantizeLinear" are the same?
BTW, TRT graphs in SVG format have more information in them (i.e. pop-up when hovering over an operator).
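A quick way to do that check (a hedged sketch; it assumes the Q/DQ scales are stored as graph initializers and that the file name matches your export):

```python
# Compare the y_scale inputs of two QuantizeLinear nodes in the exported ONNX graph.
import onnx
from onnx import numpy_helper

model = onnx.load("yolov5s6_qdq.onnx")  # file name is an assumption
inits = {init.name: numpy_helper.to_array(init) for init in model.graph.initializer}

def q_scale(node_name):
    node = next(n for n in model.graph.node if n.name == node_name)
    return inits[node.input[1]]  # input[1] of QuantizeLinear is y_scale

a = q_scale("/model/model/model.2/m/m.0/cv1/conv/_input_quantizer/QuantizeLinear")
b = q_scale("/model/model/model.2/m/m.0/addop/_input0_quantizer/QuantizeLinear")
print(a, b, "match" if float(a) == float(b) else "differ")
```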
@nzmora-nvidia thanks for your reply. With onnx-graphsurgeon I have removed and reconnected the double Q/DQ nodes, and it now looks correct, as you can see. Almost all layers are now INT8; only the last ones at the end are not, which is not a big problem.
Do you see any points to optimize? The total speedup of the INT8 model compared to the FP16 model is only 14%; is this what you would expect? I would have expected a speedup of around 40%.
Do you see how I can get a better speed improvement?
Thanks!
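For anyone hitting the same issue, this is roughly the kind of onnx-graphsurgeon edit being described (file names and the node-selection criterion are hypothetical; the actual script was not posted):

```python
# Hedged sketch: bypass a redundant Q/DQ pair by rewiring its consumers to the tensor
# that fed the extra QuantizeLinear, then clean up the now-dangling Q/DQ nodes.
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("yolov5s6_qdq.onnx"))

def bypass_qdq(q_node):
    """Reconnect the consumers of a Q->DQ pair directly to the Q node's input tensor."""
    dq_node = q_node.outputs[0].outputs[0]   # the DequantizeLinear consuming the Q output
    src = q_node.inputs[0]                   # tensor that fed the redundant Q
    for consumer in list(dq_node.outputs[0].outputs):
        consumer.inputs = [src if t is dq_node.outputs[0] else t for t in consumer.inputs]

# Hypothetical selection: pick the duplicated QuantizeLinear nodes by name.
for q in [n for n in graph.nodes if n.op == "QuantizeLinear" and "duplicate" in n.name]:
    bypass_qdq(q)

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "yolov5s6_qdq_fixed.onnx")
```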
@Michelvl92 w/o the ONNX I cannot debug, so I'm letting you know the problems that I see in the output engine, but I cannot be sure of their root cause.
- Check if /model/model/model.33/detectquantinput/QuantizeLinear and /model/model/model.24/conv/_input_quantizer/QuantizeLinear have the same scale value. If they don't, that would explain why they don't fuse.
- The extra scale ops are unfused Q/DQ operators (e.g. the two Q operators in the bullet above and /model/model/model.33/detectquantinput/DequantizeLinear).
- /model/model/model.33/m.0/Conv looks problematic. Check if you quantized it correctly.
- These two issues repeat several times in the lower part of the graph, and that is why you see some non-quantized layers.
- Quantizing MaxPooling and Resize for small shapes (like yours) may not help much (GPU utilization), but you get the benefit of the Conv outputs being quantized, so there is an overall benefit.
@nzmora-nvidia Clear, fixed the issue.
Could you give me some general pointers for how you handle explicit fp32 operations present in the graph? I am using netron to visualize the quantized graph and I see a lot of mul, add, pow, transpose, mean_reduce operations in fp32.
Description
I tried to convert a YOLOv5s ONNX INT8 model to a TensorRT INT8 model, but the pointwise operations are not converted to INT8 as I would expect. By default, they are converted to FP32 precision, which is unexpected. This means that the inference latency of an "int8" model is higher than the inference latency of an FP16 model.
My goal is to reduce the inference latency of a YOLOv5s FP16 model by quantizing it to INT8; I expect that this can reduce the inference latency by up to 2x. What can I do to further reduce the inference latency of the INT8 model?
Environment
I use the following Nvidia docker container:
nvcr.io/nvidia/pytorch:22.08-py3
All the specs can be found in the framework matrix here.
NVIDIA GPU: RTX 3090
NVIDIA Driver Version: 535.54.03
TensorRT: 8.4.2.4
CUDA Version: NVIDIA CUDA 11.7 Update
CUDNN Version: 8.5.0.96
Operating System: Ubuntu 20.04
Python Version: python 3.8
PyTorch version: 1.13.0a0+d321be6
Container:
nvcr.io/nvidia/pytorch:22.08-py3
Steps To Reproduce
With config.set_flag(trt.BuilderFlag.FP16) added, the pointwise layers in the model are now FP16 (see attached graph 2). Why are the pointwise layers not converted to INT8 precision? This makes the model far slower.
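For completeness, a minimal sketch of the engine build being described, using the TensorRT Python API (the ONNX/engine file names are assumptions; error handling is trimmed):

```python
# Build an engine from a Q/DQ ONNX with both INT8 and FP16 enabled, so layers that
# cannot run in INT8 fall back to FP16 instead of FP32.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("yolov5s6_qdq.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # honor the explicit Q/DQ (explicit quantization)
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 fallback for non-quantized layers
serialized = builder.build_serialized_network(network, config)

with open("yolov5s6_int8.engine", "wb") as f:
    f.write(serialized)
```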
Relevant Files
Yolov5s6 "int8" model with int8 conversion options TRT graph
Yolov5s6 "int8" model with fp16/int8 conversion options TRT graph