@ttyio ^ ^
For the 1st case, TRT can fuse conv + bn + relu into a single conv, and fuse pad + conv into conv, so by wrapping the 3 nodes inside DQ and Q we get an INT8-IN-INT8-OUT conv.
For the 2nd case, TRT can fuse pad + conv into a single conv, so the DQ should be placed before the pad node.
For the 3rd case, the concat can be erased depending on whether the previous conv implementation supports strided output. No need to worry about the Q after the concat: Q commutes with concat, and TRT can swap them for you.
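(To illustrate that last point, here is a minimal sanity check in plain PyTorch fake-quantization, not TRT code; it assumes both concat inputs share one scale, 0.05 here, which is the case in which quantization and concat commute exactly:)

    import torch

    def fake_quant(x, scale):
        # symmetric int8 fake-quantization: quantize then dequantize
        return torch.clamp(torch.round(x / scale), -128, 127) * scale

    a, b = torch.randn(1, 8, 4, 4), torch.randn(1, 8, 4, 4)
    scale = 0.05  # assumed shared scale for both inputs

    q_then_cat = torch.cat([fake_quant(a, scale), fake_quant(b, scale)], dim=1)
    cat_then_q = fake_quant(torch.cat([a, b], dim=1), scale)

    print(torch.equal(q_then_cat, cat_then_q))  # True: Q commutes with concat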
Thank you. Your explanation deepened my understanding of QAT. I still need to ask how to put the DQ before the pad. When reading the code of the pytorch-quantization module, I saw that replacing nn.Conv2d with pytorch_quantization's quant_nn.QuantConv2d inserts Q/DQ before the Conv, but I can't find an example of how to insert DQ before a Pad for reference.
Your "before" is a little ambiguous. It may be better to understand that it is better to move Q up and let DQ under the Pad? Similar to Q ->Pad ->DQ ->Conv, is this understanding correct?
Q/DQ always appear in pairs, so the correct sequence is Q -> DQ -> Pad -> Conv.
To insert Q/DQ before the Pad, you need to explicitly insert a quant_nn.TensorQuantizer before the pad. Could you follow this ResNet sample: https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/tutorials/quant_resnet50.html#further-optimization ?
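(A minimal sketch of that suggestion, using the pytorch-quantization quant_nn API; the channel counts and stride below are placeholders. Note that QuantConv2d also carries its own default input quantizer, which will still emit a Q/DQ between the pad and the conv; that point comes up further down the thread.)

    import torch.nn as nn
    from pytorch_quantization import nn as quant_nn

    # Explicit quantizer in front of the pad, so the exported ONNX graph reads
    # Q -> DQ -> Pad -> Conv and TRT can fuse the Pad into the INT8 conv.
    block = nn.Sequential(
        quant_nn.TensorQuantizer(quant_nn.QuantConv2d.default_quant_desc_input),
        nn.ZeroPad2d(1),
        quant_nn.QuantConv2d(64, 64, 3, stride=1, padding=0, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(),
    )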
Thank you, I will try. Another question: I don't understand how TensorRT performs fusion, specifically this sentence: "TRT can fuse conv + bn + relu into single conv, fuse pad + conv into conv, so wrapper the 3 nodes inside DQ and Q, we can get INT8-IN-INT8-OUT conv".
The following figure is part of the network structure, and this is the corresponding log when TensorRT performs the fusion:
TensorRT VERBOSE Running: QuantizeDoubleInputNodes on Conv_135
TensorRT VERBOSE QuantizeDoubleInputNodes: fusing QuantizeLinear_140 into Conv_135
TensorRT VERBOSE QuantizeDoubleInputNodes: fusing (DequantizeLinear_131 and DequantizeLinear_134) into Conv_135
TensorRT VERBOSE Removing QuantizeLinear_140
TensorRT VERBOSE Removing DequantizeLinear_131
TensorRT VERBOSE Removing DequantizeLinear_134
reference https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
My question is: why are the two DQs folded into the Conv, and what is the calculation order?
According to the int8 calculation principle above, the input DQ and weight DQ scales are used to rescale the int8 accumulation result (DP4A) to fp32, BN and ReLU are then computed in fp32, and finally the Q rescales the activation produced by the ReLU back to int8. Is this understanding correct?
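(For intuition, a rough numeric sketch of that rescaling with arbitrarily chosen toy scales; bias, BN and ReLU are omitted. It uses the ONNX Q/DQ convention real = int8 * scale; the GTC slides define scale in the opposite direction, which is why the combined multiplier there is written as output_scale / (input_scale * weight_scale).)

    import numpy as np

    # Toy illustration of the INT8 requantization arithmetic: the int32 accumulator
    # of an int8 x int8 dot product is rescaled by (input_scale * weight_scale) back
    # to real values, and the output Q maps it to int8 with output_scale. TRT folds
    # this into a single multiplier applied to the accumulator.
    input_scale, weight_scale, output_scale = 0.02, 0.005, 0.04  # assumed toy scales

    x_int8 = np.random.randint(-128, 128, size=16, dtype=np.int32)
    w_int8 = np.random.randint(-128, 128, size=16, dtype=np.int32)

    acc_int32 = np.dot(x_int8, w_int8)                   # what DP4A accumulates
    real_value = acc_int32 * input_scale * weight_scale  # effect of the two DQs
    out_int8 = np.clip(np.round(real_value / output_scale), -128, 127)  # the Q after conv

    # Same result in one step with the folded multiplier:
    out_int8_fused = np.clip(np.round(acc_int32 * input_scale * weight_scale / output_scale), -128, 127)
    assert out_int8 == out_int8_fused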
I modified my code as follows:

    block = Sequential(
        # explicit input quantizer so the Q/DQ lands before the pad
        quant_nn.TensorQuantizer(quant_nn.QuantConv2d.default_quant_desc_input),
        nn.ZeroPad2d(1),
        quant_nn.QuantConv2d(inplanes, planes, 3, stride=stride, bias=False),
        build_norm_layer(self._norm_cfg, planes)[1],
        # nn.BatchNorm2d(planes, eps=1e-3, momentum=0.01),
        nn.ReLU6(),
    )
The obtained structure diagram is as follows. Some logs are as follows:

TensorRT VERBOSE Swapping DequantizeLinear_124 with Pad_139
Layer(Padding): Pad_139, Tactic: 0x0000000000000000, 780[Int8(1,64,200,200)] -> 831[Int8(1,64,202,202)]

The log of Pad being fused with conv was not found. Is this normal? Are the code and network structure I wrote above correct? Thank you
@ttyio
why the two DQs are integrated into Conv, and what is the calculation order?

Because the DQs before the conv and the Q after the conv supply the output_scale / (input_scale * weight_scale) term in the formula you attached from the slides, they need to be fused into the op.

The log of the combination of Pad and conv was not found

Could you try removing the Q/DQ between Pad and conv? Thanks!
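(In case it helps, one possible way to drop that Q/DQ without patching the toolkit source, a sketch assuming the Sequential block posted above and the TensorQuantizer enable/disable API:)

    # block = Sequential(TensorQuantizer, ZeroPad2d, QuantConv2d, norm, ReLU6) as above.
    # Disable the QuantConv2d's own input quantizer so no Q/DQ is emitted between
    # Pad and Conv; the explicit TensorQuantizer in front of the pad remains.
    block[2]._input_quantizer.disable()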
Thank you for your suggestion! After modifying "pytorch_quantization/nn/modules/quant_conv.py" and regenerating the engine, I managed to remove the Q/DQ between Pad and conv, and Pad is now merged with conv. The int8 inference is now faster than fp16 (before, the speeds were almost the same). The fp16 model now takes 28 ms and the int8 model 21 ms, which still doesn't seem to meet expectations.
The following is a structure diagram of the output layer
This is the corresponding log
Layer(CaskConvolution): model.bbox_head.tasks.0.reg.3.weight + QuantizeLinear_461 + Conv_463, Tactic: 0x91a0ba608d3e9886, 1305[Int8(1,64,200,200)] -> Reformatted Output Tensor 0 to model.bbox_head.tasks.0.reg.3.weight + QuantizeLinear_461 + Conv_463[Half(1,2,200,200)]
Layer(Reformat): Reformatting CopyNode for Output Tensor 0 to model.bbox_head.tasks.0.reg.3.weight + QuantizeLinear_461 + Conv_463, Tactic: 0x0000000000000000, Reformatted Output Tensor 0 to model.bbox_head.tasks.0.reg.3.weight + QuantizeLinear_461 + Conv_463[Half(1,2,200,200)] -> 1314[Float(1,2,200,200)]
Does this count as int8-in -> conv -> fp32-out? Is this log normal? Do you have any suggestions on how to further reduce the time? Thank you.
The environment I use is TensorRT 8.4 + CUDA 11.4, and the hardware is Xavier.
@ttyio
@Levi-zhan , could you try inserting a Q/DQ pair after the Conv in the quantization toolkit, and after that remove the final DQ in the graph using ONNX GraphSurgeon (https://github.com/NVIDIA/TensorRT/tree/main/tools/onnx-graphsurgeon)? The graph should look like ...->Q->DQ->Conv->Q->output.
Finally, let's build the TRT engine with reformat-free I/O and set the allowed output type to INT8 (see sample https://github.com/NVIDIA/TensorRT/tree/main/samples/sampleIOFormats). Ideally the final conv can then run INT8-IN-INT8-OUT. Thanks!
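(A rough sketch of the GraphSurgeon step, illustrative only; the file names are placeholders, and it assumes the graph ends in DequantizeLinear nodes that feed the graph outputs directly:)

    import onnx
    import onnx_graphsurgeon as gs

    graph = gs.import_onnx(onnx.load("model_qat.onnx"))  # placeholder file name

    # Bypass each trailing DequantizeLinear that feeds a graph output, so the graph
    # ends ...->Conv->Q->output and the network output stays INT8.
    output_names = [t.name for t in graph.outputs]
    for node in graph.nodes:
        if node.op == "DequantizeLinear" and node.outputs[0].name in output_names:
            idx = output_names.index(node.outputs[0].name)
            graph.outputs[idx] = node.inputs[0]  # note: the output tensor name changes

    graph.cleanup().toposort()  # removes the now-dangling DQ nodes
    onnx.save(gs.export_onnx(graph), "model_qat_int8_out.onnx")

After that, the output tensor type/format can be declared INT8 and reformat-free when building the engine, as in the sampleIOFormats link above.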
@ttyio Thank you for your suggestion, I'll try again. Do you have time to answer another issue? There is still a ConvTranspose that cannot be fused with Q/DQ, which confuses me a lot. Thanks! https://github.com/NVIDIA/TensorRT/issues/2597
I looked at the code in the link you sent. In that example, the int8-formatted classification output is used directly. But the model I use is more complex: it has 40 conv layers producing outputs. For example, the conv layer in the first picture outputs the xy offset of the box coordinates, and the conv layer in the second picture feeds into a sigmoid. These all require the input to be of float type. In this case, "reformat free IO" is not supported, right?
I don't know if this understanding is correct. Do you have any optimization suggestions? Thank you.
@ttyio
@Levi-zhan ,
for the conv-exp pattern in the picture you show, I doubt we can maintain accuracy if we insert an extra q/dq between the conv and the exp.
for the conv-sigmoid-* pattern, if we change it to conv-sigmoid-q/dq-*, there will be a dangling dq node that cannot be optimized away. You can run an experiment to check how the timing changes.
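(For that experiment, a minimal sketch of what a conv-sigmoid-q/dq head could look like on the PyTorch side; the module name and channel counts are made up, and it assumes the quant_nn API:)

    import torch.nn as nn
    from pytorch_quantization import nn as quant_nn

    class HeatmapHead(nn.Module):
        """Hypothetical conv-sigmoid head with an extra Q/DQ after the sigmoid."""
        def __init__(self, in_ch=64, out_ch=2):
            super().__init__()
            self.conv = quant_nn.QuantConv2d(in_ch, out_ch, 3, padding=1)
            self.sigmoid = nn.Sigmoid()
            # Extra fake-quant node: exports as Q/DQ after the sigmoid,
            # giving the conv-sigmoid-q/dq-* pattern discussed above.
            self.out_quant = quant_nn.TensorQuantizer(quant_nn.QuantConv2d.default_quant_desc_input)

        def forward(self, x):
            return self.out_quant(self.sigmoid(self.conv(x)))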
Thank you for your reply, I will try. I want to ask: in the "conv-sigmoid-q/dq-*" mode, will the sigmoid be fused with the conv when TensorRT generates the engine? Thank you.
@Levi-zhan , yes, it is fused into the conv; this can be confirmed after enabling the kVERBOSE log level. Thanks!
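(For reference, the verbose fusion logs can be obtained with trtexec --verbose, or in the TensorRT Python API by building with a verbose logger, e.g.:)

    import tensorrt as trt

    # A VERBOSE logger prints layer-fusion decisions (e.g. "fusing ... into Conv_...")
    # during engine build.
    logger = trt.Logger(trt.Logger.VERBOSE)
    builder = trt.Builder(logger)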
I have modified the output of the model as in the following figure, but the sigmoid is still not fused with the conv.
Do you know why? Thank you.
Hi @Levi-zhan , it also depends on the workload size, e.g. input channels and output channels; TRT applies some heuristics before making the fusion decision. Could you provide an ONNX file that contains only this subgraph for debugging? Thanks!
Closing since there has been no activity for more than 3 weeks; please reopen if you still have questions, thanks!
Description
According to the tutorial, in order to quantize more layers and get int8-in-int8-out, I need to follow the conv with a Q. But if the relu is followed by a Pad, what should I do?
For example, the first figure below gets int8-in-int8-out. But as shown in the second figure, how can I insert Q/DQ nodes when a Pad follows the relu? And how can I optimize the concat-related layers? Thanks
Environment
TensorRT Version:
NVIDIA GPU:
NVIDIA Driver Version:
CUDA Version:
CUDNN Version:
Operating System:
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Steps To Reproduce