NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

How to insert Q/DQ nodes between relu and pad? #2592

Closed Levi-zhan closed 1 year ago

Levi-zhan commented 1 year ago

Description

According to the tutorial, in order to quantize more layers and get int8-in-int8-out, I need to follow the conv with a Q. But if the relu is followed by a Pad, what should I do?

For example, the structure in the following figure can get int8-in-int8-out:

[image: 2023-01-06_17-06]

But as shown in the next figure, how can I insert Q/DQ nodes when a Pad follows the relu?

[image: 2023-01-06_17-07]

And how can I optimize the concat-related layers?

[image: 2023-01-06_17-07_1]

Thanks

Environment

TensorRT Version:
NVIDIA GPU:
NVIDIA Driver Version:
CUDA Version:
CUDNN Version:
Operating System:
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):

Relevant Files

Steps To Reproduce

zerollzeng commented 1 year ago

@ttyio ^ ^

ttyio commented 1 year ago

For the 1st case, TRT can fuse conv + bn + relu into a single conv and fuse pad + conv into conv, so wrap the 3 nodes inside DQ and Q and we get an INT8-IN-INT8-OUT conv.

For the 2nd case, TRT can fuse pad + conv into a single conv, so the DQ should be put before the pad node.

For the 3rd case, whether the concat can be erased depends on whether the previous conv implementation supports strided output. No need to worry about the Q after the concat, because Q commutes with concat and TRT can swap them for you.

JamesKobe23 commented 1 year ago

> For the 1st case, TRT can fuse conv + bn + relu into a single conv and fuse pad + conv into conv, so wrap the 3 nodes inside DQ and Q and we get an INT8-IN-INT8-OUT conv.
>
> For the 2nd case, TRT can fuse pad + conv into a single conv, so the DQ should be put before the pad node.
>
> For the 3rd case, whether the concat can be erased depends on whether the previous conv implementation supports strided output. No need to worry about the Q after the concat, because Q commutes with concat and TRT can swap them for you.

Thank you. Your explanation deepens my understanding of QAT. I still need to ask how to put the DQ before the pad. From reading the code of the pytorch-quantization module, I see that replacing nn.Conv2d with pytorch_quantization.quant_nn.QuantConv2d inserts Q/DQ before the Conv, but I can't find an example of how to insert DQ before a Pad for reference.

JamesKobe23 commented 1 year ago

> For the 1st case, TRT can fuse conv + bn + relu into a single conv and fuse pad + conv into conv, so wrap the 3 nodes inside DQ and Q and we get an INT8-IN-INT8-OUT conv.
>
> For the 2nd case, TRT can fuse pad + conv into a single conv, so the DQ should be put before the pad node.
>
> For the 3rd case, whether the concat can be erased depends on whether the previous conv implementation supports strided output. No need to worry about the Q after the concat, because Q commutes with concat and TRT can swap them for you.

Your "before" is a little ambiguous. It may be better to understand that it is better to move Q up and let DQ under the Pad? Similar to Q ->Pad ->DQ ->Conv, is this understanding correct?

ttyio commented 1 year ago

Q/DQ always appear in pairs, so the correct sequence is Q->DQ->Pad->Conv.

To insert Q/DQ before the Pad, you need to explicitly insert quant_nn.TensorQuantizer before the pad. Could you follow this resnet sample: https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/tutorials/quant_resnet50.html#further-optimization ?
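
A minimal sketch of what that explicit quantizer looks like in a pad + conv block (not code from this thread; `inplanes`, `planes`, `stride` and the surrounding layers are placeholders):

```python
# Hedged sketch: an explicit TensorQuantizer in front of the pad so TensorRT sees
# Q -> DQ -> Pad -> Conv and can fuse the pad into the conv. Layer sizes are placeholders.
import torch.nn as nn
from pytorch_quantization import quant_nn

def make_block(inplanes: int, planes: int, stride: int = 1) -> nn.Sequential:
    return nn.Sequential(
        quant_nn.TensorQuantizer(quant_nn.QuantConv2d.default_quant_desc_input),  # Q/DQ before the pad
        nn.ZeroPad2d(1),
        quant_nn.QuantConv2d(inplanes, planes, 3, stride=stride, bias=False),
        nn.BatchNorm2d(planes),
        nn.ReLU(),
    )
```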

Levi-zhan commented 1 year ago

> Q/DQ always appear in pairs, so the correct sequence is Q->DQ->Pad->Conv.
>
> To insert Q/DQ before the Pad, you need to explicitly insert quant_nn.TensorQuantizer before the pad. Could you follow this resnet sample: https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/tutorials/quant_resnet50.html#further-optimization ?

Thank you, I will try. Another question: I don't understand how TensorRT does the fusion, specifically this sentence: "TRT can fuse conv + bn + relu into a single conv and fuse pad + conv into conv, so wrap the 3 nodes inside DQ and Q and we get an INT8-IN-INT8-OUT conv".

The following figure is part of the network structure

[image: network structure excerpt]

and this is the corresponding log when TensorRT performs the fusion:

    TensorRT VERBOSE Running: QuantizeDoubleInputNodes on Conv_135
    TensorRT VERBOSE QuantizeDoubleInputNodes: fusing QuantizeLinear_140 into Conv_135
    TensorRT VERBOSE QuantizeDoubleInputNodes: fusing (DequantizeLinear_131 and DequantizeLinear_134) into Conv_135
    TensorRT VERBOSE Removing QuantizeLinear_140
    TensorRT VERBOSE Removing DequantizeLinear_131
    TensorRT VERBOSE Removing DequantizeLinear_134

Reference: https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf

[image: slide from the linked presentation]

My question is: why are the two DQs integrated into the Conv, and what is the calculation order?

According to the int8 calculation principle above, the input DQ and weight DQ scales are used to scale the int8 (DP4A) result to fp32 for the BN and ReLU computation, and then the Q scales the activation produced by the ReLU back to int8. Is this understanding correct?

Levi-zhan commented 1 year ago


I modified my code as follows:

    block = Sequential(
        quant_nn.TensorQuantizer(quant_nn.QuantConv2d.default_quant_desc_input),
        nn.ZeroPad2d(1),
        # nn.Conv2d(inplanes, planes, 3, stride=stride, bias=False),
        quant_nn.QuantConv2d(inplanes, planes, 3, stride=stride, bias=False),
        build_norm_layer(self._norm_cfg, planes)[1],
        # nn.BatchNorm2d(planes, eps=1e-3, momentum=0.01),
        nn.ReLU6(),
    )

The obtained structure diagram is as follows:

[image: exported structure diagram]

Some logs are as follows:

    TensorRT VERBOSE Swapping DequantizeLinear_124 with Pad_139
    Layer(Padding): Pad_139, Tactic: 0x0000000000000000, 780[Int8(1,64,200,200)] -> 831[Int8(1,64,202,202)]

The log for the fusion of Pad and conv was not found. Is this normal? Are the code and network structure I wrote above correct? Thank you

Levi-zhan commented 1 year ago

@ttyio

ttyio commented 1 year ago

> why are the two DQs integrated into the Conv, and what is the calculation order?

Because the DQs before the conv and the Q after the conv provide the output_scale / (input_scale * weight_scale) factor in the formula you attached from the slides, they need to be fused into the op.
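
As a rough worked example of that scale folding (values and variable names are made up, not from this thread):

```python
# Hedged numeric sketch of how the DQ/Q scales collapse into one requantization factor
# applied to the INT32 accumulator of an INT8 dot product (DP4A-style).
import numpy as np

input_scale, weight_scale, output_scale = 0.1, 0.05, 0.02   # made-up per-tensor scales

x_q = np.array([10, -20, 30], dtype=np.int8)   # quantized activations
w_q = np.array([4, 2, -1], dtype=np.int8)      # quantized weights
acc = int(x_q.astype(np.int32) @ w_q.astype(np.int32))      # INT8 * INT8 accumulated in INT32

# The DQs before the conv and the Q after it fold into a single factor:
requant = input_scale * weight_scale / output_scale
y_q = np.int8(np.clip(round(acc * requant), -128, 127))     # INT8-in, INT8-out result
```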

> The log for the fusion of Pad and conv was not found

Could you try removing the Q/DQ between Pad and conv?

Thanks!
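
One possible way to remove that Q/DQ without patching the toolkit source (my suggestion, not confirmed in this thread) is to disable the input quantizer that QuantConv2d attaches, so only the explicit TensorQuantizer in front of the pad remains:

```python
# Hedged sketch: QuantConv2d carries its own _input_quantizer; disabling it removes the
# Q/DQ between the Pad and the Conv while keeping the weight quantizer active.
# Treat this as an untested alternative to editing quant_conv.py.
from pytorch_quantization import quant_nn

conv = quant_nn.QuantConv2d(64, 64, 3, stride=1, bias=False)
conv._input_quantizer.disable()   # no Q/DQ emitted between Pad and Conv on ONNX export
```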

Levi-zhan commented 1 year ago

> > why are the two DQs integrated into the Conv, and what is the calculation order?
>
> Because the DQs before the conv and the Q after the conv provide the output_scale / (input_scale * weight_scale) factor in the formula you attached from the slides, they need to be fused into the op.
>
> > The log for the fusion of Pad and conv was not found
>
> Could you try removing the Q/DQ between Pad and conv?
>
> Thanks!

Thank you for your suggestion! After modifying pytorch_quantization/nn/modules/quant_conv.py and regenerating the engine, the Q/DQ between Pad and conv is removed and the Pad is now merged with the conv. The int8 inference is now faster than fp16 (before, the speeds were almost the same). The fp16 model now takes 28 ms and the int8 model 21 ms, which still doesn't seem to meet expectations.

The following is a structure diagram of the output layer

[image: output-layer structure diagram]

This is the corresponding log

    Layer(CaskConvolution): model.bbox_head.tasks.0.reg.3.weight + QuantizeLinear_461 + Conv_463, Tactic: 0x91a0ba608d3e9886, 1305[Int8(1,64,200,200)] -> Reformatted Output Tensor 0 to model.bbox_head.tasks.0.reg.3.weight + QuantizeLinear_461 + Conv_463[Half(1,2,200,200)]
    Layer(Reformat): Reformatting CopyNode for Output Tensor 0 to model.bbox_head.tasks.0.reg.3.weight + QuantizeLinear_461 + Conv_463, Tactic: 0x0000000000000000, Reformatted Output Tensor 0 to model.bbox_head.tasks.0.reg.3.weight + QuantizeLinear_461 + Conv_463[Half(1,2,200,200)] -> 1314[Float(1,2,200,200)]

Does this count as int8-in -> conv -> fp32-out? Is this log normal? Do you have any suggestions on how to further reduce the time? Thank you.

The environment I use is TensorRT 8.4 + CUDA 11.4, and the hardware is Xavier.

Levi-zhan commented 1 year ago

@ttyio

ttyio commented 1 year ago

@Levi-zhan, could you try inserting a Q/DQ pair after the Conv in the quantization toolkit, and then remove the final DQ in the graph using ONNX GraphSurgeon (https://github.com/NVIDIA/TensorRT/tree/main/tools/onnx-graphsurgeon)? The graph should look like ...->Q->DQ->Conv->Q->output.

Finally, let's build the TRT engine with reformat-free I/O and set the allowed output type to INT8 (see the sample https://github.com/NVIDIA/TensorRT/tree/main/samples/sampleIOFormats). Ideally the final conv can then run INT8-IN-INT8-OUT. Thanks!
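
A possible sketch of the GraphSurgeon step (node and file names are assumptions; the exact output wiring depends on the model):

```python
# Hedged sketch: re-wire the graph output past the last DequantizeLinear so the graph
# ends in ...-> Conv -> Q -> output, then let cleanup() drop the detached DQ node.
import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))
out = graph.outputs[0]

# Find the node producing the current graph output.
producer = next(n for n in graph.nodes if out in n.outputs)
if producer.op == "DequantizeLinear":
    new_out = producer.inputs[0]          # the INT8 tensor feeding the DQ
    new_out.dtype = np.int8               # graph outputs need an explicit dtype
    graph.outputs[0] = new_out
    producer.outputs.clear()              # detach the DQ so cleanup removes it

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_int8_out.onnx")
```

The engine's output tensor would then also need to be allowed as INT8 on the TensorRT side, as in the sampleIOFormats link above.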

JamesKobe23 commented 1 year ago

> @Levi-zhan, could you try inserting a Q/DQ pair after the Conv in the quantization toolkit, and then remove the final DQ in the graph using ONNX GraphSurgeon (https://github.com/NVIDIA/TensorRT/tree/main/tools/onnx-graphsurgeon)? The graph should look like ...->Q->DQ->Conv->Q->output.
>
> Finally, let's build the TRT engine with reformat-free I/O and set the allowed output type to INT8 (see the sample https://github.com/NVIDIA/TensorRT/tree/main/samples/sampleIOFormats). Ideally the final conv can then run INT8-IN-INT8-OUT. Thanks!

@ttyio Thank you for your suggestion, I'll try again. Do you have time to answer another issue? There is still a ConvTranspose that cannot be fused with Q/DQ, which confuses me. Thanks! https://github.com/NVIDIA/TensorRT/issues/2597

Levi-zhan commented 1 year ago

> @Levi-zhan, could you try inserting a Q/DQ pair after the Conv in the quantization toolkit, and then remove the final DQ in the graph using ONNX GraphSurgeon (https://github.com/NVIDIA/TensorRT/tree/main/tools/onnx-graphsurgeon)? The graph should look like ...->Q->DQ->Conv->Q->output.
>
> Finally, let's build the TRT engine with reformat-free I/O and set the allowed output type to INT8 (see the sample https://github.com/NVIDIA/TensorRT/tree/main/samples/sampleIOFormats). Ideally the final conv can then run INT8-IN-INT8-OUT. Thanks!

I saw the code in the link you sent. In that example, the classification output is consumed directly in int8 format. But the model I use is more complex: it has 40 conv layers producing outputs. For example, the output of the conv layer in the first picture is the xy offset of the box coordinates, and the conv layer in the second picture feeds into a sigmoid. These all require float inputs downstream. In that case, "reformat-free I/O" is not supported, right?

[images: the two output-head subgraphs]

I don't know if this understanding is correct. Do you have any optimization suggestions? Thank you.

Levi-zhan commented 1 year ago

@ttyio

ttyio commented 1 year ago

@Levi-zhan, for the conv-exp pattern in the picture you show, I doubt we can maintain accuracy if we insert an extra Q/DQ between the conv and the exp. For the conv-sigmoid-* pattern, if we change it to conv-sigmoid-q/dq-*, there will be a dangling DQ node that cannot be optimized away. You can run an experiment to check the timing change.

Levi-zhan commented 1 year ago

Thank you for your reply, I will try. I want to ask: in the "conv-sigmoid-q/dq-*" pattern, will the sigmoid be fused with the conv when TensorRT generates the engine? Thank you.

ttyio commented 1 year ago

@Levi-zhan, yes, it is fused into the conv. This can be confirmed after enabling the kVERBOSE log level, thanks!
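
For reference, a minimal Python sketch of a build with the verbose logger so those fusion messages appear (file path and flags are assumptions about this particular setup):

```python
# Hedged sketch: build an engine with a VERBOSE logger so per-layer fusion decisions
# (e.g. a Sigmoid being fused into the preceding Conv) show up in the build log.
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)   # explicit Q/DQ networks still need the INT8 flag
config.set_flag(trt.BuilderFlag.FP16)
serialized_engine = builder.build_serialized_network(network, config)
```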

Levi-zhan commented 1 year ago

I have modified the output of the model as shown in the following figure, but the sigmoid is still not fused with the conv.

[image: modified output subgraph]

Do you know why? Thank you.

ttyio commented 1 year ago

Hi @Levi-zhan, it also depends on the workload size, e.g. input channels and output channels; TRT applies some heuristics before making the fusion decision. Could you provide an ONNX file that contains only this subgraph for debugging? Thanks!

ttyio commented 1 year ago

Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!