Closed: @JosephChenHub closed this issue 3 years ago.
Hello @JosephChenHub, TRT can fuse conv + relu + add together, but the fused output and the residual have to be the same data type, and the monkey patch in QAT cannot recognize this conv + relu + residual pattern. Could you follow the linked code to add quant in the residual? Thanks!
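The suggestion above can be sketched as follows. This is only an illustration: the `FakeQuant` module is a hand-rolled stand-in for pytorch_quantization's `TensorQuantizer`, the `amax` value is a made-up placeholder, and the conv's own input/weight quantizers are omitted to highlight just the residual part.

```python
import torch
import torch.nn as nn

class FakeQuant(nn.Module):
    """Stand-in for pytorch_quantization's TensorQuantizer: symmetric
    per-tensor int8 fake quantization with a fixed, made-up amax."""
    def __init__(self, amax=2.0):
        super().__init__()
        self.amax = amax

    def forward(self, x):
        scale = self.amax / 127.0
        return torch.clamp(torch.round(x / scale), -127.0, 127.0) * scale

class QuantResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()
        # The extra quantizer on the skip connection: both Add inputs are
        # then int8, so TRT can fuse conv + relu + add.
        self.residual_quantizer = FakeQuant(amax=2.0)

    def forward(self, x):
        out = self.relu(self.conv(x))
        return out + self.residual_quantizer(x)
```

With pytorch_quantization installed, `FakeQuant` would be replaced by `TensorQuantizer`, whose amax is learned during calibration rather than fixed.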
Hi, thanks! I added the quant in the residual, but it still fails to be converted into INT8 mode. Does the concat affect this fusion?
Hello @JosephChenHub, yes, the concat here would break the fusion pattern I mentioned before. Is it a different case? I did not see a concat in the original description.
I see, the concat is indeed a copy operation in TRT and it falls back to FP32 mode. How can I prevent this?
[TensorRT] VERBOSE: Layer(ElementWise): Add_132, Tactic: 1, 451[Float(32,160,160)], Conv_142 + Relu_149 || Conv_84 + Relu_91[Float(32,160,160)] -> 478[Float(32,160,160)]
[TensorRT] VERBOSE: Layer(Reformat): 477 copy, Tactic: 0, Conv_142 + Relu_149 || Conv_84 + Relu_91[Float(32,160,160)] -> 478[Float(32,160,160)]
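One way to keep the concat copy in the log above in INT8 is to give every concat input the same scale, so the copy does not need an fp32 reformat. A rough sketch of the idea; the `fake_quant` helper, tensor shapes, and amax choice are illustrative, not TRT's actual implementation:

```python
import torch

def fake_quant(x, amax, bound=127):
    # symmetric per-tensor int8 fake quantization (illustrative helper,
    # standing in for a Q/DQ pair in the exported graph)
    scale = amax / bound
    return torch.clamp(torch.round(x / scale), -bound, bound) * scale

# hypothetical concat inputs with different dynamic ranges
a = torch.randn(1, 16, 8, 8) * 2.0
b = torch.randn(1, 16, 8, 8) * 0.5

# Using one shared amax for every concat input puts all branches on the
# same int8 grid, so the concat copy can stay in int8 instead of being
# reformatted to fp32.
amax = torch.max(a.abs().max(), b.abs().max())
cat = torch.cat([fake_quant(a, amax), fake_quant(b, amax)], dim=1)
```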
Hello @JosephChenHub , do you have the full verbose log that I can check? thanks!
Hi @ttyio, I inserted the quantizers into the Add and Concat, and the evaluation result is 0.250 mAP/544 FPS, while the result of TRT is 0.327 mAP/544 FPS. I guess the degradation results from the Add or Concat, since they have two different range scales, as shown in the following graph.
OK, I have solved this issue, and interestingly the result reaches 0.340 mAP/544 FPS (TRT: 0.327/544).
Cool @JosephChenHub! For the concat question, TRT supports per-channel scales internally for concat. May I know how you fixed the issue? Thanks!
Hello, I met the same issue as you. Can you share the way you solved it? May I have your QQ or WeChat to learn from you?
Hi, would you share how you solved this problem?
Does someone have ideas about dealing with the different scales before the Add node?
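For the different-scales question above: the integer codes of two quantized inputs can only be added directly when they share a scale; otherwise one input must be rescaled first. A pure-Python sketch with made-up values (not TRT's actual kernel, which can do the rescaling in the integer domain):

```python
def quantize(x, amax, bound=127):
    # symmetric quantization: scale maps amax onto the int8 bound (127)
    scale = amax / bound
    q = max(-bound, min(bound, round(x / scale)))
    return q, scale

def dequantize(q, scale):
    return q * scale

# two Add inputs with different (made-up) dynamic ranges
x, amax_x = 1.25, 2.5
y, amax_y = 0.60, 1.2

qx, sx = quantize(x, amax_x)
qy, sy = quantize(y, amax_y)

# adding the raw integer codes is wrong when the scales differ
wrong = dequantize(qx + qy, sx)

# one fix: share a single amax (e.g. the max of the two) for both input
# quantizers, so the integer codes live on the same grid and add directly
amax = max(amax_x, amax_y)
qx2, s = quantize(x, amax)
qy2, _ = quantize(y, amax)
z = dequantize(qx2 + qy2, s)  # close to the exact x + y = 1.85
```

Sharing one amax across both Add inputs is also one plausible explanation for the accuracy fix reported earlier in the thread, though the original poster did not confirm the details.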
Description
Currently, we did comparison experiments with the following two settings: with pytorch_quantization, we can parse the exported onnx model with Q/DQ nodes to generate the calibration table and then obtain the TensorRT engine. However, we observe that the inference speed gets worse than FP16 in method 1, where TRT means the engine generated via method 2 and QDQ refers to the engine generated by method 1.

By checking the log, we find that some layers of method 1 still remain in FP32, e.g. Conv_101 + Relu_108 and Add are still in FP32 mode, and Conv_101 is the first conv block as shown in the exported onnx model. So the questions are:

1. Why does Conv_101 + Relu_108 fail to be converted into INT8 mode?
2. How is the Add operator implemented in INT8 mode, as the following log of method 2 shows? For example, z = x + y, where max(abs(x)) = 2.5 and max(abs(y)) = 2.5: we can add the quantized numbers first and then dequantize the result, but what if they have different maximum ranges?

Environment
TensorRT Version: 7.2.1
NVIDIA GPU: GTX 2080Ti
NVIDIA Driver Version: 440
CUDA Version: 10.2
CUDNN Version: 8.0
Operating System: Ubuntu 18.04
Python Version (if applicable): 3.6
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.7.1
Baremetal or Container (if so, version):
Relevant Files
part of the exported onnx model with Q/DQ nodes: