Closed: joytsay closed this issue 2 years ago
I am encountering a similar performance drop from FP16 to INT8 with an LSTM model.
TensorRT Version: 8.2.0.6 EA
GPU Type: A100
Nvidia Driver Version: 455.23.05
CUDA Version: 11.1
CUDNN Version: 8.0
Operating System + Version: CentOS 8.1
Python Version (if applicable): 3.7
PyTorch Version (if applicable): 1.8
After profiling the program with the nsys tool, I found that the INT8 quantized model is not using Tensor Core kernels. Maybe that is why INT8 runs slower than FP16? Any possible reasons, or suggestions on how to enable Tensor Cores with an INT8 quantized LSTM model? Thanks in advance.
In my understanding, the A100 GPUs I am using are supposed to support INT8 Tensor Cores.
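For reference, below is a minimal sketch of the kind of engine-build script involved, using the TensorRT Python API; the ONNX file name is a placeholder, and both the INT8 and FP16 flags are enabled so the builder can fall back to FP16 kernels where INT8 tactics are not faster. This is an illustrative sketch, not the exact script used in this issue.

```python
# Sketch: build an INT8 engine from an ONNX model with the TensorRT Python API.
# "model.onnx" is a placeholder.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30      # 1 GiB
config.set_flag(trt.BuilderFlag.INT8)    # allow INT8 tactics
config.set_flag(trt.BuilderFlag.FP16)    # allow FP16 fallback for layers that are slower in INT8
# config.int8_calibrator = my_calibrator # hypothetical; only needed for PTQ models without Q/DQ nodes

engine_bytes = builder.build_serialized_network(network, config)
with open("model_int8.engine", "wb") as f:
    f.write(engine_bytes)
```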
Regarding https://github.com/NVIDIA/TensorRT/blob/master/tools/pytorch-quantization/examples/calibrate_quant_resnet50.ipynb: it is a basic sample that uses automatic QAT. For ResNet there is an advanced topic on how to speed up inference; please check https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/tutorials/quant_resnet50.html#further-optimization
There is also a more detailed document from the TRT perspective: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#qdq-placement-recs
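For readers landing here, the calibration flow in that notebook boils down to roughly the following sketch; the calibration data below is random and only stands in for a real loader, and the percentile settings are illustrative.

```python
# Sketch of the PTQ/calibration flow from the linked notebook.
import torch
import torchvision
from pytorch_quantization import calib, quant_modules
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# Histogram calibration for activations, as in the notebook.
quant_desc_input = QuantDescriptor(calib_method="histogram")
quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)

quant_modules.initialize()  # replace torch.nn layers with quantized counterparts
model = torchvision.models.resnet50(pretrained=True).cuda().eval()

# Placeholder calibration data: a few random batches stand in for a real loader.
data_loader = [(torch.randn(8, 3, 224, 224), None) for _ in range(4)]

# 1) Collect statistics with quantization disabled and calibration enabled.
for _, module in model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        if module._calibrator is not None:
            module.disable_quant()
            module.enable_calib()
        else:
            module.disable()

with torch.no_grad():
    for images, _ in data_loader:
        model(images.cuda())

# 2) Load amax from the collected statistics and re-enable quantization.
for _, module in model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        if module._calibrator is not None:
            if isinstance(module._calibrator, calib.MaxCalibrator):
                module.load_calib_amax()
            else:
                module.load_calib_amax("percentile", percentile=99.99)
            module.enable_quant()
            module.disable_calib()
        else:
            module.enable()

model.cuda()  # make sure the freshly created amax buffers live on the GPU
```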
@ttyio Thanks for the reply. I have already tried the QAT sample for ResNet, and the method mentioned in the advanced topic does indeed speed up inference. But LSTM is a completely different story: since there is only one LSTM layer in my model, there is no opportunity to optimize the quantization nodes (like the residual-add optimization for ResNet) when using the pytorch-quantization tool.
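The export step from the toolkit tutorial is roughly the following sketch; here `model` stands for the calibrated network from the earlier sketch, and the input shape is a placeholder.

```python
# Sketch of the ONNX export step from the toolkit tutorial; `model` stands for
# the calibrated network and the input shape is a placeholder.
import torch
from pytorch_quantization import nn as quant_nn

# Export TensorQuantizer nodes as ONNX QuantizeLinear/DequantizeLinear pairs.
quant_nn.TensorQuantizer.use_fb_fake_quant = True

model.eval()
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")  # placeholder shape

torch.onnx.export(
    model, dummy_input, "quant_model.onnx",
    opset_version=13,           # per-channel Q/DQ export needs opset >= 13
    do_constant_folding=True,
    input_names=["input"], output_names=["output"],
)
```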
@ttyio Hi, I was working on speeding up a ResNet18 model with TRT support. I followed the instructions in this user guide, modified my ResNet.py according to this tutorial, performed calibration, and exported the calibrated model to ONNX. However, when I deployed the quantized model, I observed slower performance. (The C++ API and trtexec gave the same result.)
Using trtexec:
quantized_model.onnx (INT8):
./trtexec --onnx=/home/xxx/projects/resnet18_onnx_trt/quant_resnet18.onnx --output=prob --int8 --maxBatch=10
Latency: min = 0.586945 ms, max = 2.11331 ms, mean = 0.634796 ms, median = 0.5979 ms, percentile(99%) = 0.873779 ms
original model (FP16):
./trtexec --onnx=/home/xxx/projects/resnet18_onnx_trt/resnet18.onnx --fp16 --maxBatch=10
Latency: min = 0.479614 ms, max = 0.661621 ms, mean = 0.490914 ms, median = 0.489624 ms, percentile(99%) = 0.52179 ms
One of my colleagues, who followed this tutorial and modified her YOLOv5 detection code, also encountered a performance drop.
Any suggestions on how to fix this problem?
The ONNX models have been uploaded to reproduce the problem.
Environment
TensorRT Version: 8.0.3.4-1+cuda10.2
NVIDIA GPU: RTX 2080
NVIDIA Driver Version: 450.66
CUDA Version: cuda_10_2
CUDNN Version: cudnn8
Operating System: Ubuntu 16.04
PyTorch Version (if applicable): 1.10.0
Baremetal or Container (if so, version): bleakie/cuda10.2_cudnn8.0_ubuntu16.04
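Before sharing the model, a quick way to sanity-check where the Q/DQ nodes ended up is to list them with the onnx package; a small sketch follows, where the file name is a placeholder matching the model above.

```python
# Sketch: list where the QuantizeLinear/DequantizeLinear nodes ended up in the
# exported model. The file name is a placeholder.
import onnx

model = onnx.load("quant_resnet18.onnx")
qdq = ("QuantizeLinear", "DequantizeLinear")

for node in model.graph.node:
    if node.op_type in qdq:
        print(f"{node.op_type:17s} inputs: {list(node.input)}")

num_q = sum(n.op_type == "QuantizeLinear" for n in model.graph.node)
num_dq = sum(n.op_type == "DequantizeLinear" for n in model.graph.node)
print(f"QuantizeLinear: {num_q}, DequantizeLinear: {num_dq}")
```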
@ttyio Hey, I am hitting the same error. Any update?
@IAMLYCHEE @liuanhua110 When QAT runs slower than FP16, it usually means that the Q/DQ placement is not optimal. Could you share your quantized ONNX model(s) so that we can tell you where to add/remove Q/DQ ops? Thanks
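For ResNet-style models, the typical placement fix from the "further optimization" tutorial linked above is to quantize the residual input as well, so the elementwise add can run in INT8. A sketch of that pattern is below; the class and attribute names are illustrative and loosely follow torchvision's BasicBlock.

```python
# Sketch of the residual-add fix: quantize the skip-connection input so both
# inputs of the add carry Q/DQ nodes and the add can stay in INT8.
import torch.nn as nn
from pytorch_quantization import nn as quant_nn

class QuantBasicBlock(nn.Module):
    def __init__(self, conv1, bn1, conv2, bn2, downsample=None):
        super().__init__()
        self.conv1, self.bn1 = conv1, bn1
        self.conv2, self.bn2 = conv2, bn2
        self.downsample = downsample
        self.relu = nn.ReLU(inplace=True)
        # Extra quantizer on the residual path, sharing the conv input descriptor.
        self.residual_quantizer = quant_nn.TensorQuantizer(
            quant_nn.QuantConv2d.default_quant_desc_input)

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out = out + self.residual_quantizer(identity)
        return self.relu(out)
```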
Closing due to >14 days without activity. Please feel free to reopen if the issue still exists. Thanks
@nvpohanh Please help me with my question about INT8 quantization being slower.
Hi @nvpohanh, the following file is the quantized ONNX model produced according to the official guide. Could you please check the Q/DQ placement so that the inference speed becomes more reasonable? https://drive.google.com/file/d/16inPpOfaJWXjtXn_fOmBXWMhp56o_Ux9/view?usp=sharing
Description
So I used the PTQ sample code to quantize my model from FP16 to INT8. The model is a deepfake auto-encoder, and the PTQ INT8 output images are correct with only a small loss in accuracy. The model went from 1.47 GB (original FP16) to 370 MB (PTQ INT8). However, during inference on Windows, profiling latency with trtexec.exe, the INT8 engine (15.1957 ms) is slower than the original FP16 engine (12.1411 ms); see the dumpProfiles links below. Am I doing anything wrong on the inference side or in the PTQ?
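It is not stated which PTQ path was used here; for reference, one common PTQ route is a TensorRT entropy calibrator along the lines of the sketch below, where the batch source and cache file name are placeholders.

```python
# Sketch of a classic TensorRT PTQ entropy calibrator; the batch source and
# cache file name are placeholders.
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file="calib.cache"):
        super().__init__()
        self.batches = iter(batches)   # iterable of np.float32 input batches
        self.cache_file = cache_file
        first = next(self.batches)
        self.batch_size = first.shape[0]
        self.device_mem = cuda.mem_alloc(first.nbytes)
        self._pending = first          # hold the first batch for the first get_batch call

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        batch = self._pending if self._pending is not None else next(self.batches, None)
        self._pending = None
        if batch is None:
            return None                # no more data: calibration is done
        cuda.memcpy_htod(self.device_mem, np.ascontiguousarray(batch))
        return [int(self.device_mem)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# Example usage with random placeholder batches:
# calibrator = EntropyCalibrator(
#     [np.random.rand(4, 3, 256, 256).astype(np.float32) for _ in range(8)])
```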
Environment
TensorRT Version: 8.0.1.6
NVIDIA GPU: RTX 2080
NVIDIA Driver Version: 471.11
CUDA Version: cuda_11.3.1_465.89_win10
CUDNN Version: cudnn-11.3-windows-x64-v8.2.1.32
Operating System: Windows 10
Python Version (if applicable): 3.7.8
Tensorflow Version (if applicable): None
PyTorch Version (if applicable): 1.10.0a0+3fd9dcf
Baremetal or Container (if so, version): nvcr.io/nvidia/pytorch:21.08-py3, NVIDIA Release 21.08 (build 26011915)