NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

PTQ quantization int8 is slower than fp16 #1532

Closed · joytsay closed this issue 2 years ago

joytsay commented 3 years ago

Description

I used the PTQ sample code to quantize a model from fp16 to int8. My model is a deepfake auto-encoder, and the PTQ int8 output images are correct, with only a small loss in accuracy. The model shrank from 1.47 GB (original fp16) to 370 MB (PTQ int8). However, when profiling latency on Windows with trtexec.exe, the int8 engine (15.1957 ms) is slower than the original fp16 engine (12.1411 ms); see the dumpProfiles links below.

Am I doing anything wrong on the inference side or in the PTQ step?
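For reference, the build flow I'm using looks roughly like this. It is a minimal sketch with the TensorRT Python API; the ONNX path, the `NpyCalibrator` helper, and the calibration data layout are illustrative placeholders rather than the actual pipeline, and it assumes a static-shape ONNX export:

```python
# Minimal PTQ sketch: build an int8 engine from an ONNX model with an
# entropy calibrator. Assumes preprocessed calibration tensors saved as
# .npy files that all match the network input shape.
import glob
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

LOGGER = trt.Logger(trt.Logger.INFO)

class NpyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, files, cache_file="calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.files, self.cache_file, self.idx = files, cache_file, 0
        self.device_mem = None

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        if self.idx >= len(self.files):
            return None  # no more calibration data
        batch = np.ascontiguousarray(np.load(self.files[self.idx]).astype(np.float32))
        self.idx += 1
        if self.device_mem is None:
            self.device_mem = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_mem, batch)
        return [int(self.device_mem)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

builder = trt.Builder(LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, LOGGER)
with open("autoencoder.onnx", "rb") as f:      # placeholder path
    assert parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)          # allow per-layer fallback
config.int8_calibrator = NpyCalibrator(sorted(glob.glob("calib/*.npy")))

engine_bytes = builder.build_serialized_network(network, config)
with open("autoencoder_int8.engine", "wb") as f:
    f.write(engine_bytes)
```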

Environment

TensorRT Version: 8.0.1.6
NVIDIA GPU: RTX 2080
NVIDIA Driver Version: 471.11
CUDA Version: cuda_11.3.1_465.89_win10
CUDNN Version: cudnn-11.3-windows-x64-v8.2.1.32
Operating System: Windows 10
Python Version (if applicable): 3.7.8
Tensorflow Version (if applicable): None
PyTorch Version (if applicable): 1.10.0a0+3fd9dcf
Baremetal or Container (if so, version): nvcr.io/nvidia/pytorch:21.08-py3 (NVIDIA Release 21.08, build 26011915)

anxietymonger commented 3 years ago

I'm encountering a similar performance drop from fp16 to int8 with an LSTM model.

Environment

TensorRT Version: 8.2.0.6 EA
GPU Type: A100
Nvidia Driver Version: 455.23.05
CUDA Version: 11.1
CUDNN Version: 8.0
Operating System + Version: CentOS 8.1
Python Version (if applicable): 3.7
PyTorch Version (if applicable): 1.8

anxietymonger commented 3 years ago

After profiling the program with the nsys tool, I found that the int8 quantized model is not using tensor core kernels. Maybe that is why int8 runs slower than fp16? Are there any possible reasons for this, or suggestions on how I can enable tensor cores for an int8 quantized LSTM model? Thanks in advance.

In my understanding, the A100 GPUs I am using should support int8 tensor cores.
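In case it helps others, one way to cross-check which kernels and precisions the builder actually picked (besides nsys) is the engine inspector added in TensorRT 8.2. A rough sketch, assuming a serialized engine file on disk and an engine built with detailed profiling verbosity:

```python
# Sketch: dump per-layer information from a built engine to see which kernels
# and precisions TensorRT actually selected. Assumes TensorRT >= 8.2 and an
# engine built with profiling verbosity set to DETAILED (e.g.
# config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED, or
# trtexec --profilingVerbosity=detailed); otherwise the output is sparse.
import json
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)

with open("model_int8.engine", "rb") as f:   # placeholder engine path
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
info = json.loads(inspector.get_engine_information(trt.LayerInformationFormat.JSON))

# With DETAILED verbosity each entry includes the selected tactic/kernel and
# the layer precision; int8 tensor-core kernels are usually recognizable from
# the kernel name.
for layer in info.get("Layers", []):
    print(layer if isinstance(layer, str) else json.dumps(layer))
```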

ttyio commented 3 years ago

https://github.com/NVIDIA/TensorRT/blob/master/tools/pytorch-quantization/examples/calibrate_quant_resnet50.ipynb is a basic sample that uses automatic QAT. For models like ResNet there is an advanced topic on how to further speed up inference; please check https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/tutorials/quant_resnet50.html#further-optimization

There is also a more detailed document from the TRT perspective: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#qdq-placement-recs
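For anyone reading along, the "further optimization" in that tutorial essentially quantizes the residual/skip input so the elementwise add can also run in int8. A rough sketch of the idea for a torchvision-style BasicBlock (module and attribute names here are illustrative; see the linked tutorial for the exact code):

```python
# Sketch of the residual-add optimization: add an extra TensorQuantizer on the
# skip connection so the elementwise add has Q/DQ on both inputs and can stay
# in int8 instead of falling back to higher precision.
import torch.nn as nn
from pytorch_quantization import nn as quant_nn

class QuantBasicBlock(nn.Module):
    def __init__(self, conv1, bn1, conv2, bn2, downsample=None):
        super().__init__()
        self.conv1, self.bn1 = conv1, bn1
        self.conv2, self.bn2 = conv2, bn2
        self.downsample = downsample
        self.relu = nn.ReLU(inplace=True)
        # Extra quantizer for the residual branch (the key change).
        self.residual_quantizer = quant_nn.TensorQuantizer(
            quant_nn.QuantConv2d.default_quant_desc_input)

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        # Quantize the skip input before the add, matching the conv branch.
        out = out + self.residual_quantizer(identity)
        return self.relu(out)
```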

anxietymonger commented 2 years ago

@ttyio Thanks for the reply. I have already tried the QAT sample for ResNet, and the method mentioned in the advanced topic does indeed speed up inference. But the LSTM case is a completely different story: since there is only one LSTM layer in my model, there is no opportunity to optimize the quantization nodes (like the residual-add optimization for ResNet) when using the pytorch-quantization tool.

IAMLYCHEE commented 2 years ago

@ttyio Hi, I was working on speeding up a ResNet18 model with TRT support. I followed the instructions in this user guide, modified my ResNet.py according to this tutorial, performed calibration, and exported the calibrated model to ONNX. However, when I deployed the quantized model, I observed slower performance (the C++ API and trtexec gave the same result).
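The calibration and export steps I followed look roughly like this. This is only a sketch of the pytorch-quantization flow from the tutorial; the dataloader, model construction, and output path are placeholders:

```python
# Rough sketch of the calibrate-then-export flow from the pytorch-quantization
# tutorial; `model` is a ResNet18 built after quant_modules.initialize(), and
# `calib_loader` is a placeholder DataLoader of preprocessed images.
import torch
from pytorch_quantization import calib
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

quant_modules.initialize()   # monkey-patch torch.nn layers before building the model
# model = torchvision.models.resnet18(pretrained=True).cuda().eval()

def collect_stats(model, loader, num_batches=32):
    # Switch quantizers to calibration mode and run a few batches.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            module.disable_quant()
            module.enable_calib()
    with torch.no_grad():
        for i, (images, _) in enumerate(loader):
            if i >= num_batches:
                break
            model(images.cuda())
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            module.enable_quant()
            module.disable_calib()

def compute_amax(model, **kwargs):
    # Load the collected calibration statistics into each quantizer.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer) and module._calibrator is not None:
            if isinstance(module._calibrator, calib.MaxCalibrator):
                module.load_calib_amax()
            else:
                module.load_calib_amax(**kwargs)

# collect_stats(model, calib_loader)
# compute_amax(model, method="percentile", percentile=99.99)

# Export with fake-quant nodes so TensorRT sees explicit Q/DQ ops.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
# dummy = torch.randn(1, 3, 224, 224, device="cuda")
# torch.onnx.export(model, dummy, "quant_resnet18.onnx",
#                   opset_version=13, do_constant_folding=True)
```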

Using trtexec:

Quantized model (int8):

./trtexec --onnx=/home/xxx/projects/resnet18_onnx_trt/quant_resnet18.onnx --output=prob --int8 --maxBatch=10

Latency: min = 0.586945 ms, max = 2.11331 ms, mean = 0.634796 ms, median = 0.5979 ms, percentile(99%) = 0.873779 ms

Original model (fp16):

./trtexec --onnx=/home/xxx/projects/resnet18_onnx_trt/resnet18.onnx --fp16 --maxBatch=10

Latency: min = 0.479614 ms, max = 0.661621 ms, mean = 0.490914 ms, median = 0.489624 ms, percentile(99%) = 0.52179 ms

One of my colleagues, who followed this tutorial and modified her YOLOv5 detection code, also encountered a performance drop.

Any suggestions on how to fix this problem?

The ONNX models have been uploaded to reproduce the problem.

Environment

TensorRT Version: 8.0.3.4-1+cuda10.2
NVIDIA GPU: RTX 2080
NVIDIA Driver Version: 450.66
CUDA Version: cuda_10_2
CUDNN Version: cudnn8
Operating System: Ubuntu 16.04
PyTorch Version (if applicable): 1.10.0
Baremetal or Container (if so, version): bleakie/cuda10.2_cudnn8.0_ubuntu16.04

liuanhua110 commented 2 years ago

@ttyio Hi, I'm hitting the same issue as @IAMLYCHEE above. Any update?

nvpohanh commented 2 years ago

@IAMLYCHEE @liuanhua110 When QAT runs slower than FP16, it usually means that the Q/DQ placement is not optimal. Could you share your quantized ONNX model(s) so that we can tell you where to add/remove Q/DQ ops? Thanks
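In the meantime, a quick way to sanity-check the placement yourself is to list where the Q/DQ nodes sit in the exported graph. A small sketch using the onnx package (the model path is a placeholder):

```python
# List QuantizeLinear/DequantizeLinear nodes and which ops consume their
# outputs, as a quick sanity check of Q/DQ placement in an exported model.
import onnx

model = onnx.load("quant_resnet18.onnx")   # placeholder path
graph = model.graph

# Map tensor name -> nodes that consume it.
consumers = {}
for node in graph.node:
    for inp in node.input:
        consumers.setdefault(inp, []).append(node)

for node in graph.node:
    if node.op_type in ("QuantizeLinear", "DequantizeLinear"):
        next_ops = [c.op_type for out in node.output for c in consumers.get(out, [])]
        print(f"{node.op_type:17s} {node.name or '(unnamed)':40s} -> {next_ops}")
```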

nvpohanh commented 2 years ago

Closing due to >14 days without activity. Please feel free to reopen if the issue still exists. Thanks

tonyskypc commented 2 years ago

@nvpohanh Please help me with my question about INT8 quantization being slower.

IAMLYCHEE commented 2 years ago

Hi nvpohanh, the following file is the quantized ONNX model produced according to the official guide. Could you please check the Q/DQ placement so that the inference speed becomes more reasonable? https://drive.google.com/file/d/16inPpOfaJWXjtXn_fOmBXWMhp56o_Ux9/view?usp=sharing