NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Accuracy loss of TensorRT 8.6 when running INT8 Quantized Resnet18 on GPU A4000 #4079

Open YixuanSeanZhou opened 1 month ago

YixuanSeanZhou commented 1 month ago

Description

When performing ResNet18 PTQ using TRT ModelOpt, I encountered the following issue when compiling the model with TRT.

First off, I started with a pretrained ResNet18 from torchvision and replaced the last fully connected layer to fit my dataset (for example, CIFAR-10). I also replaced all the skip connections (the `+` operations) with an ElementwiseAdd module and defined its quantization layer myself (code attached at the end). The reason I do this is to facilitate Q/DQ fusion so that every layer can run in INT8.
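
A minimal sketch of what such a replacement module can look like (simplified; the TensorQuantizer import path is my assumption from ModelOpt, and the actual code attached at the end may differ):

import torch
import torch.nn as nn
from modelopt.torch.quantization.nn import TensorQuantizer  # assumed ModelOpt import path

class QuantElementwiseAdd(nn.Module):
    """Explicit residual add with its own input quantizers, replacing `out += identity`,
    so Q/DQ nodes appear on both addition inputs and the whole block can stay in INT8."""

    def __init__(self):
        super().__init__()
        self.input_quantizer = TensorQuantizer()    # quantizes the main branch
        self.input_quantizer_2 = TensorQuantizer()  # quantizes the skip branch

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.input_quantizer(x) + self.input_quantizer_2(y)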

Then, when compiling the exported ONNX model with TRT, I found that TRT's outputs are very different from the fake Q/DQ model in Python, as well as from the fake Q/DQ ONNX model run with ONNX Runtime (np.allclose with a 1e-3 tolerance fails). Comparing TRT and native outputs, the classification results disagree on ~2.3% of samples.

I discussed this with the TRT ModelOpt team in this issue, and they suggested filing a bug report here.

Environment

TensorRT Version: 8.6.1

NVIDIA GPU: A4000

NVIDIA Driver Version: 535.183.01

CUDA Version: 12.2

Python Version (if applicable): 3.10.1

PyTorch Version (if applicable): 2.4.0+cu124

Relevant Files

Model link: You can download the onnx model and the TRT engine here: https://file.io/GnuiEMNeebQ1

Steps To Reproduce

Run the TRT engine using the Python API and the ONNX model on the CIFAR-10 test set with the following data loader, then compare the results.

import torch
from torchvision import datasets  # `transform` is the preprocessing pipeline defined elsewhere in the script

testset = datasets.CIFAR10(root='./data', download=True, train=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=1, shuffle=False)
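
For completeness, a hedged sketch of the comparison using Polygraphy's Python API (not my exact script; the file paths and the input tensor name "input" are placeholders):

from polygraphy.backend.common import BytesFromPath
from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import EngineFromBytes, TrtRunner

load_engine = EngineFromBytes(BytesFromPath("resnet18_int8.engine"))  # placeholder path
load_session = SessionFromOnnx("resnet18_qdq.onnx")                   # placeholder path

disagreements, total = 0, 0
with TrtRunner(load_engine) as trt_runner, OnnxrtRunner(load_session) as ort_runner:
    for images, _ in testloader:
        feed = {"input": images.numpy()}  # assumed input name; check with `polygraphy inspect model`
        trt_out = list(trt_runner.infer(feed).values())[0]
        ort_out = list(ort_runner.infer(feed).values())[0]
        total += 1
        disagreements += int(trt_out.argmax() != ort_out.argmax())

print(f"TRT vs ONNX Runtime classification disagreement: {disagreements / total:.2%}")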

Have you tried the latest release?: I haven't tried TRT 10, and we don't plan to upgrade in the short term. I was under the impression that 8.6 should be okay.

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Yes. ONNX Runtime shows ~1% classification disagreement with the native model.

Appendix

Visualizing the TRT engine, it is entirely within my expectations, with everything fused into INT8 kernels. (Engine visualization: trt_engine_0)

lix19937 commented 3 weeks ago

Did you start from a pretrained model for QAT?

If yes, does the FP32 (un-quantized) model also show inconsistent results between TRT and ONNX Runtime?

Have you tried a different version of TRT?

Have you tried other calibration methods during finetuning?

YixuanSeanZhou commented 3 weeks ago

@lix19937 Thank you very much for helping out!

Did you start from a pretrained model for QAT?

Yes. To be specific, what I did was PTQ using TRT ModelOpt.

does the FP32 (un-quantized) model also show inconsistent results between TRT and ONNX Runtime

No. The FP32 model's outputs, when converted to TRT, were almost identical to the native PyTorch model's. I didn't verify ONNX Runtime for this case.

Have you tried a different version of TRT?

Unfortunately this is not easy to do on my side right now. Are you expecting this to be fixed in TRT 10? I was under the impression that Q/DQ quantization has been supported since well before 8.6, so it shouldn't be a version issue. But correct me if I am wrong.

Have you tried other calibration methods during finetuning?

Do you mean the implicit calibration methods within TRT? If so, I am not using those; the ONNX model I provide to TRT already contains the Q/DQ nodes. If you mean other calibration methods in TRT ModelOpt, I tried both the Smoothing (SmoothQuant) and the default (MINMAX) calibration methods, and both show the same regression.
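
Schematically, the two ModelOpt setups I tried look like this (a simplified sketch using ModelOpt's documented default configs; `calibrate` is the forward loop over my calibration data and `model` is the torchvision ResNet18):

import copy

import modelopt.torch.quantization as mtq

# Default (MINMAX) calibration
model_minmax = mtq.quantize(copy.deepcopy(model), mtq.INT8_DEFAULT_CFG, forward_loop=calibrate)

# SmoothQuant ("Smoothing") calibration
model_smooth = mtq.quantize(copy.deepcopy(model), mtq.INT8_SMOOTHQUANT_CFG, forward_loop=calibrate)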

Thanks and looking forward to your responses!

akhilg-nv commented 2 weeks ago

Following up from the discussion on your issue in TRT ModelOpt: it is possible the accuracy degradation comes from the fusions performed with INT8 convolutions. Could you try removing Q/DQ for all convolutional layers (not just the first one) and compare the accuracy with torch? You should be able to do this by modifying the config as you've done before, or by using a filter function over the convolutional layers in your torch model. Your specific application may work better for now without quantizing the conv layers, and this will also help us understand and investigate the root cause of the accuracy discrepancy you are seeing.

import re

import modelopt.torch.quantization as mtq


def filter_func(name):
    # Example layer-name patterns; replace with the conv layer names in your model.
    pattern = re.compile(
        r".*(conv_in|conv_out|conv_shortcut|etc).*"
    )
    return pattern.match(name) is not None


# and/or apply the filter function when creating your quantization config.
# See demo/Diffusion/utils_modelopt.py for an example.
mtq.disable_quantizer(model, filter_func)

YixuanSeanZhou commented 3 days ago

Got it, thanks for the follow-up @akhilg-nv. Sorry for the delay; I was away last week and this week. I can try the experiment you suggested next week.

To clarify: if we skip quantizing the conv layers, the only quantized layer left will be the last one, the fully connected classification layer. Is that okay?

Your specific application may work better for now without quantizing the conv layers

Unfortunately that won't work for us; the goal of quantizing those layers is to accelerate model inference latency. However, I will certainly run the experiment to see whether the regression goes away.
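
For that experiment, a config-based sketch of what I plan to try, building on the suggestion above (the wildcards are illustrative and `calibrate` is my calibration loop):

import copy

import modelopt.torch.quantization as mtq

# Start from the default INT8 config and disable every conv quantizer by wildcard,
# so only the final fully connected layer keeps its Q/DQ nodes.
cfg = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
cfg["quant_cfg"]["*conv*"] = {"enable": False}        # torchvision convs: conv1, layerX.Y.convZ
cfg["quant_cfg"]["*downsample*"] = {"enable": False}  # 1x1 convs in the shortcut branches

model = mtq.quantize(model, cfg, forward_loop=calibrate)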