NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

How does TensorRT load a quantized ONNX model? #2685

Closed. zhanghuqiang closed this issue 1 year ago.

zhanghuqiang commented 1 year ago

Description

I exported the model using the code below, which is copied from https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/userguide.html#export-to-onnx. Here's the code:

from pytorch_quantization import nn as quant_nn
import torch
from torchvision import models as thmodels
from pytorch_quantization import quant_modules

# Monkey-patch torch modules with their quantized counterparts before building the model.
quant_modules.initialize()
model = thmodels.resnet50()

model.cuda()
model.eval()
dummy_inputs = torch.randn(32, 3, 224, 224, device='cuda')

# Export fake-quantization nodes as ONNX QuantizeLinear/DequantizeLinear pairs.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model, dummy_inputs, 'test.onnx', opset_version=13)

and get some warnings:

WARNING: Logging before flag parsing goes to stderr.
W0215 18:25:03.595652 28296 tensor_quantizer.py:281] Use Pytorch's native experimental fake quantization.
C:\Users\zhang\anaconda3\envs\torchConda\lib\site-packages\pytorch_quantization\nn\modules\tensor_quantizer.py:284: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if amax.numel() == 1:
C:\Users\zhang\anaconda3\envs\torchConda\lib\site-packages\pytorch_quantization\nn\modules\tensor_quantizer.py:286: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  inputs, amax.item() / bound, 0,
C:\Users\zhang\anaconda3\envs\torchConda\lib\site-packages\pytorch_quantization\utils\reduce_amax.py:61: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if not keepdims or output.numel() == 1:
C:\Users\zhang\anaconda3\envs\torchConda\lib\site-packages\pytorch_quantization\nn\modules\tensor_quantizer.py:292: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  quant_dim = list(amax.shape).index(list(amax_sequeeze.shape)[0])

The ONNX model seems to be exported successfully. But when I try to load it using TensorRT,

// Create the builder.
auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(Util::gLogger));
if (!builder)
    return false;

// Create an explicit-batch network definition.
const auto explicitBatch = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(explicitBatch));
if (!network)
    return false;

// Builder configuration.
auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
if (!config)
    return false;

// Create the ONNX parser and parse the exported model with verbose logging.
auto parser = SampleUniquePtr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, Util::gLogger));
if (!parser)
    return false;

auto parsed = parser->parseFromFile(modelPath.c_str(), static_cast<int>(nvinfer1::ILogger::Severity::kVERBOSE));
if (!parsed)
    return false;

The error information is:

[MemUsageChange] Init CUDA: CPU +442, GPU +0, now: CPU 9690, GPU 1192 (MiB)
[MemUsageChange] Init builder kernel library: CPU +211, GPU +68, now: CPU 10092, GPU 1260 (MiB)
----------------------------------------------------------------
Input filename:   D:\Code\PyTorch_classification\test.onnx
ONNX IR version:  0.0.7
Opset version:    13
Producer name:    pytorch
Producer version: 1.12.0
Domain:
Model version:    0
Doc string:
----------------------------------------------------------------
onnx::QuantizeLinear_1096: invalid weights type of Int8
ModelImporter.cpp:773: While parsing node number 0 [Identity -> "onnx::QuantizeLinear_1257"]:
ModelImporter.cpp:774: --- Begin node ---
ModelImporter.cpp:775: input: "onnx::QuantizeLinear_1096"
output: "onnx::QuantizeLinear_1257"
name: "Identity_0"
op_type: "Identity"

ModelImporter.cpp:776: --- End node ---
ModelImporter.cpp:779: ERROR: ModelImporter.cpp:180 In function parseGraph:
[6] Invalid Node - Identity_0
onnx::QuantizeLinear_1096: invalid weights type of Int8

Is there something wrong with what I'm doing, or is this a bug?

Environment

TensorRT Version: 8.4.1.5
NVIDIA GPU: 2060
NVIDIA Driver Version:
CUDA Version: 11.6
CUDNN Version: 8.4.1.50
Operating System: Windows
Python Version (if applicable): 3.8
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.12
Baremetal or Container (if so, version):

Relevant Files

Steps To Reproduce

zerollzeng commented 1 year ago

Looks like something is wrong with your ONNX model. @ttyio, any suggestions here?

ttyio commented 1 year ago

@zhanghuqiang, we have supported Constant(Int8) + DQ since TRT 8.5. For 8.4, please disable constant folding during the ONNX export, thanks!

       torch.onnx.export(..., do_constant_folding=False, ...)
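
For reference, here is a minimal self-contained sketch of this workaround. It reuses the export setup from the original report; the only change is the added do_constant_folding flag, so treat it as an illustration rather than a verified fix for every model.

from pytorch_quantization import nn as quant_nn, quant_modules
import torch
from torchvision import models as thmodels

quant_modules.initialize()
model = thmodels.resnet50().cuda().eval()
dummy_inputs = torch.randn(32, 3, 224, 224, device='cuda')
quant_nn.TensorQuantizer.use_fb_fake_quant = True

# do_constant_folding=False keeps the Q/DQ pattern in the exported graph instead
# of folding weights into Int8 constants, which the TensorRT 8.4 parser rejects.
torch.onnx.export(model, dummy_inputs, 'test.onnx', opset_version=13,
                  do_constant_folding=False)
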
anilknayak commented 1 year ago

I also have a related question: is there any difference in loading the serialized TRT engine for FP32, FP16, and INT8?

I took the following approach:

  1. I have an ONNX model with FP32 precision.
  2. I built separate INT8 and FP16 TRT engines from the FP32 ONNX model.

When I load (deserialize) the FP16 TRT engine and run inference on ARM64, it gives the same inference time as the INT8 TRT engine.

Our ONNX model is a detection model (EfficientDet).

Could you please help me understand the process of deserializing the INT8 model so that we can get the inference-time benefit?

ttyio commented 1 year ago

@anilknayak sorry for the late response. Deserialization is the same for engine files of any precision: the kernels are already selected when the engine file is saved. Make sure the engine is built and run on the same device. You can also enable the verbose log level during the engine build process to get more details.
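
To illustrate that the loading path is identical regardless of build precision, here is a minimal sketch using the TensorRT Python API (the engine file name is hypothetical). The same deserialize call is used whether the engine was built in FP32, FP16, or INT8, because the precision and kernel choices were fixed at build time.

import tensorrt as trt

# Verbose logger; also useful during the engine build, as suggested above.
TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

def load_engine(engine_path):
    # Deserialization does not depend on the precision the engine was built with;
    # the kernels were already selected when the engine was serialized.
    with open(engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

# The call is the same for an INT8 or an FP16 engine file, e.g.:
# engine = load_engine("efficientdet_int8.engine")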

ttyio commented 1 year ago

Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!