NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

An error occurred when converting the ONNX model to a TensorRT engine. #10

Closed ymgwjk closed 5 months ago

ymgwjk commented 6 months ago

I encountered an error when converting my INT8 ONNX model to a TensorRT engine. The quantization and export code is:

import torch
import modelopt.torch.quantization as mtq

model = Model().cuda()
model.eval()

val_loader = cfg.val_dataloader

config = mtq.INT8_SMOOTHQUANT_CFG

# Calibration loop: run a few validation batches through the model.
def forward_loop(model):
    for batch in val_loader:
        img, target = batch
        img = img.cuda()
        model(img)

model = mtq.quantize(model, config, forward_loop)

mtq.print_quant_summary(model)

# Export the quantized model (with Q/DQ nodes) to ONNX.
data = torch.rand(1, 3, 640, 640).cuda()

torch.onnx.export(
    model,
    data,
    args.file_name,
    input_names=['images'],
    output_names=['pred_logits', 'pred_boxes'],
    opset_version=16,
    verbose=False,
)

The TensorRT version is 8.6.1.6. When I try to convert the ONNX model to a TensorRT engine with this command:

$ trtexec --onnx=model-int8.onnx --saveEngine=model-int8.engine --int8

I encountered the following error:

[05/18/2024-09:39:43] [E] Error[10]: Could not find any implementation for node /model/encoder/fpn_blocks.0/conv1/conv/input_quantizer/QuantizeLinear_clone_0.
[05/18/2024-09:39:44] [E] Error[10]: [optimizer.cpp::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node /model/encoder/fpn_blocks.0/conv1/conv/input_quantizer/QuantizeLinear_clone_0.)

It seems like mtq.quantize inserted some quantized layers, but TensorRT can't find implementations for them. What is the proper way to quantize the model and convert it to a TensorRT engine?

riyadshairi979 commented 6 months ago

We will need to see the torch code to be sure, but my guess is that the resultant QDQ placement (i.e. the selection of operations for quantization) is not supported by TensorRT; maybe quantized conv weights got shared. Please share the torch and/or ONNX model with us so we can have a look. Alternatively, you can export the torch model to ONNX first and use the modelopt.onnx.quantization tool to do PTQ before deploying with TensorRT, though SmoothQuant is not supported in that workflow. cc @realAsma
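For reference, a minimal sketch of that alternative ONNX PTQ workflow. The modelopt.onnx.quantization.quantize argument names used here (onnx_path, calibration_data, output_path) are assumptions based on the tool's documented PTQ interface, not taken from this thread; check the documentation for the exact signature.

import numpy as np
from modelopt.onnx.quantization import quantize

# Calibration data shaped like the model input; a few real validation
# batches are preferable to random data. (Argument names below are an
# assumption about the PTQ interface.)
calib_data = np.random.rand(32, 3, 640, 640).astype(np.float32)

quantize(
    onnx_path="model-fp32.onnx",    # plain FP32 export of the torch model
    calibration_data=calib_data,
    output_path="model-int8.onnx",  # quantized ONNX to pass to trtexec --int8
)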

realAsma commented 6 months ago


I agree with @riyadshairi979. TensorRT does not support INT8_SMOOTHQUANT_CFG for models with Conv layers, such as CNNs. INT8_SMOOTHQUANT_CFG works well for LLMs deployed with TensorRT-LLM.
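For a Conv-based model like this one, a minimal sketch of the config swap, assuming mtq.INT8_DEFAULT_CFG (the non-SmoothQuant INT8 config in modelopt.torch.quantization) is the intended replacement; the rest of the original snippet stays the same.

import modelopt.torch.quantization as mtq

# Use the plain INT8 config for Conv-heavy models; SmoothQuant targets
# LLM-style matmul layers, and its QDQ placement around Conv layers is
# not supported by TensorRT.
config = mtq.INT8_DEFAULT_CFG

model = mtq.quantize(model, config, forward_loop)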