NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

How does Model Optimizer compare with the default TensorRT 10 optimizations? #5

Closed yuvraj108c closed 1 month ago

yuvraj108c commented 2 months ago

E.g., converting ONNX to TensorRT using trtexec vs. using this Model Optimizer to generate the optimized TensorRT engine?

Which engine will give faster speeds?

riyadshairi979 commented 2 months ago

> E.g., converting ONNX to TensorRT using trtexec vs. using this Model Optimizer to generate the optimized TensorRT engine?
>
> Which engine will give faster speeds?

Assuming this question is related to ONNX quantization, certain models can achieve faster speeds with the default TensorRT 10 implicit quantization (IQ), but ModelOpt's explicit quantization (EQ) will generally yield faster speeds for most network types, especially for Vision Transformers (ViTs). If users find that their network is not faster with ModelOpt quantization, they should file a bug report with a reproducible example.

TensorRT will deprecate IQ in future releases and recommends EQ for users seeking better speed and accuracy control.

yuvraj108c commented 2 months ago

The most popular TensorRT optimization pipeline is to first export a .pth model to .onnx, e.g.:

import torch

# Load the trained model (assumes model.pth stores a full pickled nn.Module)
model = torch.load('model.pth')
model.eval()
model = model.cuda()  # move the model to GPU to match the CUDA dummy input below

# Dummy input matching the model's expected input shape
x = torch.rand(1, 3, 512, 512)
x = x.cuda()

torch.onnx.export(model,
                  x,
                  "./model.onnx",
                  verbose=True,
                  input_names=['input'],
                  output_names=['output'],
                  opset_version=17,
                  export_params=True)

then convert the ONNX model to a TensorRT engine, e.g. using trtexec:

trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
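
As an aside, before building the engine it can be worth sanity-checking the exported graph. A minimal sketch, assuming the onnx Python package is installed (this step is not required by the pipeline above):

import onnx

# Load the exported graph and run ONNX's structural validity checks
onnx_model = onnx.load("./model.onnx")
onnx.checker.check_model(onnx_model)
print("ONNX export looks structurally valid")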

My question is: where does TensorRT-Model-Optimizer fit into all of this, and will it give better speeds than the current pipeline?

riyadshairi979 commented 2 months ago

> My question is: where does TensorRT-Model-Optimizer fit into all of this, and will it give better speeds than the current pipeline?

Users can use TensorRT-Model-Optimizer (aka modelopt) to deploy models with lower precisions such as fp8, int8, and int4. For fp16 engine deployment, modelopt is not needed.

Typically, users would generate the TensorRT engine with int8 precision using the following simplified command:

trtexec --onnx=model.onnx --saveEngine=model.engine --fp16 --int8

With the above, TensorRT uses implicit quantization (IQ) to select the precision of each layer of the model. While IQ can still perform better for some CNN models, it does not support Transformer-based models, lacks reproducibility, provides no accuracy control for users, and only supports the int8 quantization mode. modelopt aims to alleviate most of these issues.

With modelopt, users perform these two steps instead: explicit quantization (EQ) followed by trtexec compilation:

python -m modelopt.onnx.quantization --onnx_path=model.onnx --quantize_mode=int8 --calibration_data=calib.npy --output_path=model.quant.onnx
trtexec --onnx=model.quant.onnx --saveEngine=model.engine --fp16 --int8
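
For reference, a minimal sketch of how the calibration file passed via --calibration_data might be prepared. The file name calib.npy, the sample count, and the (N, 3, 512, 512) layout are assumptions based on the export example above; the exact format modelopt expects may differ by version, so check the modelopt.onnx.quantization documentation:

import numpy as np

# Stack a batch of representative, preprocessed inputs into one array.
# Replace the random data with real samples from your dataset.
samples = [np.random.rand(3, 512, 512).astype(np.float32) for _ in range(128)]
calib = np.stack(samples)  # assumed shape: (128, 3, 512, 512)
np.save('calib.npy', calib)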

In the case of int8, while explicit quantization (EQ) is expected to improve the accuracy of the model, it is not guaranteed to yield better speeds than the implicit quantization (IQ) pipeline for all network types. For instance, certain models may still achieve faster inference with the IQ workflow. It is therefore important to benchmark your specific model to determine the most appropriate optimization approach.
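
One straightforward way to do that benchmarking is to build an engine through each path and compare the timing summary that trtexec prints. A sketch under the assumption that the engine file names below are just placeholders:

# IQ path: TensorRT picks int8 layers itself
trtexec --onnx=model.onnx --saveEngine=model_iq.engine --fp16 --int8

# EQ path: ONNX already quantized by modelopt
trtexec --onnx=model.quant.onnx --saveEngine=model_eq.engine --fp16 --int8

# Re-run each engine and compare the reported latency/throughput
trtexec --loadEngine=model_iq.engine
trtexec --loadEngine=model_eq.engine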

yuvraj108c commented 2 months ago

Thanks for the clarifications! It now makes more sense.

--fp16 --int8 can be used at the same time?

riyadshairi979 commented 2 months ago

> --fp16 --int8 can be used at the same time?

Yes, it is equivalent to using --best.