NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, sparsity, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

CNN model opt int8 best practice example #46


korkland commented 1 month ago

Hi, can you share best practices for quantizing CNN models? Is modelopt PTQ the recommended path to TensorRT for CNN models (ResNet, RetinaNet, etc.)? I was able to quantize the RetinaNet backbone to int8, but the lack of examples and documented practices makes me wonder if that is the right approach.

Thanks

riyadshairi979 commented 1 month ago

See the example of how to quantize CNNs/ViTs using modelopt and deploy/evaluate with TensorRT. This is the recommended practice, but note that TensorRT's implicit quantization may provide better performance for certain models. Please create an issue with reproducible instructions (model, command, etc.) if that's the case.
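
For reference, the end-to-end flow is roughly: prepare a small calibration set, run modelopt's ONNX PTQ, then build the engine with trtexec. A minimal sketch; the input shape is illustrative, and the --calibration_data flag name should be verified against python -m modelopt.onnx.quantization --help:

import numpy as np

# 32 synthetic NCHW frames stand in for real preprocessed calibration
# images; replace with actual data matching your model's input.
calib = np.random.rand(32, 3, 800, 800).astype(np.float32)
np.save("calib.npy", calib)

$ python -m modelopt.onnx.quantization --onnx_path=model.onnx --quantize_mode=int8 --calibration_data=calib.npy
$ trtexec --onnx=model.quant.onnx --saveEngine=model.quant.engine --best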

korkland commented 1 month ago

I must say that I'm confused by the options that NVIDIA provides for quantization. We are targeting the Orin architecture and have our own CNN model based on RetinaNet. With the previous vendor, it was very clear: they had one tool. You would take your PyTorch model, convert it to ONNX, and use their tool for quantization, providing it a config with the nodes you want to quantize, calibration data, etc.

With NVIDIA, there are too many options, and we didn't find one that satisfied our needs.

There is implicit quantization, which, by the way, is deprecated as of TRT 10, so I think we shouldn't go in this direction. I've tried it anyway and it doesn't work out of the box; I'm getting this error, maybe someone could help:

trtexec --onnx=orig.onnx --saveEngine=orig.trt --best

[shapeMachine.cpp::executeContinuation::905] Error Code 7: Internal Error (/interpret_2d/nms/strategy/Expand_1: ISliceLayer has out of bounds access on axis 0 Out of bounds access for slice. Instruction: CHECK_SLICE 287 0 300 1.)

Is there an option to exclude the whole interpret_2d subgraph in implicit quantization?
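
(For what it's worth, trtexec can pin individual layers to a precision via --layerPrecisions together with --precisionConstraints=obey, which acts as a coarse way to keep a subgraph out of int8 in implicit mode. Whether the wildcard below matches the whole interpret_2d subtree is an assumption to verify against trtexec --help:)

$ trtexec --onnx=orig.onnx --saveEngine=orig.trt --best --precisionConstraints=obey --layerPrecisions=/interpret_2d/*:fp32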

And regarding explicit quantization:

Is there an option to manually add quantizers/dequantizers to the ONNX quantization flow?
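
(One way to hand-place Q/DQ pairs on an ONNX graph is onnx-graphsurgeon. A minimal sketch; the tensor name, scale, and zero point are placeholders, not values from this model:)

import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))

# Placeholder target tensor; pick the activation you want quantized.
t = graph.tensors()["backbone_conv1_out"]

scale = gs.Constant("qdq_scale", values=np.array(0.023, dtype=np.float32))
zp = gs.Constant("qdq_zp", values=np.array(0, dtype=np.int8))
q_out = gs.Variable("qdq_q_out", dtype=np.int8)
dq_out = gs.Variable("qdq_dq_out", dtype=np.float32)

q = gs.Node("QuantizeLinear", inputs=[t, scale, zp], outputs=[q_out])
dq = gs.Node("DequantizeLinear", inputs=[q_out, scale, zp], outputs=[dq_out])

# Rewire existing consumers of t to read the dequantized tensor instead.
for node in graph.nodes:
    node.inputs = [dq_out if inp is t else inp for inp in node.inputs]

graph.nodes.extend([q, dq])
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_qdq.onnx")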

By the way, when using quant_pre_process, the engine generation failed with the following, in case someone could help:

[07/24/2024-09:02:34] [E] [TRT] ModelImporter.cpp:828: While parsing node number 177 [ScatterND -> "/interpret_2d/nms/strategy/ScatterND_output_0"]:
[07/24/2024-09:02:34] [E] [TRT] ModelImporter.cpp:831: --- Begin node ---
input: "/interpret_2d/nms/strategy/Constant_17_output_0"
input: "/interpret_2d/nms/strategy/Constant_19_output_0"
input: "/interpret_2d/nms/strategy/Reshape_3_output_0"
output: "/interpret_2d/nms/strategy/ScatterND_output_0"
name: "/interpret_2d/nms/strategy/ScatterND"
op_type: "ScatterND"
attribute {
  name: "reduction"
  s: "none"
  type: STRING
}

[07/24/2024-09:02:34] [E] [TRT] ModelImporter.cpp:832: --- End node ---
[07/24/2024-09:02:34] [E] [TRT] ModelImporter.cpp:836: ERROR: onnxOpImporters.cpp:5119 In function importScatterND:
[9] Assertion failed: !attrs.count("reduction"): Attribute reduction is not supported.
[07/24/2024-09:02:34] [E] Failed to parse onnx file

Thanks

riyadshairi979 commented 1 month ago

trtexec --onnx=orig.onnx --saveEngine=orig.trt --best

If the original model doesn't compile with trtexec, you will need to fix the ONNX model before quantizing it. You can file an issue here with a reproducible model and commands.
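
For the ScatterND failure in the log above: reduction="none" is already the default for that attribute in ONNX, so stripping the explicit attribute is semantically a no-op and may get the model past the parser. A sketch, assuming that attribute is the only blocker:

import onnx

model = onnx.load("orig.onnx")
for node in model.graph.node:
    if node.op_type == "ScatterND":
        # reduction="none" is the ONNX default, so removing the explicit
        # attribute does not change the op's behavior.
        kept = [a for a in node.attribute if not (a.name == "reduction" and a.s == b"none")]
        del node.attribute[:]
        node.attribute.extend(kept)
onnx.save(model, "orig_fixed.onnx")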

Is this RetinaNet similar to yours? It compiles with TensorRT 10 successfully. Quantize the model using modelopt and compile using trtexec:

$ python -m modelopt.onnx.quantization --onnx_path=retinanet-9.onnx --quantize_mode=int8
$ trtexec --onnx=retinanet-9.quant.onnx --saveEngine=retinanet-9.quant.engine --best

We observe a 1.7x latency reduction for retinanet-9.quant.engine compared to the fp16 TensorRT engine.
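
(For reference, a plausible command for building the fp16 baseline used in that comparison:)

$ trtexec --onnx=retinanet-9.onnx --saveEngine=retinanet-9.fp16.engine --fp16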