NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Tried to apply PTQ to a basic CV CNN network and got slower model in the end? #14

Open tmagcaya opened 1 month ago

tmagcaya commented 1 month ago

I used mtq.INT8_DEFAULT_CFG as recommended for CNN networks (via mtq.quantize(model, config, forward_loop)). My initial model ran at 80 FPS; after quantization it dropped to 40 FPS. I checked the model structure and it looks like all of my Conv2d layers became QuantConv2d, with input and output quantizers as TensorQuantizer, after the quantize call.
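
Roughly the flow I used (a minimal sketch - the ResNet-50 and random calibration batches below are stand-ins for my actual model and data, which I can't share):

```python
import torch
import torchvision
import modelopt.torch.quantization as mtq

# Stand-in model and calibration data (the real network in this issue is proprietary).
model = torchvision.models.resnet50().cuda().eval()
calib_loader = [torch.randn(8, 3, 224, 224).cuda() for _ in range(16)]

def forward_loop(model):
    # Run a few calibration batches so the inserted quantizers can collect
    # activation statistics (amax) for computing quantization scales.
    with torch.no_grad():
        for batch in calib_loader:
            model(batch)

# Inserts fake-quantization modules (Conv2d -> QuantConv2d) and calibrates scales.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```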

Has this been tried for simple FPN or Resnet type models?

realAsma commented 1 month ago

ModelOpt quantization is fake quantization, which means it only simulates the low-precision computation in PyTorch. Real speedup and memory savings are achieved by exporting the model to a deployment framework. Learn more here

In this case, the model is a CNN, so the quantized model should be deployed to TensorRT.

Please see an example for exporting to ONNX and deploying to TensorRT here - https://github.com/NVIDIA/TensorRT/tree/main/samples/python/detectron2
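
As a rough sketch of that path (the file names, shapes, and opset below are placeholders, and the export call for quantized modules may need the ModelOpt/TensorRT export utilities from the linked example - this only shows the general shape of the flow):

```python
import torch

# `model` is the fake-quantized model returned by mtq.quantize(...) above.
dummy_input = torch.randn(1, 3, 224, 224).cuda()
torch.onnx.export(
    model,
    dummy_input,
    "model_quantized.onnx",   # ONNX graph with Q/DQ nodes carrying the calibrated scales
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
)

# Then build an INT8 TensorRT engine from the exported graph, e.g.:
#   trtexec --onnx=model_quantized.onnx --int8 --saveEngine=model_int8.engine
```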

tmagcaya commented 1 month ago

Got it. Is there a difference between TensorRT's own quantization and this type of quantization for CNN models, then? I think for CNNs we can only use the INT8_DEFAULT_CFG quantization config.

realAsma commented 1 month ago

The simulated quantization from ModelOpt allows:

  1. Calibrating the CNN model for real quantization with TensorRT (e.g. collecting weight and activation statistics such as amax to compute the quantization scales; these scales are exported during ONNX export and used while building the quantized TensorRT engine).
  2. Quantization-aware training (QAT): QAT can help improve model accuracy beyond PTQ, and ModelOpt simulated quantization allows you to perform QAT (a minimal sketch follows below). The inference speedup is still achieved by deploying the quantized model to TensorRT.
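
For illustration, a minimal QAT sketch (the ResNet-50, random data, and hyperparameters are placeholders; the key point is that fine-tuning happens with the ModelOpt quantizers already inserted):

```python
import torch
import torchvision
import modelopt.torch.quantization as mtq

# Placeholder model and data; in practice start from the model already
# calibrated with mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop).
model = torchvision.models.resnet50().cuda()
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG,
                     lambda m: m(torch.randn(8, 3, 224, 224).cuda()))
train_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,)))
                for _ in range(4)]

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

# Fine-tune with the fake-quantization ops in place so the weights adapt
# to the quantization error; export/deployment afterwards is the same as PTQ.
model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images.cuda()), labels.cuda())
    loss.backward()
    optimizer.step()
```
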
tmagcaya commented 1 month ago

I have yet to try QAT, but for PTQ here are the results I got. I'm confused - can you help me understand why INT8 had a lower FPS than FP16, even though it is quantized to an even lower precision than half precision?

- Model without compression: 21.8 mAP, 40.1 FPS
- Model with TensorRT: 21.8 mAP, 75.8 FPS
- Model with TRT + FP16 (through engine config params, no ModelOpt): 0.15 mAP, 132 FPS
- Model with ModelOpt INT8 + TRT: 0.1 mAP, 95 FPS

cjluo-omniml commented 1 month ago

@tmagcaya could you share your model architecture so that we can reproduce your result?

realAsma commented 1 month ago

My apologies for the performance degradation. The modelopt.torch.quantization speedup analysis has been focused on LLMs (deployed via TensorRT-LLM) and diffusion models (deployed via TRT).

It is quite possible for quantized models other than LLMs or diffusion models to be slower than TRT's un-quantized baseline. In this case, we recommend exporting the PyTorch model to ONNX first and then quantizing the ONNX graph via modelopt.onnx.quantization. Please see examples for quantizing an ONNX graph here.

Could you please try out quantizing the ONNX graph using modelopt.onnx.quantization instead?
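
A rough sketch of what that could look like (the import path and parameter names below are assumptions based on the ONNX PTQ examples; please check the linked examples for the exact interface):

```python
# NOTE: the exact signature is an assumption; see the ModelOpt ONNX PTQ examples
# for the authoritative interface.
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="model.onnx",            # exported FP32 ONNX graph
    quantize_mode="int8",
    calibration_data="calib.npy",      # representative inputs for calibration
    output_path="model.quant.onnx",    # quantized graph to feed to TensorRT
)
```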

However, one limitation of this approach is that modelopt.onnx.quantization does not support QAT. I recommend trying PTQ with modelopt.onnx.quantization first - if the accuracy is acceptable, QAT is not needed (QAT can improve accuracy, but does not give additional speedup over PTQ).

tmagcaya commented 1 month ago

I'm sorry, the architecture is proprietary, but I'll try to replicate the issue on an open architecture if I can find some time in the next few weeks.