NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Tried to apply PTQ to a basic CV CNN network and got a slower model in the end? #14

Open · tmagcaya opened this issue 5 months ago

tmagcaya commented 5 months ago

I used mtq.INT8_DEFAULT_CFG as recommended for CNN networks (mtq.quantize(model, config, forward_loop)). My initial model ran at 80 FPS; after quantization it dropped to 40 FPS. I checked the model structure, and it seems like all of my Conv2d layers became QuantConv2d with input and output quantizers as TensorQuantizer modules after the quantize call.
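
For reference, my quantization call looks roughly like this (a simplified sketch; the model and calibration dataloader are placeholders for my own setup):

```python
import torch
import modelopt.torch.quantization as mtq

# Placeholders for my proprietary model and calibration data.
model = build_detection_model().cuda().eval()
calib_loader = build_calibration_dataloader()

def forward_loop(model):
    # Feed calibration batches so the quantizers can collect amax statistics.
    with torch.no_grad():
        for images in calib_loader:
            model(images.cuda())

# PTQ with the recommended INT8 config for CNNs.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```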

Has this been tried for simple FPN or ResNet-type models?

realAsma commented 5 months ago

ModelOpt quantization is fake quantization, which means it only simulates the low-precision computation in PyTorch. Real speedup and memory savings come from exporting the model to a deployment framework. Learn more here.

In this case the model is a CNN, so the quantized model should be deployed to TensorRT.

Please see an example for exporting to ONNX and deploying to TensorRT here - https://github.com/NVIDIA/TensorRT/tree/main/samples/python/detectron2
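
The overall flow is: calibrate/quantize in PyTorch with ModelOpt, export to ONNX, then build a TensorRT engine from that ONNX. A minimal sketch (file names, input shape, and opset are illustrative, not required values):

```python
import torch

# `model` is the ModelOpt fake-quantized CNN returned by mtq.quantize(...).
dummy_input = torch.randn(1, 3, 800, 800, device="cuda")
torch.onnx.export(
    model,
    dummy_input,
    "model_int8.onnx",  # the quantizers are exported as Q/DQ nodes
    opset_version=17,
    input_names=["images"],
    output_names=["outputs"],
)
# Then build the engine from the exported ONNX, e.g.:
#   trtexec --onnx=model_int8.onnx --saveEngine=model_int8.engine --int8
```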

tmagcaya commented 5 months ago

Got it. Is there a difference between TensorRT's own quantization and this type of quantization for CNN models, then? I think for CNNs we can only use the INT8_DEFAULT_CFG quantization config.

realAsma commented 5 months ago

Simulated quantization from ModelOpt enables two things:

  1. Calibration of the CNN model for real quantization with TensorRT: it collects weight and activation statistics (such as amax) to compute quantization scales. These scales are carried over during ONNX export and used when building the quantized TensorRT engine.
  2. Quantization-aware training (QAT): QAT can improve model accuracy beyond PTQ, and ModelOpt's simulated quantization lets you perform it. The inference speedup is still obtained by deploying the quantized model to TensorRT (see the sketch below).
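
In code, the QAT flow is roughly the same PTQ call followed by ordinary fine-tuning of the fake-quantized model (a sketch; the model, forward_loop, loss function, and dataloader are placeholders for your existing training setup):

```python
import torch
import modelopt.torch.quantization as mtq

# 1. PTQ first: insert quantizers and calibrate (same call as before).
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# 2. QAT: fine-tune the fake-quantized model with the usual training loop,
#    typically for a small fraction of the original training schedule.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
model.train()
for images, targets in train_loader:
    optimizer.zero_grad()
    loss = compute_loss(model(images.cuda()), targets)
    loss.backward()
    optimizer.step()
```
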
tmagcaya commented 5 months ago

I have yet to try QAT, but for PTQ here are the results I got. I'm confused; can you help me understand why INT8 ran at a lower FPS than FP16 even though it was quantized to a lower precision?

- Model without compression: 21.8 mAP, 40.1 FPS
- Model with TensorRT: 21.8 mAP, 75.8 FPS
- Model with TRT + FP16 through engine config params, no ModelOpt: 0.15 mAP, 132 FPS
- Model with ModelOpt INT8 + TRT: 0.1 mAP, 95 FPS

cjluo-omniml commented 5 months ago

@tmagcaya could you share your model architecture so that we can reproduce your result?

realAsma commented 5 months ago

My apologies for the performance degradation. Our speedup analysis for modelopt.torch.quantization has been focused on LLMs (deployed via TensorRT-LLM) and diffusion models (deployed via TRT).

It is quite possible for quantized models other than LLMs or diffusion models to be slower than TRT's un-quantized baseline. In this case, we recommend exporting the PyTorch model to ONNX first and then quantizing the ONNX graph via modelopt.onnx.quantization. Please see examples for quantizing an ONNX graph here.

Could you please try out quantizing the ONNX graph using modelopt.onnx.quantization instead?

However, one limitation of this approach is that modelopt.onnx.quantization does not support QAT. I recommend trying PTQ with modelopt.onnx.quantization first; if the accuracy is acceptable, QAT is not needed (QAT can improve accuracy, but does not give additional speedup over PTQ).

tmagcaya commented 5 months ago

I'm sorry the architecture is proprietary, but I'll try to replicate the issue on an open architecture if I can find some time in the next few weeks.

korkland commented 3 months ago

Got it. Is there a difference between TensorRT's own quantization and this type of quantization for CNN models, then? I think for CNNs we can only use the INT8_DEFAULT_CFG quantization config.

Hi, I'm kind of in the same situation: I have a proprietary model that I'm aiming to PTQ and build with TensorRT, and I'm experiencing the same degradation with INT8 vs. FP16 on a RetinaNet-based model using the ModelOpt PyTorch API. Even when I use the ONNX API, the improvement is poor compared to FP16. Have you found a better approach to quantization that gives better latency with more control over the quantization process (not the implicit one)?

riyadshairi979 commented 2 months ago

Even when I use the ONNX API, the improvement is poor compared to FP16.

Is this RetinaNet (retinanet-9) similar to your base model? It compiles to TensorRT 10 successfully. Quantize the model using ModelOpt and compile it with trtexec:

$ python -m modelopt.onnx.quantization --onnx_path=retinanet-9.onnx --quantize_mode=int8
$ trtexec --onnx=retinanet-9.quant.onnx --saveEngine=retinanet-9.quant.engine --best

We observe a 1.7x latency reduction with retinanet-9.quant.engine compared to the FP16 TensorRT engine.

Raj-vivid commented 1 month ago

Having the same issue here. I tried benchmarking my quantized ONNX model and it doesn't seem to provide any speedup; instead it's slower. Not sure what I am doing wrong. It's just a pretrained ConvNeXt V2 (nano). It has LayerNorm, but I would presume that quantizing the conv and linear layers to INT8 should provide a boost.

tmagcaya commented 1 month ago

This might be relevant to this discussion: https://forums.developer.nvidia.com/t/trt-engin-in-int8-is-much-slower-than-fp16/193755/4

Raj-vivid commented 1 month ago

This might be relevant to this discussion: https://forums.developer.nvidia.com/t/trt-engin-in-int8-is-much-slower-than-fp16/193755/4

Setting builder_config.set_flag(trt.BuilderFlag.FP16) did increase my inference speed a bit, but it's a very small increase and still slower than FP32. I was expecting the model to be faster. I can see that by default activations are not quantized; maybe that's why. Did you end up setting both FP16 and INT8? Weirdly, with only FP16 set I get slightly faster inference than with both INT8 and FP16 set, yet it's always slower than the baseline.
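
For reference, this is roughly how I build the engine and set the precision flags (a sketch; the ONNX path and output name are placeholders for my ConvNeXt V2 export):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Explicit batch is the default in recent TensorRT versions.
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)

# Parse the ModelOpt-quantized ONNX (it already contains Q/DQ nodes).
with open("convnextv2_nano_int8.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
# Enable both precisions so layers that cannot run in INT8 fall back to FP16 instead of FP32.
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)

engine_bytes = builder.build_serialized_network(network, config)
with open("convnextv2_nano_int8.engine", "wb") as f:
    f.write(engine_bytes)
```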

tmagcaya commented 3 weeks ago

@realAsma Finally got around to testing on a public dataset. Please see the MNIST quantization results in the attached notebook. For some reason, the frames per second drop after quantization...

MNIST Benchmarking Notes: Mnistoptimization.md

fp32 (no ModelOpt quantization):

- fp16: 24437.6 qps
- best: 25888 qps

Int8 quant with ModelOpt:

- fp16: 22496 qps
- best: 22073.8 qps