tmagcaya opened 5 months ago
ModelOpt quantization is fake quantization, which means it only simulates the low-precision computation in PyTorch. Real speedup and memory savings are achieved by exporting the model to a deployment framework. Learn more here.
In this case, since the model is a CNN, the quantized model should be deployed to TensorRT.
Please see an example of exporting to ONNX and deploying to TensorRT here - https://github.com/NVIDIA/TensorRT/tree/main/samples/python/detectron2
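For concreteness, here is a minimal PTQ-and-export sketch in the spirit of the ModelOpt examples; the model, calibration data, and input shape below are placeholders, not your setup:

```python
import torch
import torchvision
import modelopt.torch.quantization as mtq

# Placeholder CNN and calibration batches; substitute your own model/dataloader.
model = torchvision.models.resnet18(weights=None).eval()
calib_data = [torch.randn(8, 3, 224, 224) for _ in range(16)]

# Calibration loop: run representative data through the model so the inserted
# quantizers can collect activation statistics.
def forward_loop(model):
    with torch.no_grad():
        for batch in calib_data:
            model(batch)

# Insert simulated (fake) INT8 quantizers and calibrate.
# This alone does not make the PyTorch model faster.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# Real speedup comes from deployment: export to ONNX, then build a TensorRT engine.
torch.onnx.export(model, torch.randn(1, 3, 224, 224), "model_quant.onnx")
```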
Got it. Is there a difference between TensorRT's own quantization and this type of quantization for CNN models, then? I think for CNNs we can only use the INT8_DEFAULT_CFG quantization config.
The simulated quantization from ModelOpt lets you evaluate the accuracy impact of quantization and run QAT in PyTorch; the actual speedup still comes from deployment (e.g. via TensorRT).
I have yet to try QAT, but here are the results I got for PTQ. I'm confused - can you help me understand why int8 had a lower FPS than fp16, even though it is quantized to an even lower precision?
| Model | mAP | FPS |
| --- | --- | --- |
| Without compression | 21.8 | 40.1 |
| TensorRT | 21.8 | 75.8 |
| TRT + fp16 via engine config params (no ModelOpt) | 0.15 | 132 |
| ModelOpt int8 + TRT | 0.1 | 95 |
@tmagcaya could you share your model architecture so that we can reproduce your result?
My apologies for the performance degradation. The modelopt.torch.quantization speedup analysis has been focused on LLMs (deployed via TensorRT-LLM) and diffusion models (deployed via TRT).
It is quite possible for quantized models other than LLMs or diffusion models to be slower than TRT's un-quantized baselines. In this case, we recommend exporting the PyTorch model to ONNX first and then quantizing the ONNX graph via modelopt.onnx.quantization. Please see examples of quantizing an ONNX graph here.
Could you please try quantizing the ONNX graph using modelopt.onnx.quantization instead?
One limitation of this approach is that modelopt.onnx.quantization does not support QAT. I recommend trying PTQ with modelopt.onnx.quantization first - if the accuracy is acceptable, QAT is not needed (QAT can improve accuracy, but does not give additional speedup over PTQ).
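For reference, a minimal sketch of that flow (the model, input shape, and opset below are placeholders; the quantization command matches the one quoted later in this thread):

```python
import torch
import torchvision

# 1) Export the (unquantized) PyTorch model to ONNX. Placeholder model/shape.
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "model.onnx", opset_version=17)

# 2) Quantize the ONNX graph (PTQ only; QAT is not supported on this path):
#      python -m modelopt.onnx.quantization --onnx_path=model.onnx --quantize_mode=int8
# 3) Build and benchmark a TensorRT engine from model.quant.onnx, e.g. with trtexec.
```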
I'm sorry the architecture is proprietary, but I'll try to replicate the issue on an open architecture if I can find some time in the next few weeks.
Hi, I'm kind of in the same situation, with a proprietary model that I'm aiming to PTQ and build with TensorRT. I'm experiencing the same degradation with int8 vs. fp16 on a RetinaNet-based model, using the ModelOpt PyTorch API. Even when I use the ONNX API, the improvement is poor compared to fp16. Have you found a better approach to quantization that gives better latency with more control over the quantization process (not the implicit one)?
> Even when I use the ONNX API, the improvement is poor compared to fp16.
Is this RetinaNet similar to your base model? It compiles to TensorRT 10 successfully. Quantize the model using ModelOpt and build the engine with trtexec:
```
$ python -m modelopt.onnx.quantization --onnx_path=retinanet-9.onnx --quantize_mode=int8
$ trtexec --onnx=retinanet-9.quant.onnx --saveEngine=retinanet-9.quant.engine --best
```
We observe a 1.7x latency reduction for retinanet-9.quant.engine compared to the fp16 TensorRT engine.
Having the same issue here. I tried benchmarking my quantized ONNX model and it doesn't seem to provide any speedup; instead it makes it slower. Not sure what I am doing wrong. It's just a pretrained ConvNeXt V2 (nano). It has a LayerNorm, but I would presume that conv and linear layers quantized to int8 should provide a boost.
This might be relevant to this discussion: https://forums.developer.nvidia.com/t/trt-engin-in-int8-is-much-slower-than-fp16/193755/4
Setting builder_config.set_flag(trt.BuilderFlag.FP16) did increase my inference speed a bit, but it's a very small increase and still slower than fp32. I was expecting my model to be faster. I can see that by default activations are not quantized; maybe that's why. Did you end up setting both fp16 and int8? Weirdly, when I set only fp16 I get slightly faster inference than when I set both int8 and fp16, yet it's always slower than the baseline.
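In case it helps, here is a hedged sketch of building the engine from Python with both precisions enabled, so TensorRT can fall back to FP16 kernels where INT8 is slower or unsupported (the paths are hypothetical, and the network-creation flag handling varies by TensorRT version; trtexec --best as quoted above does roughly the same thing):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Explicit-batch flag is required on TensorRT 8.x; on TensorRT 10 explicit batch
# is the default and this flag is deprecated.
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
if not parser.parse_from_file("model.quant.onnx"):  # hypothetical path
    raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # honor the Q/DQ nodes inserted by ModelOpt
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 fallback for non-quantized layers
serialized = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized)
```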
@realAsma Finally got around to testing on a public dataset. Please see the MNIST quantization results in the attached notebook. For some reason, the frames per second drop after quantization...
MNIST Benchmarking Notes: Mnistoptimization.md
I used mtq.INT8_DEFAULT_CFG as recommended for CNN networks (mtq.quantize(model, config, forward_loop)). My initial model ran at 80 FPS; after quantization it dropped to 40 FPS. I checked the model structure, and it seems all of my Conv2d layers became QuantConv2d with input and output quantizers as TensorQuantizer modules after the quantize call.
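As a side note, a quick way to inspect those inserted quantizers (a sketch; `model` is whatever was passed to `mtq.quantize`, and `print_quant_summary` availability may differ across ModelOpt versions):

```python
import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.nn import TensorQuantizer

# Print each quantizer (enabled/disabled, calibrated amax, etc.).
mtq.print_quant_summary(model)

# Or walk the module tree yourself and list the TensorQuantizer instances.
for name, module in model.named_modules():
    if isinstance(module, TensorQuantizer):
        print(name, module)
```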
Has this been tried on simple FPN or ResNet type models?