tmagcaya opened 1 month ago
ModelOpt quantization is fake quantization: it only simulates the low-precision computation in PyTorch. Real speedup and memory savings are achieved by exporting the model to a deployment framework. Learn more here
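To illustrate the idea, here is a minimal pure-Python sketch of what fake (simulated) quantization does: values are snapped to an int8 grid but stored and computed in floating point, so there is no speed or memory benefit until the model is exported to a real backend. This is a toy illustration, not ModelOpt's actual implementation.

```python
def fake_quantize(x, num_bits=8):
    """Quantize-dequantize a list of floats with a symmetric per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    amax = max(abs(v) for v in x) or 1.0    # calibration: absolute max of the tensor
    scale = amax / qmax
    # round to the integer grid, clamp to the int8 range, then map back to float
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in x]

weights = [0.5, -1.27, 0.031, 2.54]
print(fake_quantize(weights))  # every output lies on a 127-level grid, still float
```

The output tensor is still full-precision floats; only the set of representable values has shrunk, which models the accuracy impact of int8 without any of its runtime savings.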
In this case, the model is a CNN, so the quantized model should be deployed to TensorRT.
Please see an example for exporting to ONNX and deploying to TensorRT here - https://github.com/NVIDIA/TensorRT/tree/main/samples/python/detectron2
Got it. Is there a difference between TensorRT quantization and this type of quantization for CNN models, then? I think for CNNs we can only use the int8 default config quantization method.
The simulated quantization from modelopt allows:
I have yet to try QAT, but for PTQ here are the results I got. I'm confused - can you help me understand why int8 had a lower FPS than fp16, even though int8 is an even lower precision?
| Configuration | mAP | FPS |
| --- | --- | --- |
| Model without compression | 21.8 | 40.1 |
| Model with TensorRT | 21.8 | 75.8 |
| Model with TRT + fp16 (engine config params, no ModelOpt) | 0.15 | 132 |
| Model with ModelOpt int8 + TRT | 0.1 | 95 |
@tmagcaya could you share your model architecture so that we can reproduce your result?
My apologies for the performance degradation. modelopt.torch.quantization speedup analysis has been focused on LLMs (deployed via TensorRT-LLM) and diffusion models (deployed via TRT).
It is quite possible for quantized models other than LLMs or diffusion models to be slower than TRT's un-quantized baselines. In this case, we recommend exporting the PyTorch model to ONNX first and then quantizing the ONNX graph via modelopt.onnx.quantization. Please see examples of quantizing an ONNX graph here.
Could you please try quantizing the ONNX graph using modelopt.onnx.quantization instead?
One limitation of this approach is that modelopt.onnx.quantization does not support QAT. I recommend starting with PTQ via modelopt.onnx.quantization - if the accuracy is acceptable, QAT is not needed (QAT can improve accuracy, but gives no additional speedup over PTQ).
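For reference, a minimal sketch of what that might look like. The `quantize` entry point, parameter names, and file paths below are assumptions based on the linked examples, not a verified API - check the ModelOpt documentation for the exact signature:

```python
# Hypothetical sketch of ONNX PTQ with modelopt.onnx.quantization.
# Paths, parameter names, and the calibration shape are placeholders.
import numpy as np

# Calibration inputs shaped like the model's input. A few hundred real
# samples are typical; random data here is only for illustration.
calibration_data = np.random.rand(32, 3, 224, 224).astype(np.float32)

from modelopt.onnx.quantization import quantize  # assumed entry point

quantize(
    onnx_path="model.onnx",            # placeholder: exported ONNX file
    quantize_mode="int8",
    calibration_data=calibration_data,
    output_path="model_int8.onnx",     # placeholder: quantized output
)
```

The quantized ONNX file would then be built into a TensorRT engine as usual.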
I'm sorry the architecture is proprietary, but I'll try to replicate the issue on an open architecture if I can find some time in the next few weeks.
I used mtq.INT8_DEFAULT_CFG as recommended for CNN networks (mtq.quantize(model, config, forward_loop)). My initial model ran at 80 FPS; after quantization it dropped to 40 FPS. I checked the model structure, and it seems all of my Conv2d layers became QuantConv2d with input and output quantizers as TensorQuantizer after the quantize call.
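For context, a hedged sketch of the PTQ flow described above - the tiny model and random calibration batches are placeholders standing in for the real network and data loader:

```python
# Sketch of the mtq.quantize PTQ call described above. The model and
# calibration batches are placeholders; substitute the real network and
# a loader of representative inputs.
import torch
import modelopt.torch.quantization as mtq

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
).eval()
calib_batches = [torch.randn(1, 3, 32, 32) for _ in range(8)]

def forward_loop(model):
    # run calibration data through the model so quantizer ranges are collected
    with torch.no_grad():
        for batch in calib_batches:
            model(batch)

# replaces Conv2d modules with QuantConv2d carrying TensorQuantizers;
# in PyTorch these extra quantize-dequantize ops add overhead, which is
# why the fake-quantized model can run slower than the original
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```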
Has this been tried on simple FPN or ResNet-type models?