Closed james-imi closed 5 months ago
Why would you want to save it in the first place?
I believe you can get model state using model.state_dict()
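A minimal sketch of the `state_dict()` suggestion above, using a hypothetical small model as a stand-in for the quantized one (saving the weights dict rather than pickling the whole module):

```python
import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the quantized model in question.
model = nn.Sequential(nn.Linear(4, 2))

# Save only the weights (the state_dict), not the whole pickled module.
torch.save(model.state_dict(), "model_weights.pth")

# To restore, rebuild the same architecture and load the weights back in.
restored = nn.Sequential(nn.Linear(4, 2))
restored.load_state_dict(torch.load("model_weights.pth"))
restored.eval()
```

Note that for an actually-quantized model you must apply the same quantization preparation to the fresh module before `load_state_dict`, since the state dict only matches a module with the same structure.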
So you don't have to quantize it every time you load it?

You don't want to use eager PyTorch inference mode on a quantized model. Inference this way would be horribly slow. I mean, you can, and it works, but I strongly suggest not doing it this way.
A quantized model in PyTorch is actually doing "fake quantization": the weights are stored as floats, and additional quantize/dequantize layers are added on top to "pretend" the model is quantized. This is necessary for quantization-aware training and model calibration. But in practice you want to export the final quantized model to ONNX and then to TensorRT or OpenVINO, which know how to handle such a model and build a truly optimized quantized model from it.
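To make the "fake quantization" idea concrete, here is a plain-Python sketch of the quantize/dequantize round-trip those extra layers perform; the scale and zero-point values are illustrative assumptions, not anything taken from the actual model:

```python
# Sketch of the quantize/dequantize round-trip that "fake quantization"
# layers perform. Scale and zero-point are illustrative assumptions.

def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a float to an int8-range integer."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # clamp to the int8 range

def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map the integer back to a float; this is what the model stores."""
    return (q - zero_point) * scale

scale, zero_point = 0.05, 0
x = 1.2345
x_fake_quant = dequantize(quantize(x, scale, zero_point), scale, zero_point)
# x_fake_quant is still a float, but snapped onto the quantization grid
```

The tensors stay in float the whole time, which is why eager-mode inference on such a model gains nothing; only an engine like TensorRT or OpenVINO replaces this with real integer arithmetic.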
@BloodAxe Hi, thanks for the info. So using super-gradients' predict()
is not the go-to for local CPU server inference, is that it?
For quantization-aware training, what would be the recommended steps? Is it still okay to do QAT and then use super-gradients' predict()
for this?
It would work, but it is not an efficient way of running inference on a quantized model. Please check these notebooks to see how you can use TensorRT or ONNXRuntime for model inference: https://github.com/Deci-AI/super-gradients/blob/03c445c0cc42743c66aa166b2e47a11a7cfc0eda/notebooks/YoloNAS_Inference_using_TensorRT.ipynb https://github.com/Deci-AI/super-gradients/blob/03c445c0cc42743c66aa166b2e47a11a7cfc0eda/src/super_gradients/examples/model_export/models_export.ipynb
@BloodAxe any function in the repo for post processing ONNX results?
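The linked export notebooks cover decoding in detail; as a rough illustration of the kind of post-processing involved, here is a dependency-free sketch of confidence filtering plus greedy non-maximum suppression. It assumes each raw detection is `[x1, y1, x2, y2, score]`, which may not match the actual YOLO-NAS ONNX output layout, so treat it as illustrative only:

```python
# Minimal detection post-processing sketch: score threshold + greedy NMS.
# Assumes each detection is [x1, y1, x2, y2, score] (an assumption, not
# the guaranteed YOLO-NAS ONNX output format).

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def postprocess(detections, score_thresh=0.25, iou_thresh=0.5):
    """Drop low-confidence boxes, then suppress overlapping duplicates."""
    kept = []
    candidates = sorted(
        (d for d in detections if d[4] >= score_thresh),
        key=lambda d: d[4],
        reverse=True,
    )
    for det in candidates:
        if all(iou(det[:4], k[:4]) < iou_thresh for k in kept):
            kept.append(det)
    return kept

dets = [
    [0, 0, 10, 10, 0.9],
    [1, 1, 11, 11, 0.8],   # heavily overlaps the first box
    [50, 50, 60, 60, 0.7],
    [0, 0, 5, 5, 0.1],     # below the score threshold
]
result = postprocess(dets)
```

With these inputs, the overlapping 0.8 box and the 0.1 box are dropped, leaving the two distinct detections.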
💡 Your Question
After doing quantization, how do I save the quantized model (not as ONNX)?
Versions
No response