Closed james-imi closed 5 months ago
Why would you want to save it in the first place?
I believe you can get model state using model.state_dict()
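A minimal sketch of the `state_dict()` suggestion above, using a hypothetical small model as a stand-in for the quantized one (saving the weights dict rather than pickling the whole module):

```python
import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the quantized model in question.
model = nn.Sequential(nn.Linear(4, 2))

# Save only the weights (the state_dict), not the whole pickled module.
torch.save(model.state_dict(), "model_weights.pth")

# To restore, rebuild the same architecture and load the weights back in.
restored = nn.Sequential(nn.Linear(4, 2))
restored.load_state_dict(torch.load("model_weights.pth"))
restored.eval()
```

Note that for an actually-quantized model you must apply the same quantization preparation to the fresh module before `load_state_dict`, since the state dict only matches a module with the same structure.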
So you don't have to quantize it every time you load it?

You don't want to use eager PyTorch inference mode on a quantized model. Inference this way would be horribly slow. I mean, you can, and it works, but I strongly suggest not doing it this way.
A quantized model in PyTorch is actually doing "fake quantization": the weights are stored as floats, and additional quantize/dequantize layers are added on top to "pretend" the model is quantized. This is necessary for quantization-aware training and model calibration. But in practice you want to export the final quantized model to ONNX and then to TensorRT or OpenVINO, which know how to handle such a model and build a truly optimized quantized model from it.
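To make the "fake quantization" idea concrete, here is a plain-Python sketch of the quantize/dequantize round-trip those extra layers perform; the scale and zero-point values are illustrative assumptions, not anything taken from the actual model:

```python
# Sketch of the quantize/dequantize round-trip that "fake quantization"
# layers perform. Scale and zero-point are illustrative assumptions.

def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a float to an int8-range integer."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # clamp to the int8 range

def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map the integer back to a float; this is what the model stores."""
    return (q - zero_point) * scale

scale, zero_point = 0.05, 0
x = 1.2345
x_fake_quant = dequantize(quantize(x, scale, zero_point), scale, zero_point)
# x_fake_quant is still a float, but snapped onto the quantization grid
```

The tensors stay in float the whole time, which is why eager-mode inference on such a model gains nothing; only an engine like TensorRT or OpenVINO replaces this with real integer arithmetic.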
@BloodAxe Hi, thanks for the info. So using super-gradients' predict()
is not the go-to for local CPU server inference, is that it?
For quantization-aware training, what would be the recommended steps? Is it still okay to do QAT and then use super-gradients' predict()
for this?
It would work, but it is not an efficient way of running inference on a quantized model. Please check these notebooks to see how you can use TensorRT or ONNXRuntime for model inference: https://github.com/Deci-AI/super-gradients/blob/03c445c0cc42743c66aa166b2e47a11a7cfc0eda/notebooks/YoloNAS_Inference_using_TensorRT.ipynb https://github.com/Deci-AI/super-gradients/blob/03c445c0cc42743c66aa166b2e47a11a7cfc0eda/src/super_gradients/examples/model_export/models_export.ipynb
@BloodAxe any function in the repo for post processing ONNX results?
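The linked export notebooks cover decoding in detail; as a rough illustration of the kind of post-processing involved, here is a dependency-free sketch of confidence filtering plus greedy non-maximum suppression. It assumes each raw detection is `[x1, y1, x2, y2, score]`, which may not match the actual YOLO-NAS ONNX output layout, so treat it as illustrative only:

```python
# Minimal detection post-processing sketch: score threshold + greedy NMS.
# Assumes each detection is [x1, y1, x2, y2, score] (an assumption, not
# the guaranteed YOLO-NAS ONNX output format).

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def postprocess(detections, score_thresh=0.25, iou_thresh=0.5):
    """Drop low-confidence boxes, then suppress overlapping duplicates."""
    kept = []
    candidates = sorted(
        (d for d in detections if d[4] >= score_thresh),
        key=lambda d: d[4],
        reverse=True,
    )
    for det in candidates:
        if all(iou(det[:4], k[:4]) < iou_thresh for k in kept):
            kept.append(det)
    return kept

dets = [
    [0, 0, 10, 10, 0.9],
    [1, 1, 11, 11, 0.8],   # heavily overlaps the first box
    [50, 50, 60, 60, 0.7],
    [0, 0, 5, 5, 0.1],     # below the score threshold
]
result = postprocess(dets)
```

With these inputs, the overlapping 0.8 box and the 0.1 box are dropped, leaving the two distinct detections.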
💡 Your Question
After doing quantization, how do I save the quantized model (not as ONNX)?
Versions
No response