Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

Understanding Quantization results #1158

Open · lpkoh opened this issue 1 year ago

lpkoh commented 1 year ago

💡 Your Question

Hi,

Just checking: I see in the published results that Yolo-NAS-L suffers only a small drop in performance when going to Yolo-NAS-INT8-L. Could you clarify what exactly is meant by Yolo-NAS-L and Yolo-NAS-INT8-L? Is Yolo-NAS-L a model trained with quantization awareness, so that some layers end up quantized and others not, and is Yolo-NAS-INT8-L then the result of actually quantizing that model, with the performance drop being negligible because the quantization-aware training distributed the quantization in a favorable way?

Also, what is the simplest way to do quantization-aware training on my own dataset? Is it doable through a recipe or not?

Versions

No response

shaydeci commented 1 year ago

Yolo-NAS-INT8 variants are quantized with post-training quantization. We have added plenty of material on how to perform quantization in SG:

- QAT and PTQ for YoloNAS
- QAT and PTQ
- Quantization aware training YoloNAS on custom dataset notebook
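
Roughly, the PTQ step from those tutorials boils down to the sketch below. Import paths and argument names follow the docs at the time of writing and may differ in your installed SG version, and the COCO loader is just a stand-in for your own calibration data, so treat this as an outline rather than exact API:

```python
from super_gradients.training import models, dataloaders
from super_gradients.training.utils.quantization.selective_quantization_utils import SelectiveQuantizer
from super_gradients.training.utils.quantization.calibrator import QuantizationCalibrator

# Start from a trained (or pretrained) FP32 model.
model = models.get("yolo_nas_s", num_classes=80, pretrained_weights="coco")
calib_loader = dataloaders.coco2017_train()  # swap in a loader over your own training set

# 1) Insert Q/DQ (fake-quantization) modules around the supported layers.
q_util = SelectiveQuantizer()  # calibrator types, per-channel settings, skipped modules, etc. are configurable
q_util.quantize_module(model)

# 2) PTQ: run a handful of batches through the model to calibrate the quantizers.
calibrator = QuantizationCalibrator()
calibrator.calibrate_model(
    model,
    method="percentile",
    calib_data_loader=calib_loader,
    num_calib_batches=16,
)

# 3) QAT is then just a regular (shorter, lower-LR) Trainer.train(...) fine-tuning
#    of this calibrated model, followed by ONNX export -- see the linked notebook
#    for the full recipe.
```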

lpkoh commented 1 year ago

@shaydeci Thank you so much for the response; those resources pretty much explain exactly how one would do it.

Can I, however, ask why QAT is applied after PTQ? The notebook says: "In essence, quantization-aware training is training of the model after PTQ. It can be looked at as fine-tuning of the model with the newly-introduced Q/DQ layers."

I asked ChatGPT: "Why is quantization-aware training done after post-training quantization? Shouldn't it happen during training itself?"

It said: Your understanding is correct that Quantization-Aware Training (QAT) typically happens during the training process itself. QAT simulates the effects of quantization (which is applied after training in Post-Training Quantization) during the training of the neural network itself, so the network can learn to compensate for the effects of quantization and maintain accuracy.

However, in the description you provided, it seems like a Post-Training Quantization (PTQ) is performed first, followed by Quantization-Aware Training (QAT). This sequence is indeed unusual and a bit counter-intuitive, but it's not impossible to have a reason behind it.

One possible interpretation could be that the initial PTQ acts as a sort of "rough cut" to reduce the model's size and computational requirements, with the QAT then fine-tuning the already-quantized model to improve its accuracy. Essentially, the PTQ might be used to get a "quick and dirty" quantized model up and running as soon as possible, with the QAT then being used to refine this initial quantized model.

However, this would not be the typical use case. Usually, you would either use PTQ or QAT, not both. PTQ is simpler and easier to apply, but QAT generally yields better performance because it takes the effects of quantization into account during training.
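
To make sure I follow, here is a toy sketch of how I currently picture that ordering, written with plain PyTorch eager-mode quantization rather than the SG API (the network, shapes, and training loop are made up purely for illustration): first the quantizers are inserted and calibrated (the PTQ part), and QAT is then just continued training of that already-quantized model.

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

# Toy network with the quant/dequant stubs that eager-mode quantization expects.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(8 * 30 * 30, 10)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.relu(self.conv(self.quant(x)))
        x = self.fc(torch.flatten(x, 1))
        return self.dequant(x)

model = TinyNet().train()

# "PTQ" part: attach fake-quant/observer modules and calibrate their ranges.
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
tq.prepare_qat(model, inplace=True)
with torch.no_grad():
    for _ in range(16):                       # calibration batches
        model(torch.randn(8, 3, 32, 32))

# "QAT" part: ordinary fine-tuning, except every forward pass now simulates
# int8 rounding, so the weights adapt to the quantization that is already there.
opt = torch.optim.SGD(model.parameters(), lr=1e-4)
for _ in range(10):                           # short fine-tuning schedule
    loss = model(torch.randn(8, 3, 32, 32)).square().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

int8_model = tq.convert(model.eval())         # final int8 model
```

Is that the right mental model for why the notebook runs PTQ first and QAT second?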

Also, on a practical level, the notebook describes a normal training -> PTQ -> QAT pipeline. Do you have recommendations for the 'normal training' and 'QAT' stages? I see that both use the same dataset, but should they run for the same number of epochs? Should the first stage of normal training use relatively fewer epochs?

lpkoh commented 1 year ago

Also, can I clarify: suppose I go through PTQ and QAT and obtain /content/sg_checkpoints_dir/yolo_nas_s_soccer_players/yolo_nas_s_soccer_players_32x3x640x640_ptq.onnx and /content/sg_checkpoints_dir/yolo_nas_s_soccer_players/yolo_nas_s_soccer_players_32x3x640x640_qat.onnx. When I convert these models to TensorRT, should I still pass the --fp16 or --int8 flag? Does that do anything, or do PTQ and QAT already handle that implicitly?
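
For reference, my current understanding is that those trtexec flags just toggle the corresponding TensorRT builder flags, so the equivalent with the TensorRT Python API would be something like the sketch below (TensorRT 8.x assumed, using the QAT ONNX path from above). Is enabling INT8 here still required or meaningful for an ONNX that already contains Q/DQ nodes?

```python
import tensorrt as trt

onnx_path = "/content/sg_checkpoints_dir/yolo_nas_s_soccer_players/yolo_nas_s_soccer_players_32x3x640x640_qat.onnx"

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open(onnx_path, "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)   # what trtexec --int8 toggles
config.set_flag(trt.BuilderFlag.FP16)   # what trtexec --fp16 toggles

# Build and save the serialized engine.
engine = builder.build_serialized_network(network, config)
with open("yolo_nas_s_soccer_players_qat.engine", "wb") as f:
    f.write(engine)
```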