01000-you opened this issue 4 months ago
Size and accuracy of the benchmark models with the existing `QuantizeDequantizeWeightsPass` (PTQ) and with GPTQ, at 8-bit and 4-bit weight precision:

| Model | NumParams | FP32 size | INT4 size | Baseline | PTQ-W8 | GPTQ-W8 | PTQ-W4 | GPTQ-W4 |
|---|---|---|---|---|---|---|---|---|
| DeiT | 5,000,000 | 19.07 MB | 4.77 MB | 0.7202 | 0.7201 | 0.7202 | 0.6466 | 0.6918 |
| EfficientFormer | 12,290,000 | 46.88 MB | 11.72 MB | 0.8018 | 0.8002 | 0.8017 | 0.2023 | 0.77 |
| ResNet18 | 11,689,512 | 44.59 MB | 11.15 MB | 0.6976 | 0.6974 | 0.6973 | 0.5821 | 0.6879 |
| ResNet50 | 25,557,032 | 97.49 MB | 24.37 MB | 0.7615 | 0.7607 | 0.7611 | 0.5821 | 0.7557 |
| RegNet400mf | 4,344,144 | 16.57 MB | 4.14 MB | 0.7403 | 0.7395 | 0.7404 | 0.3613 | 0.7194 |
| ResNeXt50 | 25,028,904 | 95.48 MB | 23.87 MB | 0.7761 | 0.7758 | 0.7763 | 0.6559 | 0.7686 |
| Wide ResNet50 | 68,883,240 | 262.77 MB | 65.69 MB | 0.7848 | 0.7849 | 0.7847 | 0.7114 | 0.7801 |
| VGG16 | 138,357,544 | 527.79 MB | 131.95 MB | 0.7159 | 0.7156 | 0.7158 | 0.4644 | 0.6992 |
| SqueezeNet | 1,248,424 | 4.76 MB | 1.19 MB | 0.581 | 0.5796 | 0.5803 | 0.3335 | 0.5609 |
| ShuffleNet_x0_5 | 1,366,792 | 5.21 MB | 1.30 MB | 0.6055 | 0.6021 | 0.6043 | 0.1033 | 0.3634 |
The results in https://github.com/Samsung/ONE/issues/13480#issuecomment-2270215801 show that GPTQ is effective for 4-bit weight quantization. For 8-bit, the current PTQ works well for all benchmark models.
Do you have a plan to support 4-bit weight quantization?
> Do you have a plan to support 4-bit weight quantization?

@01000-you , this was asked several months ago. It would help if you could provide some information.
@01000-you , @lemmaa , @jinevening and I had a short talk about this task, and we have some concerns about this work:
- What is your future plan for the `record-hessian` tool and for adding this feature to `circle-quantizer`?
- Does this provide a practical advantage when used in `circle-quantizer` with real models, like those from our VD customers?
- Can the experiment results in https://github.com/Samsung/ONE/issues/13480#issuecomment-2270215801 be reproduced with draft #13585?
@jinevening I apologize for the delayed response. As you mentioned, there is not much benefit for 8-bit quantization. However, we have considered supporting 4-bit quantization in the future. Recently, weight quantization has been pushed to even lower bit-widths than 4 bits. While most models show significant accuracy degradation when quantized to 4-bit, GPTQ significantly mitigates this issue, and the GPTQ algorithm has become the de facto standard in the weight-quantization domain. Therefore, we propose making it an optional method once 4-bit quantization is supported, while keeping 8-bit as the default for now.
@seanshpark GPTQ can be applied to convolution and FC layers only. For now, we only support regular Conv2D and FC layers for any model. The op sets that are not supported yet will be covered in the same way as the existing circle-quantizer flow. We haven't experimented with models from VD customers, but if you suggest one, we will run experiments.
> Can the experiment results in #13480 (comment) be reproduced with draft #13585?

- The experiment shows the result is 4-bit quantized. I would like to view the model with Netron.
What we did was fake quantization, similar to how `QuantizeDequantizeWeightsPass` works. We then evaluated the result using `onecc-infer`. Therefore, you can only visualize the model with fake-quantized fp32 values in Netron.

You can reproduce it with `circle-quantizer --quantize_dequantize_weights_with_gptq float32 uint4 channel --config ...`
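To make "fake quantization" concrete, below is a minimal, self-contained sketch (not the actual luci pass code) of per-channel uint4 quantize-dequantize: each weight is rounded onto a 4-bit grid and immediately mapped back to fp32, so the stored tensor stays float32, which is why Netron still shows float values. The asymmetric scheme and the tensor layout are assumptions for illustration only.

```cpp
// Illustrative sketch only, not the actual QuantizeDequantizeWeightsPass code.
// Per-channel asymmetric uint4 fake quantization: quantize, then immediately
// dequantize back to fp32, so only the precision of the values changes.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

void fake_quantize_uint4_per_channel(std::vector<std::vector<float>> &weights)
{
  constexpr float qmin = 0.0f;
  constexpr float qmax = 15.0f; // uint4 range: 0..15

  for (auto &channel : weights) // weights[channel][element]
  {
    const auto mm = std::minmax_element(channel.begin(), channel.end());
    const float wmin = std::min(*mm.first, 0.0f);
    const float wmax = std::max(*mm.second, 0.0f);

    float scale = (wmax - wmin) / (qmax - qmin);
    if (scale == 0.0f)
      scale = 1.0f; // all-zero channel; avoid division by zero
    const float zero_point = std::round(qmin - wmin / scale);

    for (auto &w : channel)
    {
      float q = std::round(w / scale + zero_point); // quantize
      q = std::min(std::max(q, qmin), qmax);        // clamp to uint4
      w = (q - zero_point) * scale;                 // dequantize (fake quant)
    }
  }
}

int main()
{
  std::vector<std::vector<float>> w = {{-0.8f, 0.1f, 0.5f}, {0.02f, -0.03f, 0.07f}};
  fake_quantize_uint4_per_channel(w);
  for (const auto &ch : w)
    std::printf("%.4f %.4f %.4f\n", ch[0], ch[1], ch[2]);
  return 0;
}
```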
> The op sets that are not supported yet will be covered in the same way as the existing circle-quantizer flow.

OK. I understand that GPTQ will quantize Conv2D and FC, and other nodes will follow the existing quantization flow.
Why does the `compiler/luci/pass/src/QuantizeWeightsWithGPTQPass.cpp` file process other Ops?
> What we did was fake quantization, similar to how `QuantizeDequantizeWeightsPass` works.

So there is no 4-bit quantized model?

> You can reproduce it with `circle-quantizer --quantize_dequantize_weights_with_gptq float32 uint4 channel --config ...`

I'm not good at quantization. Please provide a full description.
> Why does the `compiler/luci/pass/src/QuantizeWeightsWithGPTQPass.cpp` file process other Ops?

Even if GPTQ is not applied to the other layers, their weights still need to be quantized in this pass. So, `QuantizeWeightsWithGPTQPass.cpp` applies the same process as `QuantizeDequantizeWeightsPass` to those layers.
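Schematically, the dispatch described above looks like the sketch below. This is not the real luci pass code; the `Node` struct and the two helper functions are placeholders used only to show that Conv2D/FC weights take the GPTQ path while every other weight tensor still gets the plain quantize-dequantize treatment.

```cpp
// Schematic sketch of QuantizeWeightsWithGPTQPass's control flow as described
// in this thread; types and helpers are placeholders, not the luci API.
#include <string>
#include <vector>

struct Node
{
  std::string op;             // e.g. "Conv2D", "FullyConnected", "DepthwiseConv2D"
  std::vector<float> weights; // flattened weight tensor (illustrative)
};

// GPTQ path: Hessian-aware rounding for supported ops (stub here).
void quantize_with_gptq(std::vector<float> &w) { (void)w; }
// Fallback path: same quantize-dequantize as QuantizeDequantizeWeightsPass (stub here).
void quantize_dequantize_plain(std::vector<float> &w) { (void)w; }

void quantize_weights_with_gptq_pass(std::vector<Node> &graph)
{
  for (auto &node : graph)
  {
    if (node.op == "Conv2D" || node.op == "FullyConnected")
      quantize_with_gptq(node.weights); // ops supported by GPTQ
    else
      quantize_dequantize_plain(node.weights); // so the whole model is still weight-quantized
  }
}
```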
> So there is no 4-bit quantized model?

I will send the 4-bit model via email.
> I'm not good at quantization. Please provide a full description.

```
circle-quantizer \
  --quantize_dequantize_weights_with_gptq float32 uint4 channel \
  <input_model_path> <output_model_path> \
  --input_data <input_data_path>
```
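A note on the arguments, assuming the new option mirrors the existing `--quantize_dequantize_weights` option of `circle-quantizer`: the three values after the flag would be the input model dtype (`float32`), the target weight dtype (`uint4`), and the granularity (`channel`, i.e. per-channel), while `--input_data` would point to representative inputs used to collect the activation statistics (the Hessian) that GPTQ needs. This reading is an assumption to be confirmed against draft #13585.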
> I will send the 4-bit model via email.

You don't need to send it via email. I don't want to create additional channels for discussion.
> `circle-quantizer --quantize_dequantize_weights_with_gptq float32 uint4 channel <input_model_path> <output_model_path> --input_data <input_data_path>`

I'd like to try well-known models with the current draft #13585.
For `--input_data <input_data_path>`, how do I produce this data file for IV3? (IV3 model from https://www.tensorflow.org/lite/guide/hosted_models?hl=ko)

For any issue, please add a comment.
## What
We propose supporting the GPTQ algorithm, a state-of-the-art post-training quantization (PTQ) method that has demonstrated robust performance, effectively compressing weights. Notably, GPTQ shows significant efficacy with quantization levels down to 4 bits and even 3 bits on a group-wise basis.
Paper: https://arxiv.org/abs/2210.17323
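For reference, the core of GPTQ is a column-by-column quantization of the weight matrix with Hessian-weighted error compensation. The sketch below is a simplified illustration (no blocking, no Cholesky decomposition, no lazy batch updates) and assumes the inverse Hessian `Hinv` computed from calibration activations (H = 2XXᵀ) and per-row scales are provided; it is not the implementation proposed in this issue.

```cpp
// Simplified GPTQ weight update: quantize the columns of W one by one and fold
// the rounding error of each column into the not-yet-quantized columns,
// weighted by the inverse Hessian. Hinv is assumed to be precomputed.
#include <algorithm>
#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<float>>; // row-major: [rows][cols]

// Symmetric 4-bit quantize-dequantize of a single value with a per-row scale.
float quant_dequant_int4(float w, float scale)
{
  float q = std::round(w / scale);
  q = std::max(-8.0f, std::min(7.0f, q)); // int4 range
  return q * scale;
}

// W: [out_features][in_features], Hinv: [in_features][in_features],
// scales: one quantization scale per output row (per-channel).
void gptq_quantize(Matrix &W, const Matrix &Hinv, const std::vector<float> &scales)
{
  const size_t rows = W.size();
  const size_t cols = W.empty() ? 0 : W[0].size();

  for (size_t j = 0; j < cols; ++j)
  {
    const float d = Hinv[j][j]; // diagonal entry for column j
    for (size_t r = 0; r < rows; ++r)
    {
      const float w = W[r][j];
      const float q = quant_dequant_int4(w, scales[r]);
      W[r][j] = q;

      // Error compensation: spread the rounding error of column j over the
      // remaining (unquantized) columns, scaled by Hinv[j][k] / Hinv[j][j].
      const float err = (w - q) / d;
      for (size_t k = j + 1; k < cols; ++k)
        W[r][k] -= err * Hinv[j][k];
    }
  }
}
```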
## Why

## How

### Method

### Overview