Samsung / ONE

On-device Neural Engine

[circle-quantizer] Support GPTQ #13480

Open 01000-you opened 4 months ago

01000-you commented 4 months ago

What

Why

How

Method

Overview

[figure: GPTQ overview diagram]

01000-you commented 3 months ago

Experimental results

| Model | NumParams | FP32 size | INT4 size | Baseline acc. | PTQ-W8 acc. | GPTQ-W8 acc. | PTQ-W4 acc. | GPTQ-W4 acc. |
|---|---|---|---|---|---|---|---|---|
| DeiT | 5,000,000 | 19.07MB | 4.77MB | 0.7202 | 0.7201 | 0.7202 | 0.6466 | 0.6918 |
| EfficientFormer | 12,290,000 | 46.88MB | 11.72MB | 0.8018 | 0.8002 | 0.8017 | 0.2023 | 0.77 |
| ResNet18 | 11,689,512 | 44.59MB | 11.15MB | 0.6976 | 0.6974 | 0.6973 | 0.5821 | 0.6879 |
| ResNet50 | 25,557,032 | 97.49MB | 24.37MB | 0.7615 | 0.7607 | 0.7611 | 0.5821 | 0.7557 |
| RegNet400mf | 4,344,144 | 16.57MB | 4.14MB | 0.7403 | 0.7395 | 0.7404 | 0.3613 | 0.7194 |
| ResNeXt50 | 25,028,904 | 95.48MB | 23.87MB | 0.7761 | 0.7758 | 0.7763 | 0.6559 | 0.7686 |
| Wide ResNet50 | 68,883,240 | 262.77MB | 65.69MB | 0.7848 | 0.7849 | 0.7847 | 0.7114 | 0.7801 |
| Vgg16 | 138,357,544 | 527.79MB | 131.95MB | 0.7159 | 0.7156 | 0.7158 | 0.4644 | 0.6992 |
| SqueezeNet | 1,248,424 | 4.76MB | 1.19MB | 0.581 | 0.5796 | 0.5803 | 0.3335 | 0.5609 |
| ShuffleNet_x0_5 | 1,366,792 | 5.21MB | 1.30MB | 0.6055 | 0.6021 | 0.6043 | 0.1033 | 0.3634 |
jinevening commented 3 months ago

The result in https://github.com/Samsung/ONE/issues/13480#issuecomment-2270215801 shows that GPTQ is effective for 4-bit weight quantization. For 8 bits, the current PTQ works well on all benchmark models.

Do you have a plan to support 4 bit weight quantization?

seanshpark commented 2 weeks ago

Do you have a plan to support 4 bit weight quantization?

@01000-you, this was asked several months ago. It would help if you provide some information.

seanshpark commented 2 weeks ago

@01000-you, @lemmaa, @jinevening and I had a short talk about this task, and we have some concerns about this work.

What is your future plan for the record-hessian tool and for adding this feature to circle-quantizer?

Does this provide a practical advantage when used in circle-quantizer with real models, such as those from our VD customers?

seanshpark commented 2 weeks ago

Can the experiment results in https://github.com/Samsung/ONE/issues/13480#issuecomment-2270215801 be reproduced with draft #13585?

01000-you commented 2 weeks ago

@jinevening I apologize for the delayed response. As you mentioned, there is not much benefit for 8-bit quantization. However, we have considered supporting 4-bit quantization in the future. Recently, weight quantization has been pushed to 4 bits and even lower. While most models show significant performance degradation when quantized to 4 bits, GPTQ significantly mitigates this issue; the GPTQ algorithm has become the de facto standard in the weight-quantization domain. Therefore, we propose that it be available as an option once 4-bit quantization is supported, while 8-bit remains the default for now.
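
For context, the core of GPTQ is a column-by-column weight update: each column is snapped to the quantization grid, and the resulting error is redistributed to the not-yet-quantized columns using a Hessian estimated from calibration data. Below is a minimal NumPy sketch of that update (simplified, no blocking or grouping; the function and parameter names are illustrative and not ONE APIs):

```python
import numpy as np

def gptq_fake_quantize(W, H, qmin=0, qmax=15, damp=0.01):
    """W: [out, in] fp32 weight matrix of an FC (or im2col'd Conv) layer.
    H:  [in, in] Hessian, proportional to the sum of x x^T over calibration
        inputs (the kind of statistic a record-hessian step would collect)."""
    W = W.astype(np.float64).copy()
    rows, cols = W.shape

    # Per-output-channel asymmetric scale / zero-point on the 4-bit grid.
    wmin = np.minimum(W.min(axis=1), 0.0)
    wmax = np.maximum(W.max(axis=1), 0.0)
    scale = np.where(wmax > wmin, (wmax - wmin) / (qmax - qmin), 1.0)
    zp = np.round(qmin - wmin / scale)

    def qdq(w):
        # Quantize then dequantize one column (fake quantization).
        q = np.clip(np.round(w / scale) + zp, qmin, qmax)
        return (q - zp) * scale

    # Dampened inverse Hessian, upper-triangular Cholesky factor.
    Hd = H + damp * np.mean(np.diag(H)) * np.eye(cols)
    Hinv = np.linalg.cholesky(np.linalg.inv(Hd)).T

    for c in range(cols):
        q = qdq(W[:, c])
        err = (W[:, c] - q) / Hinv[c, c]
        # Key GPTQ step: push the quantization error of this column onto the
        # columns that are still unquantized, weighted by the inverse Hessian.
        W[:, c + 1:] -= np.outer(err, Hinv[c, c + 1:])
        W[:, c] = q
    return W.astype(np.float32)
```

At 8 bits the rounding error per column is small, so this correction changes little; at 4 bits the error redistribution is what recovers most of the accuracy, which is consistent with the W8 vs W4 results above.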

01000-you commented 2 weeks ago

@seanshpark GPTQ can be applied to convolution and FC layers only. For now, we support only regular Conv2D and FC layers for all models. Ops that are not supported yet will be covered in the same way as the existing circle-quantizer. We haven't experimented with models for VD customers, but if you suggest one, we will run the experiments.

01000-you commented 2 weeks ago

Can the #13480 (comment) experiment results be reproduced with draft #13585?

  • The experiment shows the result is 4-bit quantized. I would like to view the model with Netron.

What we did was fake quantization, similar to how QuantizeDequantizeWeightsPass works. We then evaluated the result using onecc-infer. Therefore, in Netron you can only see the fake-quantized fp32 values of the model. You can reproduce it using `circle-quantizer --quantize_dequantize_weights_with_gptq float32 uint4 channel --config ...`
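
Since the weights stay fp32 after the quantize-dequantize step, a quick sanity check (in Netron or with a script) is to count distinct values per output channel: a channel fake-quantized to uint4 can land on at most 16 grid points. A small illustrative helper, assuming a NumPy workflow and not part of ONE:

```python
import numpy as np

def looks_fake_quantized_uint4(W):
    """W: fp32 weight tensor with the output channel as its first axis.
    Returns True if every output channel holds at most 16 distinct values,
    as expected after per-channel uint4 fake quantization."""
    W2 = W.reshape(W.shape[0], -1)
    for row in W2:
        if len(np.unique(row)) > 16:
            return False
    return True
```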

seanshpark commented 2 weeks ago

Ops that are not supported yet will be covered in the same way as the existing circle-quantizer.

OK. I understand that GPTQ will quantize Conv2D and FC, and that other nodes follow the existing quantization flows.

Why does compiler/luci/pass/src/QuantizeWeightsWithGPTQPass.cpp file process other Ops?

seanshpark commented 2 weeks ago

What we did was fake quantization, similar to how QuantizeDequantizeWeightsPass works.

So there is no 4-bit quantized model?

seanshpark commented 2 weeks ago

You can reproduce it using `circle-quantizer --quantize_dequantize_weights_with_gptq float32 uint4 channel --config ...`

I'm not good at quantization. Please provide a full description.

01000-you commented 2 weeks ago

Why does compiler/luci/pass/src/QuantizeWeightsWithGPTQPass.cpp file process other Ops?

Even if GPTQ is not applied to the other layers, their weights still need to be quantized in this process. So QuantizeWeightsWithGPTQPass.cpp performs the same process as QuantizeDequantizeWeightsPass for those layers.
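
In other words, the pass can be pictured as a per-layer dispatch: Conv2D/FC weights with a recorded Hessian go through the GPTQ update, and everything else gets the plain quantize-dequantize treatment. A rough Python-level sketch of that dispatch (the real implementation is the C++ pass; the op-name strings, the `layers`/`hessians` structures, and the reuse of `gptq_fake_quantize` from the sketch earlier in this thread are all assumptions):

```python
# Assumed illustrative op names; luci uses its own circle node types.
GPTQ_TARGET_OPS = {"CONV_2D", "FULLY_CONNECTED"}

def quantize_all_weights(layers, hessians, fallback_qdq):
    """layers: dict name -> (op_type, fp32 weight ndarray).
    hessians: dict name -> Hessian recorded during calibration.
    fallback_qdq: plain per-channel quantize-dequantize function."""
    out = {}
    for name, (op_type, W) in layers.items():
        if op_type in GPTQ_TARGET_OPS and name in hessians:
            out[name] = gptq_fake_quantize(W, hessians[name])  # GPTQ path
        else:
            out[name] = fallback_qdq(W)  # same treatment as the QDQ pass
    return out
```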

So there is no 4-bit quantized model?

I will send the 4-bit model via email.

I'm not good at quantization. Please provide a full description.

```
circle-quantizer \
--quantize_dequantize_weights_with_gptq float32 uint4 channel \
<input_model_path> <output_model_path> \
--input_data <input_data_path>
```

seanshpark commented 6 days ago

I will send the 4-bit model via email.

You don't need to send it by email. I don't want to create additional channels for discussion.

```
circle-quantizer \
--quantize_dequantize_weights_with_gptq float32 uint4 channel \
<input_model_path> <output_model_path> \
--input_data <input_data_path>
```

I'd like to try well-known models with the current draft #13585.

  1. Please rebase the draft onto the current head (master branch).
  2. I'd like to try with the Inception v3 model. I'll add the results here.
  3. There is `--input_data <input_data_path>`. How do I produce this data file for IV3?
  4. How do I know GPTQ is working correctly?

For any issue, please add a comment.


IV3 model from https://www.tensorflow.org/lite/guide/hosted_models?hl=ko