Samsung / ONE

On-device Neural Engine

[circle-quantizer] Support GPTQ #13480

Open 01000-you opened 4 months ago

01000-you commented 4 months ago

What

Why

How

Method

Overview

[figure: GPTQ overview diagram]

01000-you commented 3 months ago

Experimental results

| Model | NumParams | FP32 size | INT4 size | Baseline acc. | PTQ-W8 acc. | GPTQ-W8 acc. | PTQ-W4 acc. | GPTQ-W4 acc. |
|---|---|---|---|---|---|---|---|---|
| DeiT | 5,000,000 | 19.07MB | 4.77MB | 0.7202 | 0.7201 | 0.7202 | 0.6466 | 0.6918 |
| EfficientFormer | 12,290,000 | 46.88MB | 11.72MB | 0.8018 | 0.8002 | 0.8017 | 0.2023 | 0.77 |
| ResNet18 | 11,689,512 | 44.59MB | 11.15MB | 0.6976 | 0.6974 | 0.6973 | 0.5821 | 0.6879 |
| ResNet50 | 25,557,032 | 97.49MB | 24.37MB | 0.7615 | 0.7607 | 0.7611 | 0.5821 | 0.7557 |
| RegNet400mf | 4,344,144 | 16.57MB | 4.14MB | 0.7403 | 0.7395 | 0.7404 | 0.3613 | 0.7194 |
| ResNeXt50 | 25,028,904 | 95.48MB | 23.87MB | 0.7761 | 0.7758 | 0.7763 | 0.6559 | 0.7686 |
| Wide ResNet50 | 68,883,240 | 262.77MB | 65.69MB | 0.7848 | 0.7849 | 0.7847 | 0.7114 | 0.7801 |
| Vgg16 | 138,357,544 | 527.79MB | 131.95MB | 0.7159 | 0.7156 | 0.7158 | 0.4644 | 0.6992 |
| SqueezeNet | 1,248,424 | 4.76MB | 1.19MB | 0.581 | 0.5796 | 0.5803 | 0.3335 | 0.5609 |
| ShuffleNet_x0_5 | 1,366,792 | 5.21MB | 1.30MB | 0.6055 | 0.6021 | 0.6043 | 0.1033 | 0.3634 |
jinevening commented 3 months ago

The result in https://github.com/Samsung/ONE/issues/13480#issuecomment-2270215801 shows that GPTQ is effective for 4-bit weight quantization. For 8 bits, the current PTQ works well on all benchmark models.

Do you have a plan to support 4 bit weight quantization?

seanshpark commented 2 weeks ago

Do you have a plan to support 4 bit weight quantization?

@01000-you, this was asked several months ago. It would help if you provide some information.

seanshpark commented 2 weeks ago

@01000-you, @lemmaa, @jinevening and I had a short talk about this task, and we have some concerns about this work.

What is your future plan for the record-hessian tool and for adding this feature to circle-quantizer?

Does this provide a practical advantage when used in circle-quantizer with real models, such as those from our VD customers?

seanshpark commented 2 weeks ago

Can the experiment results in https://github.com/Samsung/ONE/issues/13480#issuecomment-2270215801 be reproduced with draft #13585?

01000-you commented 2 weeks ago

@jinevening I apologize for the delayed response. As you mentioned, there is not much benefit for 8-bit quantization. However, we have considered supporting 4-bit quantization in the future. Recently, weight quantization has been pushed to 4 bits and even lower. While most models show significant performance degradation when quantized to 4 bits, GPTQ significantly mitigates this issue; the GPTQ algorithm has become the de facto standard in the weight-quantization domain. Therefore, we propose that it be available as an option once 4-bit quantization is supported, while 8-bit remains the default for now.
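
For context, the core of GPTQ is a column-by-column weight update: each column is snapped to the quantization grid, and the resulting error is redistributed to the not-yet-quantized columns using a Hessian estimated from calibration data. Below is a minimal NumPy sketch of that update (simplified, no blocking or grouping; the function and parameter names are illustrative and not ONE APIs):

```python
import numpy as np

def gptq_fake_quantize(W, H, qmin=0, qmax=15, damp=0.01):
    """W: [out, in] fp32 weight matrix of an FC (or im2col'd Conv) layer.
    H:  [in, in] Hessian, proportional to the sum of x x^T over calibration
        inputs (the kind of statistic a record-hessian step would collect)."""
    W = W.astype(np.float64).copy()
    rows, cols = W.shape

    # Per-output-channel asymmetric scale / zero-point on the 4-bit grid.
    wmin = np.minimum(W.min(axis=1), 0.0)
    wmax = np.maximum(W.max(axis=1), 0.0)
    scale = np.where(wmax > wmin, (wmax - wmin) / (qmax - qmin), 1.0)
    zp = np.round(qmin - wmin / scale)

    def qdq(w):
        # Quantize then dequantize one column (fake quantization).
        q = np.clip(np.round(w / scale) + zp, qmin, qmax)
        return (q - zp) * scale

    # Dampened inverse Hessian, upper-triangular Cholesky factor.
    Hd = H + damp * np.mean(np.diag(H)) * np.eye(cols)
    Hinv = np.linalg.cholesky(np.linalg.inv(Hd)).T

    for c in range(cols):
        q = qdq(W[:, c])
        err = (W[:, c] - q) / Hinv[c, c]
        # Key GPTQ step: push the quantization error of this column onto the
        # columns that are still unquantized, weighted by the inverse Hessian.
        W[:, c + 1:] -= np.outer(err, Hinv[c, c + 1:])
        W[:, c] = q
    return W.astype(np.float32)
```

At 8 bits the rounding error per column is small, so this correction changes little; at 4 bits the error redistribution is what recovers most of the accuracy, which is consistent with the W8 vs W4 results above.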

01000-you commented 2 weeks ago

@seanshpark GPTQ can be applied to convolution and FC layers only. For now, we support only regular Conv2D and FC layers for all models. Ops that are not supported yet will be covered in the same way as the existing circle-quantizer. We haven't experimented with models for VD customers, but if you suggest one, we will run the experiments.

01000-you commented 2 weeks ago

Can the #13480 (comment) experiment results be reproduced with draft #13585?

  • The experiment shows the result is 4-bit quantized. I would like to view the model with Netron.

What we did was fake quantization, similar to how QuantizeDequantizeWeightsPass works. We then evaluated the result using onecc-infer. Therefore, in Netron you can only see the fake-quantized fp32 values of the model. You can reproduce it using `circle-quantizer --quantize_dequantize_weights_with_gptq float32 uint4 channel --config ...`
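
Since the weights stay fp32 after the quantize-dequantize step, a quick sanity check (in Netron or with a script) is to count distinct values per output channel: a channel fake-quantized to uint4 can land on at most 16 grid points. A small illustrative helper, assuming a NumPy workflow and not part of ONE:

```python
import numpy as np

def looks_fake_quantized_uint4(W):
    """W: fp32 weight tensor with the output channel as its first axis.
    Returns True if every output channel holds at most 16 distinct values,
    as expected after per-channel uint4 fake quantization."""
    W2 = W.reshape(W.shape[0], -1)
    for row in W2:
        if len(np.unique(row)) > 16:
            return False
    return True
```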

seanshpark commented 2 weeks ago

Ops that are not supported yet will be covered in the same way as the existing circle-quantizer.

OK. I understand that GPTQ will quantize Conv2D and FC, and that other nodes follow the existing quantization flows.

Why does compiler/luci/pass/src/QuantizeWeightsWithGPTQPass.cpp file process other Ops?

seanshpark commented 2 weeks ago

What we did was fake quantization, similar to how QuantizeDequantizeWeightsPass works.

So there is no 4-bit quantized model?

seanshpark commented 2 weeks ago

You can reproduce it using `circle-quantizer --quantize_dequantize_weights_with_gptq float32 uint4 channel --config ...`

I'm not good at quantization. Please provide a full description.

01000-you commented 2 weeks ago

Why does compiler/luci/pass/src/QuantizeWeightsWithGPTQPass.cpp file process other Ops?

Even if GPTQ is not applied to the other layers, their weights still need to be quantized in this process. So QuantizeWeightsWithGPTQPass.cpp performs the same process as QuantizeDequantizeWeightsPass for those layers.
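
In other words, the pass can be pictured as a per-layer dispatch: Conv2D/FC weights with a recorded Hessian go through the GPTQ update, and everything else gets the plain quantize-dequantize treatment. A rough Python-level sketch of that dispatch (the real implementation is the C++ pass; the op-name strings, the `layers`/`hessians` structures, and the reuse of `gptq_fake_quantize` from the sketch earlier in this thread are all assumptions):

```python
# Assumed illustrative op names; luci uses its own circle node types.
GPTQ_TARGET_OPS = {"CONV_2D", "FULLY_CONNECTED"}

def quantize_all_weights(layers, hessians, fallback_qdq):
    """layers: dict name -> (op_type, fp32 weight ndarray).
    hessians: dict name -> Hessian recorded during calibration.
    fallback_qdq: plain per-channel quantize-dequantize function."""
    out = {}
    for name, (op_type, W) in layers.items():
        if op_type in GPTQ_TARGET_OPS and name in hessians:
            out[name] = gptq_fake_quantize(W, hessians[name])  # GPTQ path
        else:
            out[name] = fallback_qdq(W)  # same treatment as the QDQ pass
    return out
```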

So there is no 4-bit quantized model?

I will send the 4-bit model via email.

I'm not good at quantization. Please provide a full description.

```
circle-quantizer \
--quantize_dequantize_weights_with_gptq float32 uint4 channel \
<input_model_path> <output_model_path> \
--input_data <input_data_path>
```

seanshpark commented 6 days ago

I will send the 4-bit model via email.

You don't need to send it by email. I don't want to create additional channels for discussion.

```
circle-quantizer \
--quantize_dequantize_weights_with_gptq float32 uint4 channel \
<input_model_path> <output_model_path> \
--input_data <input_data_path>
```

I'd like to try well-known models with the current draft #13585.

  1. Please rebase the draft onto the current head (master branch).
  2. I'd like to try with the Inception v3 model. I'll add the results here.
  3. There is `--input_data <input_data_path>`. How do I produce this data file for IV3?
  4. How do I know GPTQ is working correctly?

For any issue, please add a comment.


IV3 model from https://www.tensorflow.org/lite/guide/hosted_models?hl=ko