Samsung / ONE

On-device Neural Engine

[circle-quantizer] Minimum MSE Quantization #12184

Open SlavikMIPT opened 7 months ago

SlavikMIPT commented 7 months ago

What

Let's support calculating the scale using MSE minimization (see part 3.4).

Why

We can improve the accuracy of quantized models (especially low-bit quantized ones) by finding a better approximation of each tensor, i.e. computing the scale factor with an MSE-minimization approach.

How

TBD:
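A minimal sketch of the idea, assuming a brute-force search (the candidate range and step count below are arbitrary illustration choices, not a committed design): try candidate scales around the usual max-abs scale, quantize with each, and keep the one with the lowest reconstruction error.

```python
import numpy as np

def min_mse_quantize(w, bits, num_candidates=200):
    """Pick the scale minimizing the MSE between w and scale * round(w / scale)."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for int4, 127 for int8
    base = np.abs(w).max() / qmax              # scale the current algorithm uses
    best_scale, best_q, best_err = base, None, np.inf
    for s in np.linspace(0.5 * base, 1.5 * base, num_candidates):
        q = np.clip(np.round(w / s), -qmax, qmax)  # quantize with candidate scale
        err = np.mean((w - s * q) ** 2)            # reconstruction MSE
        if err < best_err:
            best_scale, best_q, best_err = s, q.astype(int), err
    return best_scale, best_q, best_err
```

A real implementation could replace the grid search with a closed-form scale update or a coarser-to-finer search; the sketch only fixes the objective being minimized.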

SlavikMIPT commented 7 months ago

@Torrero

SlavikMIPT commented 7 months ago

@jinevening Can you please give your opinion about this feature?

jinevening commented 7 months ago

Could you explain the background of this work? Are you going to support int4 quantization? Or is it for better 8-bit quantization?

SlavikMIPT commented 7 months ago

> Could you explain the background of this work? Are you going to support int4 quantization? Or is it for better 8-bit quantization?

I am planning to support int4, but this method can also be applied to 8-bit quantization to improve accuracy.

jinevening commented 7 months ago

So, your main goal is to implement int4 quantization. May I ask what its use case is? What is the target backend?

SlavikMIPT commented 7 months ago

> So, your main goal is to implement int4 quantization. May I ask what its use case is? What is the target backend?

Microcontrollers: int4 will let us reduce the binary size of models.

jinevening commented 7 months ago

Do you have any target model or accuracy results? one-quantize's quantization algorithm does not work well in int4.

SlavikMIPT commented 5 months ago

> Do you have any target model or accuracy results? one-quantize's quantization algorithm does not work well in int4.

I ran some tests; here are the results.

I quantized the following float weight array for the POC:

[0.0027225911617279053, 0.18983474373817444, 0.41336789727211, -0.18240013718605042, 0.2007804811000824, -0.19718044996261597, -0.06138917803764343, -0.12958911061286926, -0.12484398484230042, -0.4296090304851532, -0.3490271270275116, 0.3468421995639801, 0.3684578835964203, 0.18269559741020203, 0.23875799775123596, -0.2323986440896988]

The current quantization algorithm gives the following scale and values for this channel:

int4:

scale = 0.06137271970510483, values = [0, 3, 7, -3, 3, -3, -1, -2, -2, -7, -6, 6, 6, 3, 4, -4]

int8:

scale = 0.0033827482257038355, values = [1, 56, 122, -54, 59, -58, -18, -38, -37, -127, -103, 103, 109, 54, 71, -69]

If we multiply the scale by the values and calculate the MSE against the reference float array:

int4_mse = 0.010758214646791673
int8_mse = 0.0008263211069635848
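For reference, the baseline figures can be reproduced with a few lines of NumPy. Two observations that are my own reading of the numbers, not statements from the thread: the quoted error values match the root-mean-square error rather than the plain mean of squared errors, and the quoted scales look like float32 results, so trailing digits may differ slightly from this float64 sketch.

```python
import numpy as np

# The 16 float weights from the array above.
w = np.array([
     0.0027225911617279053,  0.18983474373817444,   0.41336789727211,
    -0.18240013718605042,    0.2007804811000824,   -0.19718044996261597,
    -0.06138917803764343,   -0.12958911061286926,  -0.12484398484230042,
    -0.4296090304851532,    -0.3490271270275116,    0.3468421995639801,
     0.3684578835964203,     0.18269559741020203,   0.23875799775123596,
    -0.2323986440896988])

for bits in (4, 8):
    qmax = 2 ** (bits - 1) - 1          # 7 for int4, 127 for int8
    scale = np.abs(w).max() / qmax      # current (max-abs) scale
    q = np.round(w / scale)             # quantized integer values
    err = np.sqrt(np.mean((w - scale * q) ** 2))
    print(f"int{bits}: scale={scale}, error={err}")
```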

Using the MSE-minimization approach, I got the following scale and values:

int4:

scale = 0.060275191359340924, values = [0, 3, 7, -3, 3, -3, -1, -2, -2, -7, -6, 6, 6, 3, 4, -4]

int8:

scale = 0.003798802578209628, values = [1, 50, 109, -48, 53, -52, -16, -34, -33, -113, -92, 91, 97, 48, 63, -61]

int4_mse_opt = 0.009682772748301009
int8_mse_opt = 0.0005851636840414105

So we get a more precise approximation essentially for free. The difference in this case is more significant for 8-bit quantization (41% MSE improvement); for 4-bit quantization the MSE improvement is 11% on this test model.
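A side note on why the int4 result above keeps the same integer values but a different scale: once the integer vector $q$ is fixed, minimizing the reconstruction error over the scale alone is a one-parameter least-squares problem with the closed-form solution

$$s^{*} = \frac{\sum_i w_i q_i}{\sum_i q_i^2}$$

Plugging in the 16 weights and the int4 values above gives $\sum_i w_i q_i \approx 17.6004$ and $\sum_i q_i^2 = 292$, so $s^{*} \approx 17.6004 / 292 \approx 0.060275$, matching the reported optimized int4 scale. (Whether the PR computes it this way or via a search is not stated here.)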

SlavikMIPT commented 5 months ago

I think we should first implement this optimization algorithm in the quantizer; it significantly improves the precision of the value representation without overhead.

SlavikMIPT commented 5 months ago

Tested #12582 on the tflite-micro hello_world example model and got the following results, measured as average deviation (%) compared to the float model:

| Variant | Average deviation |
| --- | --- |
| int8 current | 0.34% |
| int8 MSE | 0.48% |
| int4 current | 8.3% |
| int4 MSE | 4% |

As we can see, for int8 we got some degradation on this model (though on some models I saw an improvement), while for int4 I got a significant improvement using the minimum-MSE quantization approach. So I propose adding this algorithm as an option as a first step. @jinevening what do you think?

jinevening commented 5 months ago

It looks fine to add the MSE algorithm, but we may need some refactoring.

It seems that #12582 does not implement int4, so how did you test int4?

SlavikMIPT commented 5 months ago

> It looks fine to add the MSE algorithm, but we may need some refactoring.
>
> It seems that #12582 does not implement int4, so how did you test int4?

I hardcoded the int8 version and limited it to 4 bits as a proof of concept.
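A sketch of what that PoC plausibly looks like (my guess at the shape of the change, not the actual patch): the existing symmetric int8 path with the integer range clamped to what int4 can represent.

```python
import numpy as np

INT4_MAX = 7                                     # used instead of INT8_MAX = 127

def quantize_poc(w):
    """Reuse the int8-style flow, but clamp values to the 4-bit range."""
    scale = np.abs(w).max() / INT4_MAX
    q = np.clip(np.round(w / scale), -INT4_MAX, INT4_MAX).astype(np.int8)
    return scale, q                              # 4-bit values stored in int8

scale, q = quantize_poc(np.array([0.1, -0.4, 0.25, -0.05]))
print(scale, q)
```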

jinevening commented 5 months ago

Could you make a full draft before landing PRs? That is how we typically work when introducing a new feature.

First, make a full draft without considering code quality too much and measure the benefit from the new feature. Here, "full draft" means that others can reproduce the result with the draft. "Benefit" in this case would be the performance improvement on microcontrollers.

Second, discuss how to land the draft.

Third, review and merge.

jinevening commented 5 months ago

> Microcontrollers: int4 will let us reduce the binary size of models.

CC @chunseoklee