IST-DASLab / QUIK

Repository for the QUIK project, enabling the use of 4bit kernels for generative inference
Apache License 2.0
167 stars, 12 forks

Why is there a half-range shift? #9

Closed: yyfcc17 closed this issue 9 months ago

yyfcc17 commented 10 months ago

As I understand it, the half-range shift converts between uint and int, but we could use int only, even in the asymmetric quantization setting.

We can use int4/int8 for both symmetric and asymmetric quantization; the only difference is whether zero_point is set to zero or not.

in the basic equation,

f = s(q - z)

q can be int4/int8 in asymmetric quantization: as long as z is allowed to be nonzero, we can quantize f to [-8, 7] (int4) or [-128, 127] (int8).

Using only int would make the kernel implementation simpler and easier to understand.

Is there any special reason to use uint?
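The signed-integer scheme described in this comment can be sketched in a few lines of plain Python. This is only an illustration of f = s(q - z) with q restricted to [-8, 7]; the function names `quantize_signed` and `dequantize_signed` are hypothetical and are not taken from the QUIK code base.

```python
def quantize_signed(values, bits=4):
    """Asymmetric min-max quantization straight onto a signed integer grid.

    Hypothetical helper for illustration only (not QUIK's implementation).
    Returns the integer codes q, the scale s, and the integer zero point z
    such that f is approximately s * (q - z).
    """
    q_min = -(1 << (bits - 1))       # -8 for int4
    q_max = (1 << (bits - 1)) - 1    #  7 for int4
    lo, hi = min(values), max(values)
    # Guard against a constant input, where hi == lo would give scale 0.
    scale = (hi - lo) / (q_max - q_min) or 1.0
    # Choose z so that the smallest input maps to q_min.
    zero_point = q_min - round(lo / scale)
    codes = [
        min(q_max, max(q_min, round(v / scale) + zero_point))
        for v in values
    ]
    return codes, scale, zero_point


def dequantize_signed(codes, scale, zero_point):
    """Invert the mapping: f = s * (q - z)."""
    return [scale * (q - zero_point) for q in codes]


if __name__ == "__main__":
    activations = [-1.0, -0.2, 0.0, 0.5, 2.0]
    q, s, z = quantize_signed(activations)
    print("codes:", q, "scale:", s, "zero point:", z)
    print("reconstructed:", dequantize_signed(q, s, z))
```

Note that z itself may fall outside the int4 range when the inputs are all positive or all negative; in this sketch it is kept as an ordinary Python integer alongside the scale.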

ilmarkov commented 10 months ago

As you stated, we can only use int in CUTLASS. After min-max asymmetric quantization of the input (where the zero point is the smallest element of the input) we get uint values, so we have to shift the resulting uints into the int range.
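A minimal sketch of the flow described in this reply, assuming 4-bit activations: min-max quantization yields unsigned codes in [0, 15], and subtracting the half range (8) moves them into [-8, 7]. The helper names are hypothetical, and the code is not copied from QUIK.

```python
def quantize_unsigned(values, bits=4):
    """Min-max quantization with the minimum as the zero point.

    Hypothetical helper: returns unsigned codes in [0, 2**bits - 1],
    the scale, and the minimum value used as the offset.
    """
    levels = (1 << bits) - 1          # 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant input
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def shift_to_signed(unsigned_codes, bits=4):
    """Apply the half-range shift: [0, 15] becomes [-8, 7] for 4 bits."""
    half_range = 1 << (bits - 1)
    return [q - half_range for q in unsigned_codes]


def dequantize_shifted(signed_codes, scale, lo, bits=4):
    """Undo the shift during dequantization: f = s * (q + half_range) + min."""
    half_range = 1 << (bits - 1)
    return [scale * (q + half_range) + lo for q in signed_codes]


if __name__ == "__main__":
    activations = [-1.0, -0.2, 0.0, 0.5, 2.0]
    q_u, s, lo = quantize_unsigned(activations)
    q_i = shift_to_signed(q_u)
    print("unsigned:", q_u)
    print("signed:  ", q_i)
    print("reconstructed:", dequantize_shifted(q_i, s, lo))
```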

yyfcc17 commented 10 months ago

Thanks for your reply.

So the remaining question is:

Can we use int instead of uint for asymmetric activation quantization? The zero point doesn't have to represent the smallest element; it can be some value in between, so we wouldn't need to introduce the half-range shift at all.

ilmarkov commented 9 months ago

Yes, it is possible. One just needs to change the activation quantization/dequantization code. The shift has a negligible effect on latency, so are you suggesting removing the constant shift only for the sake of simplicity?

yyfcc17 commented 9 months ago

If activations are quantized to int instead of uint, the halfRange shift is not needed, which simplifies the inference process and, of course, also makes inference faster.

What do you mean by "The shift has a negligible effect on the latency"? Does it make inference faster or slower?

ilmarkov commented 9 months ago

I mean that one won't notice any difference in inference latency with or without the shift. I agree that removing it would simplify the process.

yyfcc17 commented 9 months ago

I see. The halfRange shift just makes the math a little more complicated. Thanks for the clarification.
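To make the extra math concrete, here is a hedged sketch of a single dot product between a symmetrically quantized weight row and a shifted, asymmetrically quantized activation vector. It assumes activations are reconstructed as x = s_x * (q + halfRange) + min, as discussed above; all names are hypothetical, and this is not QUIK's kernel code. Expanding the product shows that the shift only adds a constant multiplied by the sum of the weight codes, which can be computed once per weight row.

```python
def dot_with_half_range(w_codes, w_scale, x_codes, x_scale, x_min, bits=4):
    """Dot product of quantized weights and shifted quantized activations.

    Hypothetical sketch. Weights:      w = w_scale * w_codes
                         Activations:  x = x_scale * (x_codes + half_range) + x_min
    so  w . x = w_scale * (x_scale * sum(w_q * x_q)
                           + (x_scale * half_range + x_min) * sum(w_q)).
    """
    half_range = 1 << (bits - 1)
    # Pure integer accumulation: the part an integer GEMM would compute.
    int_acc = sum(w * q for w, q in zip(w_codes, x_codes))
    # Sum of weight codes; independent of the activations.
    w_sum = sum(w_codes)
    return w_scale * (x_scale * int_acc + (x_scale * half_range + x_min) * w_sum)


def dot_reference(w_codes, w_scale, x_codes, x_scale, x_min, bits=4):
    """Reference: dequantize both operands first, then multiply in floats."""
    half_range = 1 << (bits - 1)
    return sum(
        (w_scale * w) * (x_scale * (q + half_range) + x_min)
        for w, q in zip(w_codes, x_codes)
    )


if __name__ == "__main__":
    w_q = [3, -2, 7, -8]
    x_q = [-8, 0, 7, -3]
    print(dot_with_half_range(w_q, 0.05, x_q, 0.2, -1.0))
    print(dot_reference(w_q, 0.05, x_q, 0.2, -1.0))
```

With a signed zero point chosen directly, as proposed earlier in the thread, the `x_scale * half_range` term would be folded into the zero point, leaving one fewer constant to track.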