Closed yyfcc17 closed 9 months ago
As you stated, we can only use int in CUTLASS. After max-min asymmetric quantization of the input (where the zero point is the smallest element of the input), we get uint values, so we have to shift the resulting uints to int.
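A minimal sketch of the scheme described above, in plain Python rather than the project's actual kernel code. The function names, the `bits` parameter, and the rounding and clamping details are assumptions made for illustration only:

```python
def quantize_uint_then_shift(values, bits=4):
    """Min-max asymmetric quantization to unsigned codes, then a half-range shift.

    Hypothetical helper: the zero point is the smallest input value (kept in
    floating point), as described in the reply above.
    """
    half_range = 1 << (bits - 1)      # 8 for 4-bit, 128 for 8-bit
    q_max = (1 << bits) - 1           # 15 for 4-bit, 255 for 8-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / q_max or 1.0  # fall back to 1.0 for constant input
    # Unsigned codes: the smallest input maps to 0, the largest to q_max.
    q_uint = [min(q_max, max(0, round((v - lo) / scale))) for v in values]
    # Half-range shift: move [0, q_max] to [-half_range, half_range - 1].
    q_int = [q - half_range for q in q_uint]
    return q_int, scale, lo


def dequantize_shifted(q_int, scale, lo, bits=4):
    """Undo the half-range shift, then map the codes back to floating point."""
    half_range = 1 << (bits - 1)
    return [scale * (q + half_range) + lo for q in q_int]
```

Note that dequantization has to add the half range back before applying the scale, which is the extra step the rest of this thread discusses.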
Thanks for your reply.
So the remaining question is:
Can we use int instead of uint for asymmetric activation quantization? The zero point doesn't have to represent the smallest element; it can be some value in between, so that we don't need to introduce the half-range shift at all?
Yes, it is possible. One just needs to change the activation quantization/dequantization code. The shift has a negligible effect on latency, so are you suggesting getting rid of the constant shift only for the sake of simplicity?
When you use int instead of uint for activation quantization, the halfRange shift is not needed, which simplifies the inference process and, of course, also makes inference faster.
What do you mean by "The shift has a negligible effect on the latency"? Does it make inference faster or slower?
I mean that one won't notice a difference in inference latency with or without the shift. I agree that it would simplify the process.
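For comparison, a sketch of the signed-only variant proposed in this exchange, where the codes land directly in the signed range and the zero point is an integer somewhere inside it. This is an illustration under assumed names (`quantize_signed`, `dequantize_signed`) and assumed rounding and clamping rules, not the repository's implementation:

```python
def quantize_signed(values, bits=4):
    """Asymmetric quantization straight to signed codes, with no half-range shift.

    Hypothetical helper following f = s * (q - z), with q in
    [-2**(bits-1), 2**(bits-1) - 1] and an integer zero point z.
    """
    q_min = -(1 << (bits - 1))                  # -8 for 4-bit
    q_max = (1 << (bits - 1)) - 1               #  7 for 4-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (q_max - q_min) or 1.0  # fall back to 1.0 for constant input
    # Choose z so that the smallest input maps to q_min.
    zero_point = q_min - round(lo / scale)
    q = [min(q_max, max(q_min, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_signed(q, scale, zero_point):
    """Map signed codes back to floating point: f = s * (q - z)."""
    return [scale * (qi - zero_point) for qi in q]
```

Here the dequantization is a single subtract and multiply; the constant offset is carried by `zero_point` instead of a separate shift.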
I see, the halfRange makes the math a little complicated. Thanks for the clarification.
As far as I understand, the half range is used to convert uint to int or int to uint, but we can use only int, even in the asymmetric quantization setting.
We can use int4/int8 for both symmetric and asymmetric quantization; the difference is whether the zero_point is set to zero or not.
In the basic equation,
f = s(q - z)
q can be int4/int8 in asymmetric quantization; with a nonzero z, we can still quantize f to [-8, 7] / [-128, 127].
If we use only int, the kernel implementation becomes simpler and easier to understand.
Is there any special reason to use uint?
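The relationship between the two conventions can be written out from the basic equation above. The symbols q_u and z_u (unsigned code and zero point), q_i and z_i (signed code and zero point), and h (the half range, 2^(b-1) for b-bit codes) are notation introduced here for illustration:

$$
f = s\,(q_u - z_u) = s\,\big((q_u - h) - (z_u - h)\big) = s\,(q_i - z_i), \qquad q_i = q_u - h,\quad z_i = z_u - h
$$

For int4, h = 8, so q_u in [0, 15] corresponds to q_i in [-8, 7]; for int8, h = 128, so q_u in [0, 255] corresponds to q_i in [-128, 127]. Under this reading, the half-range shift can be absorbed into the zero point without changing the dequantized value f.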