Open · llCurious opened this issue 2 years ago
Hi @llCurious, thanks for your interest in our work and sorry for the late reply. You are right: the input to the quantization function q_k is always normalized to [0, 1] (see weight normalization and input normalization), and its output is also a quantized value in [0, 1] (see here). Afterwards, the quantized value is scaled back by dequantization for the input (see input rescaling), or for the weight if rescale is True (see weight rescaling).
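For readers skimming the thread, here is a minimal sketch of the flow described above. The function names and the clip value are illustrative assumptions rather than the repo's exact code: the input is clipped and normalized to [0, 1], quantized by a uniform k-bit quantizer, and then rescaled back.

```python
import torch

def q_k(x, k):
    # Uniform k-bit quantizer on [0, 1]; the output stays in [0, 1],
    # restricted to 2^k - 1 evenly spaced levels.
    n = float(2 ** k - 1)
    return torch.round(x * n) / n

def quantize_input(x, k, clip_value=1.0):
    # Illustrative input path: clip, normalize to [0, 1], quantize, rescale back.
    x_norm = torch.clamp(x, 0.0, clip_value) / clip_value  # normalized to [0, 1]
    x_q = q_k(x_norm, k)                                    # quantized, still in [0, 1]
    return x_q * clip_value                                 # dequantized / rescaled
```

The final multiplication is the dequantization/rescaling step mentioned above: it undoes the normalization so downstream layers see values in the original range.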
Thanks for your reply.
The raw weight/input may have a much larger range. For example, the input (perhaps the output of some FC layer) can be significantly large if the layer has around 128 neurons and the data dimension is 1,000. In this case, the magnitude seems to change a lot after normalization.
In addition, could you elaborate on the backward pass for such quantization, or point out where it is elaborated in your paper? (One common convention is sketched after this message.)
I also read some related papers that work on quantization. They seem to use clipping rather than normalization to constrain the input range. Why did you choose this scheme?
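For context on the backward-pass question: many low-bit quantization methods (e.g., DoReFa-Net, whose tanh-based weight normalization is quoted later in this thread) use a straight-through estimator (STE), treating the non-differentiable rounding as the identity during back-propagation. Whether this repo does exactly that is an assumption; a minimal sketch of that convention:

```python
import torch

class QuantizeSTE(torch.autograd.Function):
    """k-bit uniform quantization with a straight-through estimator backward."""

    @staticmethod
    def forward(ctx, x, k):
        n = float(2 ** k - 1)
        return torch.round(x * n) / n  # quantize values already in [0, 1]

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pass the gradient through unchanged, ignoring the fact that
        # the true derivative of round() is zero almost everywhere.
        return grad_output, None
```

Usage would be QuantizeSTE.apply(x, k) in place of a plain rounding call, so the rounding does not block gradient flow during training.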
Hi @llCurious, I do not see a difference between denormalization and dequantization. We also clip the input before scaling and quantization.
The weight passed to q_k is first normalized into [-1, 1] using the non-linear transformation weight = torch.tanh(self.weight) / torch.max(torch.abs(torch.tanh(self.weight))). Do you mean the de-quantization you mentioned above is used to erase the effect of this step (normalization, to me)? By the way, what is the underlying data type for the whole quantization procedure? It seems to be f32 rather than int8. The de-quantization above also multiplies the quantized weight by weight_scale, which is a float number as well.
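Putting the pieces of this exchange together, here is a sketch of the weight path as described above; q_k, rescale, and weight_scale are illustrative stand-ins and may not match the repo's exact code. Note that everything stays in float32 ("fake" quantization): only the set of representable values is restricted, not the storage type.

```python
import torch

def q_k(x, k):
    # Uniform k-bit quantizer on [0, 1] (same sketch as earlier in the thread).
    n = float(2 ** k - 1)
    return torch.round(x * n) / n

def quantize_weight(w, k, rescale=True):
    # Non-linear normalization quoted above: tanh squashes the raw weight,
    # and dividing by max |tanh(w)| maps it into [-1, 1].
    w_norm = torch.tanh(w) / torch.max(torch.abs(torch.tanh(w)))
    # Shift into [0, 1] so the uniform quantizer applies, then map back to [-1, 1].
    w_q = 2.0 * q_k(0.5 * w_norm + 0.5, k) - 1.0
    if rescale:
        # Hypothetical choice of weight_scale for illustration; the actual
        # scale in the repo may be computed differently.
        weight_scale = torch.max(torch.abs(w)).detach()
        w_q = w_q * weight_scale
    return w_q  # still a float32 tensor, just restricted to discrete levels
```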
Hey, your work is well-presented and I just wonder about one detail:
How do you ensure that the input to your quantization function is in the range [0, 1]?
As you mentioned in models/quant_ops (this link), do you require that the input is normalized in advance?