When looking at the code, I'm a bit confused about the scaling factors. Take Llama-2's 4096 hidden dimension with block_size=128 as an example: in the code, the salient, non-salient-1, and non-salient-2 groups in each 4096x128 block are scaled in `high_order_residual`, and each group has its own 4096x1 scaling factor. So the total number of scaling factors for a 4096x4096 matrix is 3 x 4096 x (4096/128)?
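Just to make my arithmetic explicit, here is how I'm counting them (a back-of-the-envelope sketch; the three-group split per block is my reading of `high_order_residual`, not something the paper states in these terms):

```python
hidden = 4096
block_size = 128
num_groups = 3  # salient, non-salient-1, non-salient-2 (my reading of the split)

num_blocks = hidden // block_size       # 32 column blocks in a 4096x4096 matrix
alphas_per_block = num_groups * hidden  # each group keeps a 4096x1 scaling factor
total = alphas_per_block * num_blocks
print(total)  # 3 * 4096 * 32 = 393216
```

That is 393,216 float scaling factors for a single 4096x4096 weight matrix, which is what prompted my question about the storage overhead.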
What does the storing bit represent, and why is the average calculated in this way?
If we use such fine-grained scaling factors for acceleration, is it still possible to perform the computation (e.g., GEMM) in low-bit and then dequantize the result back to actual values using the scaling factors?
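To illustrate what I mean: since each scaling factor applies to one output row within a group, dequantizing after the low-bit GEMM should be equivalent to dequantizing the weights first. A toy numpy sketch (made-up shapes, sign binarization standing in for the actual quantizer):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))  # stand-in for one 4096x128 weight block
x = rng.standard_normal(16)       # activation slice for this block

alpha = np.abs(W).mean(axis=1, keepdims=True)  # 8x1 per-row scaling factor
B = np.sign(W).astype(np.int8)                 # low-bit (binary) weights

# Dequantize the weights first, then GEMM (the "actual value" path):
y_ref = (alpha * B) @ x
# GEMM in low-bit, then rescale the result with the scaling factors:
y_lowbit = alpha[:, 0] * (B @ x)

print(np.allclose(y_ref, y_lowbit))  # True
```

So mathematically the rescaling commutes with the matmul; my question is whether the kernel/implementation side actually exploits this, given how many per-group factors there are.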