Aaronhuang-778 / BiLLM

(ICML 2024) BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
https://arxiv.org/abs/2402.04291
MIT License

scale factor and bit storing calculation #15

Open kaizizzzzzz opened 4 weeks ago

kaizizzzzzz commented 4 weeks ago

Hi, I have two questions about this paper:

  1. Scaling factor

When looking at the code, I'm a bit confused about the scaling factor. Take LLaMA-2's 4096 hidden dimension and block_size=128 as an example:

In the code, the salient, non-salient1, and non-salient2 groups in a block (4096x128) are each scaled in high_order_residual, and each group has a 4096x1 scaling factor. So the total number of scaling factors for a 4096x4096 matrix is 3x4096x(4096/128)?

image
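If I'm reading the grouping correctly, the count works out as in this sketch (my own accounting, assuming one per-row scale for each of the three groups in every 4096x128 block; not taken from the repo's code):

```python
# Sketch: count the per-group, per-row scaling factors for one weight matrix,
# assuming each 4096x128 block is split into salient / non-salient1 /
# non-salient2, and each group stores a 4096x1 scaling factor.
rows, cols, block_size = 4096, 4096, 128
groups_per_block = 3                         # salient, non-salient1, non-salient2

num_blocks = cols // block_size              # 4096 / 128 = 32 column blocks
scales_per_block = groups_per_block * rows   # 3 * 4096 scales per block
total_scales = scales_per_block * num_blocks # 3 * 4096 * (4096 / 128)

print(num_blocks, scales_per_block, total_scales)
```

which matches the 3x4096x(4096/128) figure above if one scale is stored per row, per group, per block.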
  2. What is the meaning of the storing bit, and why is the average calculated like this? image
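For context, my own back-of-the-envelope reading of "average bits" is total stored bits divided by total weights: salient weights cost 2 bits (two binary residuals), non-salient weights cost 1 bit, and every FP16 scaling factor adds 16 bits of overhead. This is only my guess at what the formula in the screenshot is doing, with a placeholder salient ratio:

```python
# Guess at the average-bit accounting (not the repo's exact formula):
# salient weights use 2 bits, non-salient use 1 bit, and each per-row,
# per-group scaling factor costs 16 bits (FP16). salient_ratio is a
# placeholder, not a value from the paper.
rows, cols, block_size = 4096, 4096, 128
salient_ratio = 0.1

n_total = rows * cols
n_salient = int(salient_ratio * n_total)

weight_bits = 2 * n_salient + 1 * (n_total - n_salient)
num_scales = 3 * rows * (cols // block_size)   # scaling-factor count from question 1
scale_bits = 16 * num_scales

avg_bits = (weight_bits + scale_bits) / n_total
print(avg_bits)
```

Under this accounting the scaling factors alone add 16 * 3 / 128 = 0.375 bits per weight, which is why the granularity of the scales matters so much for the average.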

Thanks!

kaizizzzzzz commented 4 weeks ago

If we use such a fine granularity of scaling factors for acceleration, is it still possible to do the computation (e.g., GEMM) in low-bit and then dequantize to actual values using the scaling factors afterwards?
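To make the question concrete, here is a small NumPy sketch of what I mean for the simple case of one scale per row: the per-row scale commutes with the matmul, so the GEMM can run on the 1-bit codes and the dequantization happens after. With three interleaved groups per block, the same trick would need a masked accumulation per group, which is exactly what I'm unsure is practical:

```python
import numpy as np

# Sketch: with a per-row scale, dequantization commutes with the GEMM,
# so the matmul can run on the low-bit codes and the scale is applied after.
rng = np.random.default_rng(0)
rows, cols, batch = 8, 16, 4

W = rng.standard_normal((rows, cols)).astype(np.float32)
x = rng.standard_normal((cols, batch)).astype(np.float32)

scale = np.abs(W).mean(axis=1, keepdims=True)  # (rows, 1) per-row scale
B = np.where(W >= 0, 1, -1).astype(np.int8)    # 1-bit sign codes in {-1, +1}

# Path A (reference): dequantize the weights first, then GEMM.
y_ref = (scale * B) @ x

# Path B: GEMM on the codes, dequantize the output after.
y_fast = scale * (B.astype(np.float32) @ x)

assert np.allclose(y_ref, y_fast, atol=1e-5)
```

The equivalence holds because `scale` broadcasts over rows of the output, i.e. `(scale * B) @ x == scale * (B @ x)` for a row-wise scale; with per-group scales inside a row, the product would instead have to be accumulated per group before scaling.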