LiqunMa / FBI-LLM

FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation

Correct calculation of wscale and wbias. #1

Open agoncharenko1992 opened 3 weeks ago

agoncharenko1992 commented 3 weeks ago

Hi, thanks for the work!

I have a question regarding the calculation of alpha and beta coefficients (wscale and wbias in terms of code).

In the paper, they are computed and applied according to formulas (3), (4), and (5). In the code, however, they appear to be mixed up.

Is this an error in the paper? I suppose so, but I am asking just in case.

LiqunMa commented 1 week ago

Hi, sorry for the late reply.

Thank you very much for your interest in our work!

In formula (3), α and β are learnable parameters. Before training, we initialize α and β with formulas (4) and (5) to speed up the model's convergence. So formulas (4) and (5) are used only once, to compute the initial values of α and β. During training, α and β are updated along with the other model parameters based on the gradients.

Therefore, I believe that the formulas in the paper are consistent with the code implementation ^_^
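The initialize-once-then-train scheme described above can be sketched roughly as follows. This is a toy illustration in plain Python, not the repository's code: the squared-error loss, the learning rate, the single column of data, and the placeholder initial values (standing in for whatever formulas (4) and (5) produce) are all assumptions made for the example.

```python
# Toy sketch: alpha and beta get an initial value once, then are updated by
# gradient descent like any other parameter. Everything here is hypothetical.

def loss_and_grads(alpha, beta, b, t):
    """Mean squared error of alpha * b + beta against t, with its gradients."""
    m = len(b)
    residuals = [alpha * bi + beta - ti for bi, ti in zip(b, t)]
    loss = sum(r * r for r in residuals) / m
    grad_alpha = 2.0 * sum(r * bi for r, bi in zip(residuals, b)) / m
    grad_beta = 2.0 * sum(residuals) / m
    return loss, grad_alpha, grad_beta

# One binarized column (+1 / -1) and a full-precision target column (made up).
b = [1.0, -1.0, 1.0, -1.0]
t = [0.9, -0.3, 0.5, -0.7]

# Step 1: initialize once. These numbers are placeholders for the values that
# the initialization formulas would give; they are not computed from the paper.
alpha, beta = 0.5, 0.0
initial_loss, _, _ = loss_and_grads(alpha, beta, b, t)

# Step 2: from here on the initialization is never recomputed; only gradient
# updates change alpha and beta.
lr = 0.1  # assumed learning rate
for _ in range(20):
    _, grad_alpha, grad_beta = loss_and_grads(alpha, beta, b, t)
    alpha -= lr * grad_alpha
    beta -= lr * grad_beta

final_loss, _, _ = loss_and_grads(alpha, beta, b, t)
print(alpha, beta, initial_loss, final_loss)
```

A good initialization only shortens the path to a low loss; the final values of α and β are determined by training.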

agoncharenko1992 commented 6 days ago

Thank you for the reply! :hand_with_index_finger_and_thumb_crossed:

I understand that α and β are learnable parameters; my main question is about their initialization, and I am a little confused:

  1. In formula (4), you compute \beta_j = \frac{1}{m}\sum_{i=1}^{m}|W_{i,j}^{f} - a_j|, where a_j is the column-wise mean.
  2. But in the code, this parameter is called wscale.
  3. In formula (3), you add \beta_j to the other term.
  4. In the code, you use the previously calculated wscale as a multiplicative factor.

So \beta_j, as defined in formula (4), is actually the scale factor. Am I correct?
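To make the point concrete, here is a minimal sketch in plain Python of the two column-wise statistics discussed above. It is not taken from the repository: the function name `column_stats` and the example matrix are hypothetical. It shows that the mean absolute deviation from formula (4) does not change when a constant is added to a column, which is how a scale behaves, while the column-wise mean a_j moves with that constant, which is how an offset behaves.

```python
# Hypothetical helper, written only for this illustration.

def column_stats(W):
    """Return (column means, column mean absolute deviations) of an m x n matrix."""
    m, n = len(W), len(W[0])
    means = [sum(W[i][j] for i in range(m)) / m for j in range(n)]
    devs = [sum(abs(W[i][j] - means[j]) for i in range(m)) / m for j in range(n)]
    return means, devs

# Made-up full-precision weights: m = 4 rows, n = 2 columns.
W = [
    [1.0, -2.0],
    [3.0, 0.0],
    [5.0, 2.0],
    [7.0, 4.0],
]
means, devs = column_stats(W)

# Add the same constant to every entry and recompute both statistics.
shifted = [[w + 10.0 for w in row] for row in W]
means_shifted, devs_shifted = column_stats(shifted)

print(means, devs)                  # [4.0, 1.0] [2.0, 2.0]
print(means_shifted, devs_shifted)  # [14.0, 11.0] [2.0, 2.0]
```

The deviation is also non-negative by construction, which fits its use as a multiplicative factor on the ±1 weights.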