Open agoncharenko1992 opened 3 weeks ago
Hi, sorry for the late reply.
Thank you very much for your interest in our work!
In formula (3), α and β are learnable parameters. We initialize them before training with formulas (4) and (5) to speed up the model's convergence, so formulas (4) and (5) are used only once, to compute the initial values of α and β. During training, α and β are updated along with the other model parameters based on the gradients.
Therefore, I believe that the formulas in the paper are consistent with the code implementation ^_^
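To make the "initialize once, then learn by gradient" pattern concrete, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions: `init_params` uses a placeholder min/max rule and is NOT formulas (4) and (5) from the paper, the names `scale` and `offset` are hypothetical and deliberately not mapped to α or β, and the toy data and learning rate are made up for the example.

```python
def init_params(values):
    """One-time initialization from data statistics.

    Placeholder rule (assumption): range and minimum of the data.
    This is NOT the paper's formulas (4) and (5).
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) or 1.0  # avoid a zero scale on constant data
    offset = lo
    return scale, offset


def train(values, targets, scale, offset, lr=0.1, steps=500):
    """Gradient descent on the mean squared error of scale * x + offset."""
    n = len(values)
    for _ in range(steps):
        grad_scale = 0.0
        grad_offset = 0.0
        for x, t in zip(values, targets):
            err = scale * x + offset - t
            grad_scale += 2 * err * x / n
            grad_offset += 2 * err / n
        # From here on, the parameters are driven only by gradients;
        # the initialization rule is never applied again.
        scale -= lr * grad_scale
        offset -= lr * grad_offset
    return scale, offset


# Toy data (assumption): targets generated by 2 * x + 1.
values = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]

scale0, offset0 = init_params(values)  # computed once, before training
scale1, offset1 = train(values, targets, scale0, offset0)
```

After training, the parameters no longer need to satisfy the initialization rule, which is why values read from a trained checkpoint would not be expected to match what the initialization formulas produce.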
Thank you for the reply! :hand_with_index_finger_and_thumb_crossed:
I understand that α and β are learnable parameters; my main question is about their initialization, and I am a little confused:
So \beta_j is actually the scale factor. Am I correct?
Hi, thanks for the work!
I have a question regarding the calculation of the alpha and beta coefficients (wscale and wbias in the code).
In the article, they are applied and calculated using formulas (3), (4), and (5). However, in the code, they are mixed up.
Is this an error in the paper? I suppose so, but I am asking just in case.