TransformerLensOrg / TransformerLens

A library for mechanistic interpretability of GPT-style language models
https://transformerlensorg.github.io/TransformerLens/
MIT License

About the cached layernorm scale factors #696

Open Meehaohao opened 1 month ago

Meehaohao commented 1 month ago

Question

Regarding the 'apply_ln_to_stack' function in the 'ActivationCache.py' file: what does the following sentence mean? "The layernorm scale is global across the entire residual stream for each layer, batch element and position, which is why we need to use the cached scale factors rather than just applying a new LayerNorm."

That is, why do we need to use the cached scale factors from the original forward pass (in layer_norm, the scale factor is the mean and std), rather than normalizing the variables directly at each layer?

neelnanda-io commented 1 month ago

Because we want to use the scale factor of the FINAL residual stream to scale COMPONENTS of the residual stream, and you can't infer the final norm from partial components.
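
To illustrate this point, here is a minimal numerical sketch in plain PyTorch (not TransformerLens code; all names are made up, and the scale computation mirrors how the LNPre scale is cached, ignoring eps). It shows that dividing each component by the single cached scale keeps the decomposition additive, while running a fresh LayerNorm on each component does not:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 64

# Hypothetical decomposition of the final residual stream into component
# contributions (e.g. embeddings, attention heads, MLP layers): [n_components, d_model].
components = torch.randn(5, d_model)
resid_final = components.sum(dim=0)  # the full residual stream the model actually saw

# Roughly what gets cached as the LN scale (ignoring eps): centre the FULL stream,
# then take the RMS over d_model. One scalar per batch element and position.
resid_centred = resid_final - resid_final.mean(dim=-1, keepdim=True)
cached_scale = resid_centred.pow(2).mean(dim=-1, keepdim=True).sqrt()

# Using the cached scale: every component is divided by the SAME scalar, so the
# scaled components still sum to the normalised final stream.
scaled_with_cache = (components - components.mean(dim=-1, keepdim=True)) / cached_scale
print(torch.allclose(scaled_with_cache.sum(dim=0), resid_centred / cached_scale, atol=1e-5))  # True

# Applying a fresh LayerNorm to each component: each component gets its OWN
# mean/std, so the pieces no longer sum to the normalised final stream.
fresh_ln = F.layer_norm(components, (d_model,))
print(torch.allclose(fresh_ln.sum(dim=0), resid_centred / cached_scale, atol=1e-5))  # False
```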

Meehaohao commented 1 month ago

So, does the scale factor carry any particular meaning? I think it is just a normalization operation, and it depends on the input sentence rather than on parameters the LLM has learned. If so, maybe we could directly normalize the COMPONENTS of the residual stream (for LN, subtract the mean and divide by the standard deviation), which might also be reasonable?
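
For context, this is roughly how the cached scale is meant to be used in practice. The sketch below follows the style of the TransformerLens exploratory-analysis demos; the prompt and "gpt2" are placeholders, and exact argument names may differ between library versions:

```python
from transformer_lens import HookedTransformer

# Placeholder model and prompt, purely for illustration.
model = HookedTransformer.from_pretrained("gpt2")
logits, cache = model.run_with_cache("The Eiffel Tower is in the city of")

# Per-component contributions (embeddings, attention layers, MLP layers) to the
# final residual stream, at the last position.
per_component, labels = cache.decompose_resid(layer=-1, pos_slice=-1, return_labels=True)

# Divide every component by ln_final's CACHED scale from this forward pass.
# Because a single scalar (per batch element and position) divides every
# component, the scaled components still sum to the normalised final residual
# stream, so attribution stays additive. A fresh LayerNorm would compute a
# different scale for each component and break that.
scaled_components = cache.apply_ln_to_stack(per_component, layer=-1, pos_slice=-1)
```

So yes, the scale factor is input-dependent rather than learned, but, as the docstring and the reply above say, it is a property of the whole residual stream at that layer, batch element and position, which is why it has to be reused from the original run rather than recomputed per component.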