ChrisFuscoMasters / TransformerLib

Facilitates my research

Research LayerNorm #10

Open · CJcool06 opened 1 year ago

CJcool06 commented 1 year ago

**Aim**

Find out what layer norm actually does (i.e. its benefits and limitations) and why/how it is applied to transformers.
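For reference, a minimal sketch of standard LayerNorm over a token's feature dimension. This is plain PyTorch for illustration, not the TransformerLib implementation; shapes and epsilon are illustrative:

```python
import torch

def layer_norm(x: torch.Tensor, gain: torch.Tensor, bias: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Standard LayerNorm over the last dimension: re-centre, re-scale, then apply learnable gain/bias."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gain * x_hat + bias

# Example: normalise a batch of 4 token embeddings of width 8.
x = torch.randn(4, 8)
gain, bias = torch.ones(8), torch.zeros(8)
out = layer_norm(x, gain, bias)
print(out.mean(dim=-1), out.var(dim=-1, unbiased=False))  # per-token mean ~0, variance ~1
```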

**Plan**

CJcool06 commented 1 year ago

Tentative:

Understanding and Improving Layer Normalization find that the derivatives of the mean and variance in layer normalisation (LN) matter more than the forward normalisation itself, because they re-centre and re-scale the backward gradients. They also find that the learnable parameters of LN (the bias and gain) increase the risk of overfitting and bring no benefit in most cases. They propose Adaptive Normalisation (AdaNorm), which drops the bias and gain in favour of an adaptive, input-dependent scaling, and it outperforms LN on 7/8 datasets.
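To make the AdaNorm idea concrete, a rough sketch: normalise without bias/gain, then rescale each normalised unit by an adaptive factor C(1 − k·y). The choices k = 0.1, C = 1, and detaching the adaptive scale during backprop are assumptions based on the paper's description, not TransformerLib code:

```python
import torch

def ada_norm(x: torch.Tensor, k: float = 0.1, C: float = 1.0, eps: float = 1e-5) -> torch.Tensor:
    """AdaNorm sketch: normalise over the last dimension with no learnable bias/gain,
    then rescale each unit by an adaptive factor C * (1 - k * y).
    k and C are assumed hyperparameter values, not taken from this repo."""
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, unbiased=False, keepdim=True)
    y = (x - mean) / (std + eps)
    # Detach the adaptive scale so it acts as a constant during backprop
    # (an assumption mirroring how the paper treats the scaling term).
    scale = (C * (1.0 - k * y)).detach()
    return scale * y

x = torch.randn(4, 8)
print(ada_norm(x).shape)  # same shape as the input
```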

On Layer Normalization in the Transformer Architecture show that with the layer normalisation (LN) placement of the vanilla (Post-LN) Transformer, the expected gradients of the parameters near the output layer are large at initialisation, so training with a large learning rate on them is unstable, which is why a warm-up stage is needed. They propose moving LN inside the residual blocks, before the multi-head attention and before the FFN (Pre-LN), and verify that with this change the learning-rate warm-up stage can be safely removed.
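To make the Post-LN vs Pre-LN placement concrete, a minimal PyTorch sketch of both residual block orderings. Module names and sizes are illustrative, not TransformerLib code:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm sits inside the residual branch, before attention and before the FFN."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around (LN -> attention)
        x = x + self.ffn(self.ln2(x))                      # residual around (LN -> FFN)
        return x

class PostLNBlock(nn.Module):
    """Post-LN (vanilla Transformer): LayerNorm is applied after each residual addition."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])  # LN after the residual add
        x = self.ln2(x + self.ffn(x))
        return x

x = torch.randn(2, 10, 64)  # (batch, seq, d_model)
print(PreLNBlock()(x).shape, PostLNBlock()(x).shape)
```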