Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Pre and Post LayerNorm #892

Closed · rasbt closed this issue 7 months ago

rasbt commented 7 months ago

It's a nitpick, but I think Pythia uses Post-LayerNorm. However, in Lit-GPT, we may use Pre-LayerNorm:

https://github.com/Lightning-AI/lit-gpt/blob/0f021f3ad8cd8d6fe30b0ef721a7a0e2dac15898/lit_gpt/model.py#L136C1-L142C44

I.e.,

  1. Pythia (https://github.com/EleutherAI/pythia) uses GPTNeoX (https://github.com/EleutherAI/gpt-neox)
  2. GPTNeoX uses Megatron (https://github.com/EleutherAI/gpt-neox/tree/main/megatron)
  3. Their Megatron copy uses Post-LN: https://github.com/EleutherAI/gpt-neox/blob/90f70ff74613293c59391c9f8e469a8e56a75733/megatron/model/transformer.py#L917

I may be reading this incorrectly though, because in the GPTNeoX paper it looks like they use Pre-LayerNorm.

Long story short, maybe Block in Lit-GPT should have a norm argument like:

class Block(nn.Module):
    def __init__(self, config: Config) -> None:
        super().__init__()
        self.config = config
        ...  # norm_1, attn, norm_2, mlp as before

    def forward(self, x, cos, sin, mask, input_pos):
        if self.config.layernorm == "pre":
            n_1 = self.norm_1(x)
            h = self.attn(n_1, cos, sin, mask, input_pos)
        elif self.config.layernorm == "post":
            h = self.attn(x, cos, sin, mask, input_pos)
            n_1 = self.norm_1(h)
        # ... residual add and MLP path would follow
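
For illustration only, here is a self-contained toy version of that idea. This is a sketch, not the actual litgpt API: a plain nn.MultiheadAttention and a small MLP stand in for litgpt's attention and MLP modules, and norm_order is a hypothetical argument.

import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Toy transformer block with switchable norm placement ("pre" or "post")."""

    def __init__(self, n_embd: int, n_head: int, norm_order: str = "pre") -> None:
        super().__init__()
        self.norm_order = norm_order
        self.norm_1 = nn.LayerNorm(n_embd)
        self.norm_2 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.norm_order == "pre":
            # Pre-LN: normalize the sub-layer *input*, then add the residual
            n_1 = self.norm_1(x)
            x = x + self.attn(n_1, n_1, n_1, need_weights=False)[0]
            x = x + self.mlp(self.norm_2(x))
        else:
            # Post-LN (original Transformer): add the residual, then normalize
            x = self.norm_1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.norm_2(x + self.mlp(x))
        return x

x = torch.randn(2, 8, 64)
print(ToyBlock(64, 4, "pre")(x).shape, ToyBlock(64, 4, "post")(x).shape)
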
carmocca commented 7 months ago

This is the relevant gpt-neox reference, not Megatron: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L669-L723

rasbt commented 7 months ago

Thanks! It's still post-LN though, right?

https://github.com/huggingface/transformers/blob/3f69f415adcbdaedec154ba8eac220ef3276975d/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L706C35-L706C79

# pseudocode:
# x = x + attn(ln1(x)) + mlp(ln2(x))
mlp_output = self.mlp(self.post_attention_layernorm(hidden_states))

I mean, Pre-LN is what I use in practice, and it's what Lit-GPT also does. But to reproduce some of these models more exactly, we'd probably want to allow Post-LN?

carmocca commented 7 months ago

You have both, pre and post:

https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L673-L674
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L690-L691
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L706

Andrei-Aksionov commented 7 months ago

Hi guys 👋 I think the naming in the HF implementation introduced some confusion. Both normalization layers are applied before their sub-layers (i.e., Pre-LayerNorm); it's just that the names are a bit misleading:

If you strip down their code, the ordering is:

  1. self.input_layernorm
  2. self.attention
  3. self.post_attention_layernorm
  4. self.mlp

This is reflected in the provided pseudo-code:

# pseudocode:
# x = x + attn(ln1(x))
# x = x + mlp(ln2(x))
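
For illustration, here is a toy module with the same sub-module names; the attention and MLP are replaced by plain linear layers, so this is only a sketch of the ordering, not the real GPTNeoXLayer:

import torch
import torch.nn as nn

class ToyNeoXLayer(nn.Module):
    """Toy stand-in mirroring the sub-module names (and ordering) of HF's GPTNeoXLayer."""

    def __init__(self, d: int) -> None:
        super().__init__()
        self.input_layernorm = nn.LayerNorm(d)
        self.post_attention_layernorm = nn.LayerNorm(d)
        self.attention = nn.Linear(d, d)  # placeholder for self-attention
        self.mlp = nn.Linear(d, d)        # placeholder for the MLP

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1. input_layernorm -> 2. attention, then residual add
        x = x + self.attention(self.input_layernorm(x))
        # 3. post_attention_layernorm -> 4. mlp, then residual add
        x = x + self.mlp(self.post_attention_layernorm(x))
        # Both norms sit *before* their sub-layers, i.e. Pre-LN,
        # despite the "post_attention" in the second norm's name.
        return x

print(ToyNeoXLayer(8)(torch.randn(2, 3, 8)).shape)  # torch.Size([2, 3, 8])
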
rasbt commented 7 months ago

Ah yes, thanks, this looks like pre-layernorm. Otherwise it would be x = x + ln1(attn(x)).

Andrei-Aksionov commented 7 months ago

Closing?

rasbt commented 7 months ago

Yeah that's probably fine since no one is using Post-LN anymore.