Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Pre and Post LayerNorm #892

Closed · rasbt closed this issue 7 months ago

rasbt commented 7 months ago

It's a nitpick, but I think Pythia uses Post-LayerNorm. However, in Lit-GPT, we may use Pre-LayerNorm:

https://github.com/Lightning-AI/lit-gpt/blob/0f021f3ad8cd8d6fe30b0ef721a7a0e2dac15898/lit_gpt/model.py#L136C1-L142C44

I.e.,

  1. Pythia (https://github.com/EleutherAI/pythia) uses GPTNeoX (https://github.com/EleutherAI/gpt-neox)
  2. GPTNeoX uses Megatron (https://github.com/EleutherAI/gpt-neox/tree/main/megatron)
  3. Their Megatron copy uses Post-LN: https://github.com/EleutherAI/gpt-neox/blob/90f70ff74613293c59391c9f8e469a8e56a75733/megatron/model/transformer.py#L917

I may be reading this incorrectly though, because in the GPTNeoX paper it looks like they use Pre-LayerNorm.

Long story short, maybe Block in Lit-GPT should have a norm argument like:

class Block(nn.Module):
    def __init__(self, config: Config) -> None:
        super().__init__()
        self.config = config
        ...  # norm_1, attn, norm_2, mlp as before

    def forward(self, x, cos, sin, mask, input_pos):
        if self.config.layernorm == "pre":
            n_1 = self.norm_1(x)
            h = self.attn(n_1, cos, sin, mask, input_pos)
        elif self.config.layernorm == "post":
            h = self.attn(x, cos, sin, mask, input_pos)
            n_1 = self.norm_1(h)
        # ... residual add and MLP path would follow
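
For illustration only, here is a self-contained toy version of that idea. This is a sketch, not the actual litgpt API: a plain nn.MultiheadAttention and a small MLP stand in for litgpt's attention and MLP modules, and norm_order is a hypothetical argument.

import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Toy transformer block with switchable norm placement ("pre" or "post")."""

    def __init__(self, n_embd: int, n_head: int, norm_order: str = "pre") -> None:
        super().__init__()
        self.norm_order = norm_order
        self.norm_1 = nn.LayerNorm(n_embd)
        self.norm_2 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.norm_order == "pre":
            # Pre-LN: normalize the sub-layer *input*, then add the residual
            n_1 = self.norm_1(x)
            x = x + self.attn(n_1, n_1, n_1, need_weights=False)[0]
            x = x + self.mlp(self.norm_2(x))
        else:
            # Post-LN (original Transformer): add the residual, then normalize
            x = self.norm_1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.norm_2(x + self.mlp(x))
        return x

x = torch.randn(2, 8, 64)
print(ToyBlock(64, 4, "pre")(x).shape, ToyBlock(64, 4, "post")(x).shape)
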
carmocca commented 7 months ago

This is the relevant gpt-neox reference, not Megatron: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L669-L723

rasbt commented 7 months ago

Thanks! It's still post-LN though, right?

https://github.com/huggingface/transformers/blob/3f69f415adcbdaedec154ba8eac220ef3276975d/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L706C35-L706C79

# pseudocode:
# x = x + attn(ln1(x)) + mlp(ln2(x))
mlp_output = self.mlp(self.post_attention_layernorm(hidden_states))

I mean, Pre-LN is what I use in practice, and it's what Lit-GPT also does. But to reproduce some of these models more exactly, we'd probably want to allow Post-LN?

carmocca commented 7 months ago

You have both, pre and post:

https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L673-L674
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L690-L691
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L706

Andrei-Aksionov commented 7 months ago

Hi guys 👋 I think the naming in the HF implementation introduced some confusion. Both normalization layers are applied before their sub-layers (i.e., Pre-LayerNorm); it's just that the names are a bit misleading:

If you strip down their code, the ordering is:

  1. self.input_layernorm
  2. self.attention
  3. self.post_attention_layernorm
  4. self.mlp

This is reflected in the provided pseudo-code:

# pseudocode:
# x = x + attn(ln1(x))
# x = x + mlp(ln2(x))
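
For illustration, here is a toy module with the same sub-module names; the attention and MLP are replaced by plain linear layers, so this is only a sketch of the ordering, not the real GPTNeoXLayer:

import torch
import torch.nn as nn

class ToyNeoXLayer(nn.Module):
    """Toy stand-in mirroring the sub-module names (and ordering) of HF's GPTNeoXLayer."""

    def __init__(self, d: int) -> None:
        super().__init__()
        self.input_layernorm = nn.LayerNorm(d)
        self.post_attention_layernorm = nn.LayerNorm(d)
        self.attention = nn.Linear(d, d)  # placeholder for self-attention
        self.mlp = nn.Linear(d, d)        # placeholder for the MLP

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1. input_layernorm -> 2. attention, then residual add
        x = x + self.attention(self.input_layernorm(x))
        # 3. post_attention_layernorm -> 4. mlp, then residual add
        x = x + self.mlp(self.post_attention_layernorm(x))
        # Both norms sit *before* their sub-layers, i.e. Pre-LN,
        # despite the "post_attention" in the second norm's name.
        return x

print(ToyNeoXLayer(8)(torch.randn(2, 3, 8)).shape)  # torch.Size([2, 3, 8])
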
rasbt commented 7 months ago

Ah yes, thanks, this looks like pre-layernorm. Otherwise it would be x = x + ln1(attn(x)).

Andrei-Aksionov commented 7 months ago

Closing?

rasbt commented 7 months ago

Yeah that's probably fine since no one is using Post-LN anymore.