Closed peiyingxin closed 8 months ago
Yes! Putting the MLP and attention layers in parallel is known not to hurt performance at scale while providing a substantial increase in training speed. It was introduced by GPT-J-6B and has since been used by GPT-NeoX-20B, PaLM 1 and 2, ViT-22B, and many more. Experiments at different labs consistently report a roughly 15% speed-up in training.
It's generally reported without a full ablation, but the PaLM 1 paper and the GPT-NeoX-20B paper both describe experiments showing this.
Hi, first of all, thanks for your great contributions to open research! I am confused about how the model architecture influences model performance. I note that the Pythia layer block looks like
pseudocode: x = x + attn(ln1(x)) + mlp(ln2(x))
and GPT or LLaMA Layer Block like
pseudocode: x = x + attn(ln1(x)); x = x + mlp(ln2(x))
Have you tested how this architectural difference affects performance?
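The dataflow difference between the two pseudocode lines above can be made concrete with a minimal NumPy sketch. The linear maps standing in for the attention and MLP sub-layers are hypothetical placeholders, just to show where each sub-layer reads from the residual stream; real blocks use full attention and a gated/feed-forward MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Hypothetical stand-ins for the sub-layers: fixed linear maps.
W_attn = rng.standard_normal((d, d)) * 0.1
W_mlp = rng.standard_normal((d, d)) * 0.1

def ln(x):
    # Layer norm over the feature axis (no learned scale/shift, for brevity).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def attn(x):
    return x @ W_attn

def mlp(x):
    return x @ W_mlp

def parallel_block(x):
    # Pythia/GPT-J style: both sub-layers read the SAME residual-stream
    # state, so their input projections can run (or be fused) concurrently.
    return x + attn(ln(x)) + mlp(ln(x))

def sequential_block(x):
    # GPT-2/LLaMA style: the MLP sees the residual stream AFTER the
    # attention update has been applied.
    x = x + attn(ln(x))
    return x + mlp(ln(x))

x = rng.standard_normal((4, d))
# Same shapes, but generally different outputs, because the MLP's input differs.
print(parallel_block(x).shape, sequential_block(x).shape)
print(np.allclose(parallel_block(x), sequential_block(x)))
```

Because the parallel form removes the dependency of the MLP on the attention output within a block, the two big matrix multiplies can be fused or overlapped, which is where the reported training speed-up comes from.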