Closed peiyingxin closed 8 months ago
Yes! Putting the MLP and attention layers in parallel is known not to hurt performance at scale while providing a substantial increase in training speed. It was introduced by GPT-J-6B and has since been used by GPT-NeoX-20B, PaLM 1 and 2, ViT-22B, and many more. Experiments at different labs consistently report a roughly 15% speed-up in training.
It's generally reported without a full ablation, but the PaLM 1 paper and the GPT-NeoX-20B paper both describe experiments showing this.
Hi, first of all, thanks for your great contributions to open research! I am confused about how the model architecture influences model performance. I note that the Pythia layer block looks like
pseudocode: x = x + attn(ln1(x)) + mlp(ln2(x))
and GPT or LLaMA Layer Block like
pseudocode: x = x + attn(ln1(x)); x = x + mlp(ln2(x))
Have you tested how this architectural difference affects performance?
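The dataflow difference between the two pseudocode lines above can be made concrete with a minimal NumPy sketch. The linear maps standing in for the attention and MLP sub-layers are hypothetical placeholders, just to show where each sub-layer reads from the residual stream; real blocks use full attention and a gated/feed-forward MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Hypothetical stand-ins for the sub-layers: fixed linear maps.
W_attn = rng.standard_normal((d, d)) * 0.1
W_mlp = rng.standard_normal((d, d)) * 0.1

def ln(x):
    # Layer norm over the feature axis (no learned scale/shift, for brevity).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def attn(x):
    return x @ W_attn

def mlp(x):
    return x @ W_mlp

def parallel_block(x):
    # Pythia/GPT-J style: both sub-layers read the SAME residual-stream
    # state, so their input projections can run (or be fused) concurrently.
    return x + attn(ln(x)) + mlp(ln(x))

def sequential_block(x):
    # GPT-2/LLaMA style: the MLP sees the residual stream AFTER the
    # attention update has been applied.
    x = x + attn(ln(x))
    return x + mlp(ln(x))

x = rng.standard_normal((4, d))
# Same shapes, but generally different outputs, because the MLP's input differs.
print(parallel_block(x).shape, sequential_block(x).shape)
print(np.allclose(parallel_block(x), sequential_block(x)))
```

Because the parallel form removes the dependency of the MLP on the attention output within a block, the two big matrix multiplies can be fused or overlapped, which is where the reported training speed-up comes from.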