bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

allocate embed norm only on pp0 #261

Closed stas00 closed 2 years ago

stas00 commented 2 years ago

Don't allocate the embedding LayerNorm on pp rank -1 (the last pipeline-parallel stage), since it is never used there and allocating it just wastes GPU memory.
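The pattern described above can be sketched as a simple rank-conditional allocation. This is a hypothetical, framework-free illustration (the class name `EmbeddingStage`, the parameter names, and the plain-list stand-in for a LayerNorm's weight and bias are all made up for this sketch, not the actual Megatron-DeepSpeed code):

```python
# Hypothetical sketch: allocate the post-embedding LayerNorm only on the
# first pipeline-parallel stage. The last stage shares the embedding
# weights for the output projection but never runs this LayerNorm, so
# allocating it there only wastes GPU memory.

class EmbeddingStage:
    def __init__(self, hidden_size, pp_rank, pp_world_size):
        self.is_first_stage = (pp_rank == 0)
        self.is_last_stage = (pp_rank == pp_world_size - 1)
        # Stand-in for a LayerNorm's parameters (weight + bias, i.e.
        # 2 * hidden_size values); only the first stage allocates it.
        if self.is_first_stage:
            self.embed_norm = [0.0] * (2 * hidden_size)
        else:
            self.embed_norm = None


first = EmbeddingStage(hidden_size=1024, pp_rank=0, pp_world_size=4)
last = EmbeddingStage(hidden_size=1024, pp_rank=3, pp_world_size=4)
print(first.embed_norm is not None)  # first stage allocates the norm
print(last.embed_norm is None)       # last stage skips the allocation
```

In a real pipeline-parallel model the same check would gate the construction of the actual `LayerNorm` module, so last-stage ranks never hold parameters they cannot use.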