argonne-lcf / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Sequence parallelism #45

Open hatanp opened 4 months ago

hatanp commented 4 months ago

Currently it seems that neither Megatron SP nor DeepSpeed SP is correctly implemented in Megatron-DeepSpeed. This may have worked at some point, but as new features were added the two have come into conflict; for example, flags that were originally meant to check for Megatron SP now actually check for DeepSpeed SP, and as a result some code paths collect the wrong dimension, for example the one used by DeepSpeed SP. Importantly, SP also needs to work together with TP and PP to be useful for large-scale training.
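A minimal single-process sketch of that failure mode (not actual Megatron-DeepSpeed code; the `[seq, batch, hidden]` layout and the dimension choices are assumptions for illustration only): if two SP schemes shard activations differently, a guard that checks the wrong flag and gathers along the wrong dimension silently produces a tensor with the wrong shape.

```python
# Illustrative sketch only: simulates "gathering along the wrong dimension"
# with plain tensors instead of real distributed collectives.
import torch

seq_len, batch, hidden, sp_world_size = 8, 2, 4, 2

full = torch.randn(seq_len, batch, hidden)

# Each SP rank holds a slice of the sequence dimension (dim 0 here).
shards = list(torch.chunk(full, sp_world_size, dim=0))

# Correct reconstruction: concatenate along the dimension that was sharded.
gathered_ok = torch.cat(shards, dim=0)
assert gathered_ok.shape == full.shape

# Incorrect reconstruction: gathering along a different dimension, e.g. because
# a code path meant for the other SP flavor was taken, yields the wrong shape
# and silently wrong data downstream.
gathered_bad = torch.cat(shards, dim=1)
print(gathered_ok.shape)   # torch.Size([8, 2, 4])
print(gathered_bad.shape)  # torch.Size([4, 4, 4])  -> shape mismatch
```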

A ported Megatron-LM from 10/23 implements SP successfully, but lacks some features, such as ones related to MoE, mentioned in issue #44.

Eugene29 commented 2 months ago

Hi, the source of the SP hang seems to be related to this commit. With everything else held constant, the commits before it work, but the ones after it hang.

hatanp commented 2 months ago

That is a separate known issue. There is a barrier that currently only tensor parallel rank 0 joins; the fix is relatively easy but not yet implemented.
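A minimal sketch of why that hangs (hypothetical code, not the actual Megatron-DeepSpeed code path): `torch.distributed.barrier` is a collective, so if only rank 0 of a group enters it, rank 0 blocks forever waiting for the others. The fix is simply to have every rank in the group call it.

```python
# Standalone demo with two gloo processes; run directly with python.
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29511"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Buggy pattern (the kind of hang described above): only rank 0 joins the
    # collective, so rank 0 waits forever for the other ranks.
    # if rank == 0:
    #     dist.barrier()

    # Fixed pattern: every rank joins the barrier.
    dist.barrier()
    print(f"rank {rank} passed the barrier")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```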