hatanp opened this issue 4 months ago
Hi, the source of the SP hang seems to be this commit. With everything else held constant, the commits before it work, but the ones after it hang.
That is a separate, known issue. There is a barrier that currently only tensor-parallel rank 0 joins; the fix is relatively easy but has not yet been implemented.
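To make that failure mode concrete, here is a minimal sketch of the deadlock pattern. The function names and the guard condition are hypothetical, not the actual Megatron-DeepSpeed code; the point is only that a collective entered by a single rank blocks forever:

```python
# Sketch only: assumes dist.init_process_group() has already been called
# and that all tensor-parallel ranks execute this function together.
import torch.distributed as dist

def buggy_sync(tp_rank: int) -> None:
    # BUG: only tensor-parallel rank 0 enters the barrier, so it waits
    # forever for the other ranks, which never call it -> hang.
    if tp_rank == 0:
        dist.barrier()

def fixed_sync(tp_rank: int) -> None:
    # FIX: every rank in the group must join the same collective.
    dist.barrier()
```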
Currently it seems that neither Megatron SP nor DeepSpeed SP is correctly implemented in Megatron-DeepSpeed. Perhaps this worked at some point, but as new features were added the two came into conflict: for example, flags that were originally meant to check for Megatron SP were actually implemented to check for DeepSpeed SP, and some code paths gather along the wrong dimension, i.e. the one DeepSpeed SP uses. Importantly, SP also needs to work together with TP and PP to be useful for large-scale training.
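For illustration, here is a hedged sketch of the wrong-dimension gather. Megatron SP shards activations along the sequence axis of a `[seq, batch, hidden]` tensor, so the gather that undoes it must concatenate along dim 0; the helper name and layout assumptions below are mine, not the repository's code:

```python
# Sketch only: assumes an initialized process group and a [s, b, h]
# activation layout sharded on the sequence axis (Megatron SP style).
import torch
import torch.distributed as dist

def gather_sequence_parallel(local_chunk: torch.Tensor, group=None) -> torch.Tensor:
    """All-gather a sequence-sharded activation back to full length."""
    world_size = dist.get_world_size(group=group)
    chunks = [torch.empty_like(local_chunk) for _ in range(world_size)]
    dist.all_gather(chunks, local_chunk, group=group)
    # Correct for Megatron SP: concatenate along the sequence axis
    # (dim 0 in [s, b, h] layout).
    # A gather along the wrong axis, e.g. torch.cat(chunks, dim=-1),
    # would stitch the chunks along the hidden dimension and silently
    # produce activations of the wrong shape.
    return torch.cat(chunks, dim=0)
```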
A port of Megatron-LM from 10/23 implements SP successfully but lacks some features, such as ones related to MoE, as mentioned in issue #44.