szhengac closed this issue 11 months ago
Hi! We have trained MoEs with pipeline parallelism in Megatron. You should be able to take one of our training scripts and configure the pipeline parallelism arguments.
cc @deepakn94, who got pipeline parallelism to work with Megatron + MegaBlocks.
It seems none of the four scripts you shared use pipeline parallelism?
They do not - you'll have to set the arguments to pass to Megatron, but it should work fine :)
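For anyone else reading: the pipeline-parallel degree is set via Megatron-LM's standard launch arguments. A minimal fragment might look like the following (the surrounding script and other hyperparameters are assumptions based on Megatron's usual `pretrain_gpt.py` entry point, not taken from this repository's scripts):

```shell
# Hypothetical launch fragment: split the model across 4 pipeline stages
# and 2 tensor-parallel ranks (8 GPUs total in this example).
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --pipeline-model-parallel-size 4 \
    --tensor-model-parallel-size 2 \
    ... # remaining model/data/optimizer arguments as in the training scripts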
I see. It seems you are using your own fork of Megatron-LM. Can you point to the lines of code that support the router load-balancing loss with pipeline parallelism? I don't think the official Megatron-LM main branch supports that.
Hi, I see there is a load-balancing loss implementation with pipeline parallelism in moe.py. But I wonder how you use it with Megatron-LM? There seems to be no example training code in the repository. With pipeline parallelism, we can presumably only compute the backward of the balancing loss once the next stage passes the output gradient back.
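One way this can work (a minimal sketch of the general pattern, not MegaBlocks' actual code; all names below are illustrative assumptions) is that each stage's balancing loss depends only on that stage's routers, so its backward can be launched in the same autograd call that consumes the output gradient arriving from the next stage. The balancing gradient then accumulates into the stage input's gradient and rides back through the pipeline for free:

```python
import torch

# --- forward on one intermediate pipeline stage ---
stage_input = torch.randn(8, 16, requires_grad=True)  # activation from previous stage
router = torch.nn.Linear(16, 4)                       # toy router over 4 experts
hidden = torch.relu(router(stage_input))

# Surrogate balancing term computed from this stage's router probabilities
# (a stand-in for the real load-balancing loss).
probs = router(stage_input).softmax(dim=-1)
balance_loss = 0.01 * (probs.mean(dim=0) ** 2).sum()

stage_output = hidden  # tensor sent downstream to the next stage

# --- backward, once the next stage returns grad w.r.t. stage_output ---
grad_output = torch.randn_like(stage_output)  # received from the next stage

# One autograd call over both the pipeline output and the local balancing
# loss: gradients from both accumulate into stage_input.grad and this
# stage's parameters, so no extra inter-stage communication is needed.
torch.autograd.backward(
    [stage_output, balance_loss],
    [grad_output, torch.ones_like(balance_loss)],
)

# stage_input.grad now carries both contributions and is what this stage
# would send back to the previous stage.
print(stage_input.grad is not None)
```

So you are right that the backward waits for the next stage's output gradient, but the balancing loss needs no separate backward pass or extra communication beyond that.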