databricks / megablocks

Apache License 2.0

How do you use routing balancing loss under pipeline parallelism #64

Closed szhengac closed 11 months ago

szhengac commented 11 months ago

Hi, I see there is a balancing loss implementation with pipeline parallelism in moe.py, but I wonder how you use it with Megatron-LM. There seems to be no example training code in the repository. With pipeline parallelism, we can probably only compute the backward pass of the balancing loss when the next stage passes the output gradient back.
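To make the concern concrete, here is a minimal PyTorch sketch of a single pipeline stage. It is not the MegaBlocks or Megatron-LM code: `stage_forward`, `router`, `proj`, `aux_weight`, and the Switch Transformer-style loss are all hypothetical stand-ins, and the gradient from the next stage is faked with random values. It only shows that the local balancing loss can be folded into the same backward call that consumes the received output gradient.

```python
import torch

torch.manual_seed(0)
num_experts, hidden, tokens = 4, 8, 16

# Hypothetical stand-ins for the layers owned by one pipeline stage.
router = torch.nn.Linear(hidden, num_experts, bias=False)
proj = torch.nn.Linear(hidden, hidden, bias=False)


def stage_forward(x):
    """Return the stage output and a Switch-style auxiliary balancing loss."""
    probs = torch.softmax(router(x), dim=-1)  # (tokens, experts)
    top1 = probs.argmax(dim=-1)               # chosen expert per token
    # Fraction of tokens routed to each expert (not differentiable).
    frac = torch.bincount(top1, minlength=num_experts).float() / x.shape[0]
    # Mean router probability per expert (differentiable).
    mean_prob = probs.mean(dim=0)
    aux_loss = num_experts * torch.dot(frac, mean_prob)
    # Scale by the top-1 probability so the router also sees the task gradient.
    out = proj(x) * probs.gather(1, top1[:, None])
    return out, aux_loss


x = torch.randn(tokens, hidden)
out, aux_loss = stage_forward(x)

# In a real pipeline this tensor would be received from the next stage.
grad_from_next_stage = torch.randn_like(out)
aux_weight = 0.01  # assumed loss coefficient

# One backward call: seed the stage output with the received gradient and
# seed the scalar balancing loss with its weight.
torch.autograd.backward(
    tensors=[out, aux_loss],
    grad_tensors=[grad_from_next_stage, torch.tensor(aux_weight)],
)
```

After this call, `router.weight.grad` holds the sum of the task gradient and the balancing-loss gradient, so the stage never needs a separate backward pass for the auxiliary loss.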

tgale96 commented 11 months ago

Hi! We have trained MoEs with pipeline parallelism in Megatron. You should be able to take one of our training scripts and configure the pipeline parallelism arguments.

cc @deepakn94, who got pipeline parallelism to work with Megatron + MegaBlocks.

szhengac commented 11 months ago

It seems none of the four scripts you shared use pipeline parallelism?

tgale96 commented 11 months ago

They do not - you'll have to set the arguments to pass to Megatron, but it should work fine :)

szhengac commented 11 months ago

I see. It seems you are using your own fork of Megatron-LM. Can you please point to the lines of code that support the routing balancing loss with pipeline parallelism? I don't think the official Megatron-LM main branch supports that.

tgale96 commented 11 months ago

You can see the changes we made in the top four commits from Deepak here. Hope this helps!