databricks / megablocks


different load_balancing_loss with different pipeline_parallel_size #85


bozheng-hit commented 5 months ago

I loaded the same model trained with Megatron + MegaBlocks, and I found that the load_balancing_loss is slightly different. When I increase the pipeline_parallel_size, the load_balancing_loss also increases. Is this just a precision issue, or is there a potential bug?

For example, when I train a 500M GPT model with 64 experts, the load_balancing_loss (lbl) for each pipeline_parallel_size (pp_size) is listed in the table below.

| pp_size | lbl |
| --- | --- |
| 1 | 1.005E-01 |
| 2 | 1.007E-01 |
| 4 | 1.013E-01 |
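
For reference, here is a minimal, self-contained sketch of the precision question discussed below. The shapes, seed, and the Switch-style auxiliary-loss formula are illustrative assumptions, not taken from MegaBlocks or from this run; it only shows how summing per-layer load-balancing losses in low precision can shift the total slightly.

```python
# Illustrative only: a Switch-style load-balancing loss summed over layers,
# accumulated in float32 vs. bfloat16. Small rounding differences of this kind
# are one way the reported lbl could shift with the parallelism layout.
import torch

torch.manual_seed(0)
num_layers, num_experts, tokens = 24, 64, 4096

per_layer = []
for _ in range(num_layers):
    probs = torch.softmax(torch.randn(tokens, num_experts), dim=-1)
    top1 = probs.argmax(dim=-1)
    # Fraction of tokens routed to each expert and mean router probability.
    frac_tokens = torch.bincount(top1, minlength=num_experts).float() / tokens
    mean_probs = probs.mean(dim=0)
    per_layer.append(num_experts * (frac_tokens * mean_probs).sum())

losses = torch.stack(per_layer)
print("float32 sum: ", losses.sum().item())
print("bfloat16 sum:", losses.to(torch.bfloat16).sum().item())
```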
bozheng-hit commented 5 months ago

BTW, is there an example of enabling moe_weight_parallelism in MegaBlocks?

tgale96 commented 5 months ago

Interesting! This could be a number of things. We have a flag to compute the LBL in float32, which might help rule out the numerics. Did you change anything else about your setup, e.g., weight initialization, data order, etc.?
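
A minimal sketch of what that might look like, assuming the flag in question is `moe_lbl_in_fp32` on the MegaBlocks `Arguments` dataclass (check the arguments module in your installed version; the exact name here is an assumption):

```python
# Hedged sketch: force the load-balancing loss to be computed in float32.
# The flag name `moe_lbl_in_fp32` is assumed; verify it against
# megablocks/layers/arguments.py in your checkout.
from megablocks.layers.arguments import Arguments

args = Arguments(
    hidden_size=1024,
    ffn_hidden_size=4096,
    moe_num_experts=64,
    moe_top_k=1,
    moe_lbl_in_fp32=True,  # compute the LBL in float32 to rule out numerics
)
# `args` is then passed to the MoE/dMoE layer constructor as usual.
```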

> BTW, is there an example of enabling moe_weight_parallelism in MegaBlocks?

We haven't plumbed support for MoE weight parallelism in our Megatron-LM fork, which it sounds like you're using. There are some users who do use it, but I don't believe their frameworks are open-source (yet!).

bozheng-hit commented 5 months ago

I think it is caused by precision. Is there a plan for supporting MoE weight parallelism in the Megatron-LM fork?

tgale96 commented 5 months ago

Great. If it is caused by precision, I would recommend trying FP32 to see whether that resolves the difference.

Supporting weight parallelism in Megatron-LM is easy-ish, depending on how you'd like to use it. Could you share the sharding configuration that you'd like to use?

bozheng-hit commented 5 months ago

I'd like to use weight parallelism in dMoE with SwiGLU layers. I noticed that weight parallelism is not supported with GLU yet.
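
For concreteness, here is a hedged sketch of the configuration being asked about, assuming the relevant knobs on the `Arguments` dataclass are `mlp_type='glu'` (with a SiLU activation for SwiGLU) and `moe_weight_parallelism`; the argument names are assumptions, and per this thread the two features cannot yet be combined.

```python
# Hedged sketch of a SwiGLU dMoE config; argument names are assumptions drawn
# from the MegaBlocks Arguments dataclass and may differ across versions.
import torch.nn.functional as F
from megablocks.layers.arguments import Arguments

args = Arguments(
    hidden_size=1024,
    ffn_hidden_size=4096,
    moe_num_experts=64,
    moe_top_k=1,
    mlp_type='glu',          # GLU-style expert MLPs
    activation_fn=F.silu,    # SiLU gate -> SwiGLU
    # moe_weight_parallelism=True,  # not yet supported together with GLU
)
```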

tgale96 commented 5 months ago

Ah yes, it is not. Is there an issue with expert model parallelism for your setup?

bozheng-hit commented 5 months ago

Yes, expert model parallelism does not support the distributed optimizer, and it uses the data-parallel group, which is inconvenient for larger-scale pre-training.
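
To illustrate the decoupling being asked for, here is a hedged sketch of building a dedicated expert-parallel process group with plain torch.distributed instead of reusing the data-parallel group. How such a group would be handed to MegaBlocks (e.g. an `expert_parallel_group` argument) is an assumption about the Arguments dataclass, and the Megatron-LM fork wires its groups up differently.

```python
# Hedged sketch: carve the world into contiguous expert-parallel groups so
# expert sharding does not have to reuse the data-parallel group.
import torch.distributed as dist

def build_expert_parallel_group(expert_parallel_size: int):
    """Return the expert-parallel group containing the calling rank."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    my_group = None
    for start in range(0, world_size, expert_parallel_size):
        ranks = list(range(start, start + expert_parallel_size))
        # new_group() must be called by every rank, in the same order.
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            my_group = group
    return my_group
```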

tgale96 commented 5 months ago

Yes, our Megatron-LM integration is certainly not sufficient for large-scale training. More complete framework integration is in the pipeline, but unfortunately it won't be available for a while.

If you're interested in using MegaBlocks in some way that isn't currently supported by our Megatron fork, I'm happy to answer any questions you have about implementing it!