bozheng-hit opened 5 months ago
Interesting! This could be a number of things. We have a flag to compute the LBL in float32, which might help rule out the numerics. Did you change anything else about your setup? e.g., changes in weight initialization, data order, etc.
BTW, is there an example for enabling moe_weight_parallelism in megablocks?
We haven't plumbed support for MoE weight parallelism in our Megatron-LM fork, which it sounds like you're using. There are some users who do use it but I don't believe their frameworks are open-source (yet!).
I think it is caused by precision. Is there a plan for supporting MoE weight parallelism in the Megatron-LM fork?
Great. If it is caused by precision, I would recommend trying FP32 to see whether that resolves the difference.
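To make the precision hypothesis concrete, here is a small self-contained sketch (not MegaBlocks code; the values are made up) showing why accumulating many small per-token contributions, as a load-balancing loss does, can drift in bfloat16 but not in higher precision. The `to_bf16` helper simulates bfloat16 rounding by keeping only the top 16 bits of the float32 representation:

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to bfloat16 precision (8-bit mantissa) by keeping the
    top 16 bits of its float32 encoding, with round-to-nearest."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Made-up stand-in for an LBL accumulator: one large term plus many
# small per-token contributions.
terms = [1.0] + [1e-4] * 1000

# High-precision accumulation (Python floats are float64 here).
acc_fp32 = 0.0
for t in terms:
    acc_fp32 += t

# Simulated bfloat16 accumulation: every intermediate is rounded.
acc_bf16 = 0.0
for t in terms:
    acc_bf16 = to_bf16(acc_bf16 + to_bf16(t))

print(acc_fp32)  # ~1.1
print(acc_bf16)  # 1.0: each 1e-4 is below bf16's epsilon once acc is 1.0
```

Once the accumulator reaches 1.0, every 1e-4 increment rounds away in bf16, so the two results diverge even though the inputs are identical. This is the kind of effect an fp32 LBL computation rules out.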
Supporting weight parallelism in Megatron-LM is easy-ish, depending on how you'd like to use it. Could you share the sharding configuration that you'd like to use?
I'd like to use weight parallelism in dMoE with SwiGLU layers. I noticed that weight parallelism is not supported with GLU yet.
Ah yes, it is not. Is there an issue with expert model parallelism for your setup?
Yes, the expert model parallelism does not support the distributed optimizer, and it reuses the data-parallel group, which is inconvenient for larger-scale pre-training.
Yes, our Megatron-LM integration is certainly not sufficient for large scale training. There is more proper framework integration in the pipeline, but it won't be available for a bit unfortunately.
If you're interested in using MegaBlocks in some way that isn't currently supported by our Megatron fork, I'm happy to answer any questions you have about implementing it!
I loaded the same model trained with Megatron + MegaBlocks, and I found that the load_balancing_loss is slightly different. When I increase pipeline_parallel_size, the load_balancing_loss also increases. Is this just a precision issue, or is there a potential bug?
For example, when training a 500M GPT model with 64 experts, I list the load_balancing_loss for each pp_size in the table below.
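A dependence on pp_size alone is consistent with floating-point non-associativity: with more pipeline stages, the per-layer LBL terms are summed locally per stage and then reduced, so the grouping of the additions changes. A minimal illustration, with made-up per-layer values:

```python
# Hypothetical per-layer LBL contributions for a 3-layer model
# (values chosen purely to illustrate float non-associativity).
layer_lbl = [0.1, 0.2, 0.3]

# pp_size = 1: a single stage accumulates every layer in order.
lbl_pp1 = (layer_lbl[0] + layer_lbl[1]) + layer_lbl[2]

# pp_size = 2: stage 0 holds layer 0, stage 1 holds layers 1-2;
# each stage sums locally, then the partial sums are reduced.
stage0 = layer_lbl[0]
stage1 = layer_lbl[1] + layer_lbl[2]
lbl_pp2 = stage0 + stage1

print(lbl_pp1)  # 0.6000000000000001
print(lbl_pp1 == lbl_pp2)  # False: same terms, different grouping
```

A tiny grouping-dependent difference like this is expected and harmless; a loss that grows systematically with pp_size, rather than merely fluctuating, would be worth investigating further.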