Open rpand002 opened 5 months ago
You're using our Megatron fork with MegaBlocks integrated? What kind of system are you on? A100, H100, etc.?
@tgale96 Thank you for the great work. I experienced the same slowdown as @rpand002. I'm using an A100 system with your Megatron fork. A multi-node training script for reference would be a great help.
Our Megatron fork is mostly for small-scale experiments and uses the data parallel process group for expert model parallelism. If you scale out to multiple nodes with both data parallelism and expert parallelism enabled, you'll end up doing expert parallelism across those nodes, which can be slow because the all2alls become expensive over the inter-node network.
One thing you could try is using pipeline parallelism between nodes. If you were to use MegaBlocks in a custom framework, I'd recommend using something like FSDP across nodes and expert parallelism within each node.
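To make the suggested layout concrete, here is a minimal sketch of the process-group assignment it implies: expert parallelism within each node, FSDP-style data parallelism across nodes. The 2-node x 8-GPU numbers match the 16-GPU setup in this issue; the helper functions are hypothetical illustrations, not part of MegaBlocks or Megatron.

```python
# Hypothetical sketch of the group layout described above:
# expert parallelism inside a node, FSDP across nodes.
GPUS_PER_NODE = 8
NUM_NODES = 2

def expert_parallel_group(rank):
    """All ranks on the same node share one expert-parallel group,
    so the MoE all2alls stay on the fast intra-node interconnect."""
    node = rank // GPUS_PER_NODE
    return [node * GPUS_PER_NODE + i for i in range(GPUS_PER_NODE)]

def fsdp_group(rank):
    """Ranks with the same local index on different nodes form the
    FSDP (data-parallel) group that communicates across nodes."""
    local = rank % GPUS_PER_NODE
    return [n * GPUS_PER_NODE + local for n in range(NUM_NODES)]

if __name__ == "__main__":
    # Rank 10 lives on node 1 (local rank 2): its expert all2alls stay
    # on node 1, while FSDP pairs it with rank 2 on node 0.
    print(expert_parallel_group(10))  # [8, 9, 10, 11, 12, 13, 14, 15]
    print(fsdp_group(10))             # [2, 10]
```

With this layout, only the (comparatively cheap, overlappable) FSDP collectives cross the node boundary, while every all2all stays on NVLink/PCIe within a node.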
I do not have reference scripts for multi-node training, but for pipeline parallelism the flags are the same as they are in upstream Megatron-LM. I hope this helps!
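For the pipeline-parallelism route, a two-node launch might look roughly like the following. `--pipeline-model-parallel-size` is the upstream Megatron-LM flag; the `torchrun` rendezvous arguments, environment variables, and the trailing `...` are placeholders you'd fill in from your existing single-node script.

```shell
# Run once per node with NODE_RANK=0 and NODE_RANK=1 respectively.
# Places one pipeline stage per node so only activations cross nodes.
torchrun --nnodes 2 --nproc_per_node 8 \
    --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port 6000 \
    pretrain_gpt.py \
    --pipeline-model-parallel-size 2 \
    ...   # keep the model, data, and MoE arguments from your single-node run
```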
Thanks for the excellent work. Following the comment in #59, I am trying to train `dmoe_760m` on 16 GPUs (2 nodes) by changing the distributed arguments for a two-node setup, but training is very slow in terms of elapsed time per iteration (ms). Can you suggest an optimal configuration for multi-node training? A full-fledged multi-node training script would be very helpful. @tgale96