allenai / OLMoE

OLMoE: Open Mixture-of-Experts Language Models
https://arxiv.org/abs/2409.02060
Apache License 2.0

MOE Export Parallelism Training Script #8

Closed wdlctc closed 1 month ago

wdlctc commented 1 month ago

Hello OLMoE team,

I’m currently exploring training scripts for models using Mixture of Experts (MOE) and was wondering if there are any existing or planned scripts that handle expert parallelism during the export phase for MOE models? Specifically, I'm interested in techniques for parallelizing the export process for efficient training in distributed environments.

If not, could you provide any guidance on how to implement this or any references that would be useful for such a setup?

Thank you!

Best regards, cheng luo

Muennighoff commented 1 month ago

What is the export phase?

wdlctc commented 1 month ago

Expert parallelism, sorry for the typo.

Muennighoff commented 1 month ago

I see; you can activate expert parallelism here https://github.com/allenai/OLMo/blob/cd0004be3f5a82fff8b4b990a00be1377e084eac/olmo/config.py#L1369, but I don't think it works at the moment: when I tried it, the expert parameters did not show up in the parameter count. I think you need to set up a device mesh, like llm-foundry does here: https://github.com/mosaicml/llm-foundry/blob/e8eca4fa83f3fec69ad482465f839fb7dcfbfb0d/llmfoundry/models/utils/config_moe_args.py#L68
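For reference, here is a minimal sketch of the device-mesh idea in plain PyTorch, in the spirit of that llm-foundry code. The mesh dimension names and the expert-parallel degree below are assumptions for illustration, not OLMo's actual setup; run it under torchrun.

```python
# Sketch: split the world into a data/weight-parallel axis and an expert-parallel axis.
# Run with e.g.: torchrun --nproc_per_node=8 this_script.py
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

world_size = dist.get_world_size()
expert_parallel_degree = 2                       # assumed: experts sharded across 2 ranks
weight_parallel_degree = world_size // expert_parallel_degree

# 2-D mesh; dimension names here are illustrative.
mesh = init_device_mesh(
    "cuda",
    (weight_parallel_degree, expert_parallel_degree),
    mesh_dim_names=("weight_parallel", "expert_parallel"),
)

# The 1-D sub-mesh for the expert axis is what the MoE layers would use
# so that expert weights are sharded only along that axis.
expert_mesh = mesh["expert_parallel"]
print(f"rank {dist.get_rank()}: expert-parallel group size = {expert_mesh.size()}")
```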

In any case, regular FSDP with full sharding is enough for most models, and it is what we used for OLMoE-1B-7B.
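As a rough sketch of what "FSDP with full sharding" means in plain PyTorch (the tiny model and mixed-precision choice below are placeholders, not OLMoE's actual training loop):

```python
# Sketch: wrap a model with PyTorch FSDP using FULL_SHARD, which shards
# parameters, gradients, and optimizer state across ranks.
# Run with e.g.: torchrun --nproc_per_node=8 this_script.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    MixedPrecision,
)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder model standing in for the real MoE transformer.
model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024)).cuda()

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,          # fully sharded
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    use_orig_params=True,
    device_id=torch.cuda.current_device(),
)
```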

wdlctc commented 1 month ago

Got it, many thanks! Does the default pretraining script use FSDP? Do you know where I can tune the FSDP settings?

Muennighoff commented 1 month ago

Yes, it uses FSDP, and you can change its settings here: https://github.com/allenai/OLMoE/blob/b032a4a4984c3ec3cee21f81f26b70fa5f788a09/configs/OLMoE-1B-7B-0824.yml#L93
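The linked line sits in the config's `fsdp` block. The snippet below is only an illustration of the kind of settings that block exposes; the exact keys and values are in the linked OLMoE-1B-7B-0824.yml and should be taken from there.

```yaml
# Illustrative FSDP block (keys shown here are assumptions; see the linked config).
fsdp:
  wrapping_strategy: null          # how modules are grouped into FSDP units
  sharding_strategy: FULL_SHARD    # shard params, grads, and optimizer state
  precision: mixed                 # FSDP mixed-precision policy
```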