allenai / OLMoE

OLMoE: Open Mixture-of-Experts Language Models
https://arxiv.org/abs/2409.02060
Apache License 2.0

MOE Export Parallelism Training Script #8

Closed wdlctc closed 2 months ago

wdlctc commented 2 months ago

Hello OLMoE team,

I’m currently exploring training scripts for models using Mixture of Experts (MOE) and was wondering if there are any existing or planned scripts that handle expert parallelism during the export phase for MOE models? Specifically, I'm interested in techniques for parallelizing the export process for efficient training in distributed environments.

If not, could you provide any guidance on how to implement this or any references that would be useful for such a setup?

Thank you!

Best regards, cheng luo

Muennighoff commented 2 months ago

What is the export phase?

wdlctc commented 2 months ago

Expert parallelism, sorry for the typo.

Muennighoff commented 2 months ago

I see; you can activate expert parallelism here https://github.com/allenai/OLMo/blob/cd0004be3f5a82fff8b4b990a00be1377e084eac/olmo/config.py#L1369 but I think it is not working at the moment, as the expert params did not show up in the parameter count when I tried. I think you need to do something with a device mesh, like here https://github.com/mosaicml/llm-foundry/blob/e8eca4fa83f3fec69ad482465f839fb7dcfbfb0d/llmfoundry/models/utils/config_moe_args.py#L68
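For illustration, a minimal sketch of what such a device-mesh setup could look like in plain PyTorch; the mesh layout, dimension names, and expert-parallel degree here are assumptions for the example, not the OLMo or llm-foundry code:

```python
# Sketch only: build a 2-D device mesh so experts can be sharded across one
# mesh dimension ("ep") while data parallelism runs over the other ("dp").
# The expert-parallel degree and dimension names are illustrative assumptions.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")

world_size = dist.get_world_size()
expert_parallel_degree = 2                      # assumed: experts split over 2 ranks
data_parallel_degree = world_size // expert_parallel_degree

mesh = init_device_mesh(
    "cuda",
    (data_parallel_degree, expert_parallel_degree),
    mesh_dim_names=("dp", "ep"),
)

# Process group for the expert-parallel dimension; an MoE implementation would
# typically use a group like this for the all-to-all that routes tokens to
# experts living on other ranks.
ep_group = mesh["ep"].get_group()
```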

Anyway, regular FSDP with full sharding is enough for most models & is what we used for OLMoE-1B-7B.
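For reference, a minimal sketch of the fully-sharded FSDP wrapping described above, using a placeholder model rather than the actual OLMoE training setup:

```python
# Sketch only: wrap a placeholder model with PyTorch FSDP using FULL_SHARD,
# which shards parameters, gradients, and optimizer state across all ranks.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()      # stand-in for the MoE model

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,          # "fully sharded"
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
)
```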

wdlctc commented 2 months ago

I got it; many thanks! Does the default pretraining script use FSDP? Do you know where I can adjust the FSDP settings?

Muennighoff commented 2 months ago

Yes, it uses FSDP & you can change its settings here https://github.com/allenai/OLMoE/blob/b032a4a4984c3ec3cee21f81f26b70fa5f788a09/configs/OLMoE-1B-7B-0824.yml#L93