NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] How to enable ZeRO 2/3 stages? #1156

Closed: polisettyvarma closed this issue 2 days ago

polisettyvarma commented 1 month ago

How do I enable ZeRO 2/3 stages? Similar to #589.

lmcafee-nvidia commented 1 month ago

I responded to this on https://github.com/NVIDIA/Megatron-LM/issues/589.

polisettyvarma commented 1 month ago

Please convert this issue to a feature request for ZeRO 2/3. Thank you.

carolove commented 1 month ago

I think this article, https://www.deepspeed.ai/tutorials/megatron/, is useful. DeepSpeed ZeRO 1/2 works with the latest Megatron-LM code.
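
For reference, a minimal sketch of enabling ZeRO stage 2 through DeepSpeed itself (the model, batch size, and hyperparameters below are placeholders, not values from the tutorial or this thread):

```python
# Minimal sketch: ZeRO stage 2 via DeepSpeed, not via Megatron-LM natively.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # placeholder model

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                   # shard optimizer states and gradients
        "overlap_comm": True,         # overlap gradient reduction with backward
        "contiguous_gradients": True
    },
}

# deepspeed.initialize returns an engine that performs the ZeRO partitioning;
# this assumes the script is launched with the deepspeed launcher.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```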

polisettyvarma commented 1 month ago

@carolove Thanks for the input. I am familiar with the DeepSpeed framework and how to enable all ZeRO stages there; my question here is about enabling ZeRO natively in this repo. Can you please share the commits that added ZeRO 2 support to the latest code of this repo? Thank you.

carolove commented 1 month ago

I am also looking for such an example.

SeunghyunSEO commented 1 month ago

Megatron-LM now has its own ZeRO-1 (it is called the distributed optimizer in this project), but if you are more familiar with DeepSpeed, how about using Megatron-DeepSpeed, @polisettyvarma? To the best of my knowledge, ZeRO-3 is not compatible with Megatron-LM's model parallelism (TP or PP). ZeRO-3 reduces VRAM usage and improves throughput by partitioning model parameters and broadcasting them on demand, whereas TP and PP partition the model in their own way and instead communicate activations (all-reducing or sending them in the forward and backward passes). So TP or PP leaves no room for communicating model parameters that way.
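
In Megatron-LM the distributed optimizer is enabled with the `--use-distributed-optimizer` launch flag. As a standalone illustration of the same ZeRO-1 idea (sharding optimizer state across data-parallel ranks), here is a sketch using PyTorch's `ZeroRedundancyOptimizer`; this is not Megatron-LM's own implementation:

```python
# Illustration of ZeRO-1 style optimizer-state sharding using plain PyTorch.
# This is an analogue of Megatron-LM's distributed optimizer, not its API.
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer

dist.init_process_group("nccl")  # assumes torchrun set the rank/world-size env vars
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model

# Each data-parallel rank keeps only its shard of the Adam state.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=1e-4,
)
```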

polisettyvarma commented 1 month ago

Thank you @SeunghyunSEO for your input. Yes, the Megatron-DeepSpeed repo can be used, but it is not up to date with Megatron-LM. I agree that ZeRO > 1 is not compatible with PP. My request here is for a similar ZeRO-style feature in Megatron-LM itself.

deepakn94 commented 1 month ago

We should have PyTorch FSDP support compatible with TP in the next couple of weeks.
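
For context, a minimal standalone PyTorch FSDP sketch (not the Megatron-LM integration mentioned above); `FULL_SHARD` roughly corresponds to ZeRO-3 and `SHARD_GRAD_OP` to ZeRO-2:

```python
# Standalone PyTorch FSDP sketch; the model below is a placeholder.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")  # assumes torchrun set the rank/world-size env vars
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()

# Parameters, gradients, and optimizer state are sharded across ranks.
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```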

polisettyvarma commented 1 month ago

Thank you @deepakn94 for sharing this information.

SeunghyunSEO commented 2 days ago

@polisettyvarma @deepakn94 https://github.com/NVIDIA/Megatron-LM/commit/e1993fa6f70763523a84432ab1f5eb42e77ccf2a#diff-a7ca552e38c01a3a0cacbe37cec383c05743aeaf8143e57fd0901f4139d4a1a9R119 was merged into main 2 hours ago.