NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] How to save model trained by TP=8 and load model in TP=4 or other? #1009

Closed. wplf closed this issue 2 months ago.

wplf commented 2 months ago

Hi there. Thank you for the great project.

I have a question about checkpoint saving and loading.
My model is saved as a normal state dict, but split into 8 tensor-parallel parts. How can I load it on 4 or 2 GPUs, given that GPU memory is sufficient?

iter_0150000/mp_rank_00:  distrib_optim.pt  model_optim_rng.pt
iter_0150000/mp_rank_01:  distrib_optim.pt  model_optim_rng.pt
iter_0150000/mp_rank_02:  distrib_optim.pt  model_optim_rng.pt
iter_0150000/mp_rank_03:  distrib_optim.pt  model_optim_rng.pt
iter_0150000/mp_rank_04:  distrib_optim.pt  model_optim_rng.pt
iter_0150000/mp_rank_05:  distrib_optim.pt  model_optim_rng.pt
iter_0150000/mp_rank_06:  distrib_optim.pt  model_optim_rng.pt
iter_0150000/mp_rank_07:  distrib_optim.pt  model_optim_rng.pt
wplf commented 2 months ago

Hi, I've found the solution: tools/checkpoint/saver_megatron.py should do the job. Thank you for the great repo, again!
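For anyone landing here later, a minimal sketch of how the checkpoint converter (which drives saver_megatron.py) is typically invoked to re-partition a TP=8 checkpoint to TP=4. The script name and flags below (tools/checkpoint/convert.py, --loader/--saver megatron, --target-tensor-parallel-size, --target-pipeline-parallel-size) are assumptions based on recent Megatron-LM versions; in older checkouts the entry point was tools/checkpoint/util.py, and the example paths are placeholders, so verify the arguments against your own tree:

python tools/checkpoint/convert.py \
    --model-type GPT \
    --loader megatron \
    --saver megatron \
    --load-dir /path/to/checkpoints_tp8 \
    --save-dir /path/to/checkpoints_tp4 \
    --target-tensor-parallel-size 4 \
    --target-pipeline-parallel-size 1

As far as I know, the converter re-partitions the model weights (model_optim_rng.pt); the distributed-optimizer shards (distrib_optim.pt) are not converted, so optimizer state may need to be reinitialized when resuming training from the re-partitioned checkpoint.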