Closed: wplf closed this issue 2 months ago
Hi there. Thank you for this great project.
I have a question about saving and loading checkpoints. My model is saved as a normal state dict, but split into 8 parts. How can I load it on 4 or 2 GPUs, assuming they have sufficient GPU memory?
iter_0150000/mp_rank_00: distrib_optim.pt model_optim_rng.pt
iter_0150000/mp_rank_01: distrib_optim.pt model_optim_rng.pt
iter_0150000/mp_rank_02: distrib_optim.pt model_optim_rng.pt
iter_0150000/mp_rank_03: distrib_optim.pt model_optim_rng.pt
iter_0150000/mp_rank_04: distrib_optim.pt model_optim_rng.pt
iter_0150000/mp_rank_05: distrib_optim.pt model_optim_rng.pt
iter_0150000/mp_rank_06: distrib_optim.pt model_optim_rng.pt
iter_0150000/mp_rank_07: distrib_optim.pt model_optim_rng.pt
Hi, I've found the solution: tools/checkpoint/saver_megatron.py might do the job. Thank you again for the great repo!
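For anyone landing on this thread, the idea behind going from 8 parts to 4 or 2 can be sketched in plain Python. This is only an illustration and not what tools/checkpoint/saver_megatron.py does internally. It assumes the eight mp_rank_XX directories are tensor-parallel shards, that a weight is split evenly along its first dimension, and it uses nested lists in place of tensors. The names group_source_ranks and reshard_rows are hypothetical helpers made up for this example.

```python
def group_source_ranks(source_parts, target_parts):
    """Map each target rank to the source rank directories it would merge.

    Assumes the directory naming seen in this thread (mp_rank_00 ... mp_rank_07)
    and that source_parts is a multiple of target_parts.
    """
    if source_parts % target_parts != 0:
        raise ValueError("source_parts must be a multiple of target_parts")
    per_target = source_parts // target_parts
    return [
        [f"mp_rank_{rank:02d}" for rank in range(t * per_target, (t + 1) * per_target)]
        for t in range(target_parts)
    ]


def reshard_rows(shards, target_parts):
    """Concatenate row-wise shards of one weight, then split them evenly again.

    shards is a list of shards in rank order; each shard is a list of rows.
    Real checkpoints hold tensors, so this is a stand-in for a concatenate
    followed by an even split along the sharded dimension.
    """
    full = [row for shard in shards for row in shard]
    if len(full) % target_parts != 0:
        raise ValueError("rows cannot be split evenly across target_parts")
    size = len(full) // target_parts
    return [full[i * size:(i + 1) * size] for i in range(target_parts)]


# Example: 8 single-row shards regrouped for 4 and for 2 target ranks.
shards = [[[i]] for i in range(8)]
groups_for_4 = group_source_ranks(8, 4)
resharded_for_2 = reshard_rows(shards, 2)
```

Under these assumptions each of the 4 target ranks would merge two consecutive source shards, and each of the 2 target ranks would merge four. Real layers can be more involved (fused or interleaved weights may need special handling, and optimizer state is stored separately), which is why a dedicated conversion tool is the safer route.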