AlvL1225 closed this issue 1 year ago.
Add `load_optimizer_states=False` and `load_lr_scheduler_states=False` when loading the checkpoint:

```python
engine.load_checkpoint(model_args.init_ckpt,
                       load_module_only=True,
                       load_optimizer_states=False,
                       load_lr_scheduler_states=False)
```
In addition, I modified the checkpoint-loading code so that it can skip the `zero_pp_xxx` files. See this commit
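The commit itself is not reproduced in this thread, but the idea of skipping the `zero_pp_xxx` shards can be sketched as a simple filter over the checkpoint file list before loading. This is a hedged illustration, not DeepSpeed API: `select_model_state_files` is a hypothetical helper name.

```python
# Hedged sketch (the actual commit is not shown here): skip the ZeRO
# partitioned optimizer shards (zero_pp_*) and keep only the model-state
# shards when assembling the list of checkpoint files to load.
# `select_model_state_files` is a hypothetical helper, not DeepSpeed API.
def select_model_state_files(filenames):
    """Drop zero_pp_* optimizer shards; keep model-state files."""
    return [f for f in filenames if not f.startswith("zero_pp_")]

files = [
    "mp_rank_00_model_states.pt",
    "zero_pp_rank_0_mp_rank_00_optim_states.pt",
    "zero_pp_rank_1_mp_rank_00_optim_states.pt",
]
print(select_model_state_files(files))  # → ['mp_rank_00_model_states.pt']
```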
Thanks! I have another question. When I tried to use pp4dp2 on an 8xA100 node, I encountered

```
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
```

when initializing the engine. Do you have any idea on this?
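As the error message itself notes, CUDA reports illegal-memory-access errors asynchronously, so the stack trace may point at the wrong call. A common first debugging step (a suggestion, not a fix for this issue) is to rerun with blocking kernel launches so the trace lands on the failing kernel; `train.py` below is a placeholder for the actual launch command.

```shell
# Force synchronous CUDA kernel launches so errors surface at the real
# failing call ("train.py" is a placeholder for the actual launcher).
export CUDA_LAUNCH_BLOCKING=1
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"   # confirm the env var is set
# python train.py   <-- real training run goes here
```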
Hi Huang, nice work!

When I tried to train a 13B model, I got this error:

```
[Errno 2] No such file or directory: 'llama_13b_pp/global_step001/zero_pp_rank_0_mp_rank_03_optim_states.pt'
```

Any ideas on this? The `convert2ckpt.py` script does not generate files with the `zero_pp` prefix.