HuangLK / transpeeder

train llama on a single A100 80G node using 🤗 transformers and 🚀 Deepspeed Pipeline Parallelism
Apache License 2.0

File not found error #28

Closed: AlvL1225 closed this issue 1 year ago

AlvL1225 commented 1 year ago

Hi Huang, nice work!

When I tried to train a 13B model, I got the following error:

    [Errno 2] No such file or directory: 'llama_13b_pp/global_step001/zero_pp_rank_0_mp_rank_03_optim_states.pt'

Any ideas on this? The `convert2ckpt.py` script does not generate files with the `zero_pp_...` prefix.

HuangLK commented 1 year ago

Add `load_optimizer_states=False` and `load_lr_scheduler_states=False` when loading the checkpoint:

    # Load module weights only; skip the optimizer/LR-scheduler shards.
    engine.load_checkpoint(model_args.init_ckpt,
                           load_module_only=True,
                           load_optimizer_states=False,
                           load_lr_scheduler_states=False)
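For context, here is a minimal sketch of where that call sits. The `model`, `ds_config`, and `model_args` names are assumptions standing in for whatever the training script defines, not code taken from this repo:

    import deepspeed

    # Build the (pipeline-parallel) engine first; `model` and `ds_config`
    # are placeholders for the script's PipelineModule and DeepSpeed config.
    engine, _, _, _ = deepspeed.initialize(
        model=model,
        config=ds_config,
        model_parameters=[p for p in model.parameters() if p.requires_grad],
    )

    # With load_module_only=True and both states flags set to False,
    # DeepSpeed restores the module weights and never looks for the
    # zero_pp_*_optim_states.pt shards.
    engine.load_checkpoint(model_args.init_ckpt,
                           load_module_only=True,
                           load_optimizer_states=False,
                           load_lr_scheduler_states=False)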
HuangLK commented 1 year ago

In addition, I modified the way the checkpoint is loaded so that it skips the `zero_pp_xxx` files. See this commit
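The commit itself is not reproduced in this thread; as a rough illustration of the idea only (a hypothetical helper, not the repo's actual change), skipping the optimizer shards amounts to reading the model-state files while ignoring anything with the `zero_pp_` prefix:

    import glob
    import os

    import torch

    def load_model_states_only(ckpt_dir):
        """Hypothetical helper: gather per-file model states from a DeepSpeed
        checkpoint directory, skipping zero_pp_*_optim_states.pt shards."""
        states = {}
        for path in sorted(glob.glob(os.path.join(ckpt_dir, "*.pt"))):
            name = os.path.basename(path)
            if name.startswith("zero_pp_"):
                continue  # optimizer shards that convert2ckpt.py never writes
            states[name] = torch.load(path, map_location="cpu")
        return states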

AlvL1225 commented 1 year ago

> In addition, I modified the way the checkpoint is loaded so that it skips the `zero_pp_xxx` files. See this commit

Thanks! I have another question. When I tried to use pp4dp2 on an 8xA100 node, I encountered the following while initializing the engine:

    RuntimeError: CUDA error: an illegal memory access was encountered
    CUDA kernel errors might be asynchronously reported at some other API call,
    so the stacktrace below might be incorrect.

Do you have any idea on this?
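Since the message itself warns that the stack trace may point at the wrong call, a common first step when chasing an illegal memory access (general CUDA debugging practice, not advice taken from this thread) is to make kernel launches synchronous before any CUDA work happens:

    import os

    # Must run before torch/deepspeed create a CUDA context. Kernels then
    # launch synchronously, so the traceback lands on the op that actually
    # faulted (at the cost of a slower run).
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"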