alibaba / Pai-Megatron-Patch

The official repo of Pai-Megatron-Patch for large-scale LLM & VLM training, developed by Alibaba Cloud.
Apache License 2.0
674 stars, 94 forks

Does LLaVA not support resuming training from a saved checkpoint after a run fails partway through? #225

Closed: liulong11 closed this issue 1 month ago

liulong11 commented 4 months ago

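I trained LLaVA with finetune_megatron_llava.py, saved a checkpoint partway through, and then manually interrupted the run. When I start training again, it does not automatically detect the checkpoint I already saved and continue from it; instead, it starts training from scratch.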
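
For anyone reproducing this, a quick check is whether the save directory contains a checkpoint that a resume could pick up at all. The sketch below assumes the run writes the upstream Megatron-LM checkpoint layout (a `latest_checkpointed_iteration.txt` tracker file next to `iter_XXXXXXX` directories); whether finetune_megatron_llava.py follows that layout, and whether its launch script passes `--load` pointing at the save directory, is not confirmed in this issue. The helper name `find_latest_iteration` is hypothetical and is not part of Pai-Megatron-Patch.

```python
import os
from typing import Optional

# Tracker file name used by upstream Megatron-LM (assumed to apply here).
TRACKER_FILENAME = "latest_checkpointed_iteration.txt"


def find_latest_iteration(checkpoint_dir: str) -> Optional[int]:
    """Return the latest saved iteration, or None if nothing resumable is found.

    Hypothetical diagnostic helper, not part of Pai-Megatron-Patch.
    """
    tracker_path = os.path.join(checkpoint_dir, TRACKER_FILENAME)
    if not os.path.isfile(tracker_path):
        # No tracker file: a Megatron-style loader would find nothing to resume.
        return None

    with open(tracker_path, "r") as f:
        text = f.read().strip()

    if not text.isdigit():
        # Non-numeric content (for example "release") marks converted weights,
        # not a training state with an iteration count.
        return None

    iteration = int(text)
    # Megatron-LM names checkpoint directories with a zero-padded iteration.
    iteration_dir = os.path.join(checkpoint_dir, f"iter_{iteration:07d}")
    return iteration if os.path.isdir(iteration_dir) else None
```

If this returns `None` for the save directory, the checkpoint was not written in the expected layout. If it returns an iteration number but training still restarts from zero, the load path passed to the training script is the more likely place to look.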