OpenMOSS / CoLLiE

Collaborative Training of Large Language Models in an Efficient Way
https://openlmlab-collie.readthedocs.io
Apache License 2.0
405 stars, 58 forks

How to resume training after the model is interrupted or restarted #142

Closed 459737087 closed 8 months ago

459737087 commented 9 months ago

I have already saved the model. How do I load the previously saved model and continue training from it?

KaiLv69 commented 9 months ago

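The trainer has a load_model method that loads saved model weights, and a load_checkpoint method that loads a checkpoint. Call load_model or load_checkpoint before training starts to continue the previous training run.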
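To illustrate why the choice between the two methods can matter, here is a minimal sketch using a toy stand-in trainer. `ToyTrainer` and its `save_checkpoint` method are hypothetical and are not CoLLiE's implementation. The sketch assumes that a checkpoint bundles optimizer state together with the weights, which is a common convention but is not confirmed in this thread.

```python
class ToyTrainer:
    """Hypothetical stand-in for a trainer; NOT CoLLiE's implementation."""

    def __init__(self, lr=0.1, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.weight = 0.0    # plays the role of the model weights
        self.velocity = 0.0  # plays the role of optimizer state (momentum buffer)

    def train(self, steps):
        # Minimize (weight - 3)^2 with SGD plus momentum.
        for _ in range(steps):
            grad = 2.0 * (self.weight - 3.0)
            self.velocity = self.momentum * self.velocity + grad
            self.weight -= self.lr * self.velocity

    def save_checkpoint(self):
        # Assumption: a checkpoint holds weights AND optimizer state.
        return {"weight": self.weight, "velocity": self.velocity}

    def load_model(self, ckpt):
        # Weights only: the optimizer state starts fresh.
        self.weight = ckpt["weight"]

    def load_checkpoint(self, ckpt):
        # Weights plus optimizer state.
        self.weight = ckpt["weight"]
        self.velocity = ckpt["velocity"]


# Uninterrupted run of 4 steps.
reference = ToyTrainer()
reference.train(4)

# Interrupted run: 2 steps, then save.
first = ToyTrainer()
first.train(2)
ckpt = first.save_checkpoint()

# Resume with the full checkpoint, loaded before training starts.
resumed = ToyTrainer()
resumed.load_checkpoint(ckpt)
resumed.train(2)

# Resume with weights only.
weights_only = ToyTrainer()
weights_only.load_model(ckpt)
weights_only.train(2)

print(reference.weight, resumed.weight, weights_only.weight)
```

In this toy setup the full-checkpoint resume reproduces the uninterrupted run, while the weights-only resume diverges because the momentum buffer is reset.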

459737087 commented 9 months ago

Hi, load_checkpoint does not distribute the load across multiple GPUs at the same time. Is there a bug in this code? @KaiLv69

KaiLv69 commented 9 months ago

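What is the exact error message? When calling load_checkpoint, the parallelism settings must be the same in both training runs (the one that saved the checkpoint and the one that loads it).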
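One way to catch a mismatch early is to record the parallelism settings next to the checkpoint and compare them before loading. The following is only a sketch under stated assumptions: the helper functions, the `parallel.json` file name, and the key names `dp_size`, `tp_size`, and `pp_size` are illustrative and are not taken from this thread or from CoLLiE's checkpoint format.

```python
import json
import os
import tempfile

# Illustrative key names for data, tensor, and pipeline parallel sizes.
PARALLEL_KEYS = ("dp_size", "tp_size", "pp_size")


def save_parallel_settings(ckpt_dir, settings):
    """Write the parallelism settings next to the checkpoint (hypothetical helper)."""
    os.makedirs(ckpt_dir, exist_ok=True)
    with open(os.path.join(ckpt_dir, "parallel.json"), "w") as f:
        json.dump({k: settings[k] for k in PARALLEL_KEYS}, f)


def check_parallel_settings(ckpt_dir, settings):
    """Raise ValueError if the current settings differ from the saved ones."""
    with open(os.path.join(ckpt_dir, "parallel.json")) as f:
        saved = json.load(f)
    mismatched = {
        k: (saved[k], settings[k]) for k in PARALLEL_KEYS if saved[k] != settings[k]
    }
    if mismatched:
        raise ValueError(
            f"parallel settings differ from checkpoint (saved, current): {mismatched}"
        )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ckpt_dir:
        save_parallel_settings(ckpt_dir, {"dp_size": 2, "tp_size": 4, "pp_size": 1})
        # Same settings: passes silently.
        check_parallel_settings(ckpt_dir, {"dp_size": 2, "tp_size": 4, "pp_size": 1})
        # Different tensor parallel size: raises ValueError.
        try:
            check_parallel_settings(ckpt_dir, {"dp_size": 2, "tp_size": 2, "pp_size": 1})
        except ValueError as err:
            print(err)
```

Running the check before calling load_checkpoint turns a silent layout mismatch into an explicit error message.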

459737087 commented 9 months ago

OOM (CudaOutOfMemory). I also found that the checkpoint load ran on only a single graphics card. @KaiLv69

459737087 commented 9 months ago

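One more question: after load_checkpoint, does training start from scratch or continue? For example, I loaded a model that had already trained for 10 epochs, but the progress display starts from 0, and the checkpoint saved after another 10 epochs is still named 10. Doesn't that mean this is not continued training but training from scratch? @KaiLv69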
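For reference, "continued training" in the sense asked about here usually means the epoch counter is restored as well, so a run resumed after 10 epochs starts at epoch 10 and later saves are numbered accordingly. The sketch below shows that behavior with a hypothetical `train` function and a `state.json` file. Both are assumptions made for illustration and do not describe how CoLLiE stores its progress.

```python
import json
import os
import tempfile


def train(ckpt_dir, total_epochs, resume=False):
    """Toy training loop that persists how many epochs are done (hypothetical)."""
    state_path = os.path.join(ckpt_dir, "state.json")
    start_epoch = 0
    if resume and os.path.exists(state_path):
        with open(state_path) as f:
            start_epoch = json.load(f)["epochs_done"]

    epochs_run = []
    for epoch in range(start_epoch, total_epochs):
        epochs_run.append(epoch)  # a real loop would train for one epoch here
        with open(state_path, "w") as f:
            json.dump({"epochs_done": epoch + 1}, f)
    return epochs_run


with tempfile.TemporaryDirectory() as ckpt_dir:
    first_run = train(ckpt_dir, total_epochs=10)
    # Resuming picks up at epoch 10 instead of 0.
    resumed_run = train(ckpt_dir, total_epochs=15, resume=True)
    with open(os.path.join(ckpt_dir, "state.json")) as f:
        epochs_done = json.load(f)["epochs_done"]

print(first_run[0], first_run[-1], resumed_run[0], resumed_run[-1], epochs_done)
```

If the counter is not restored, the run behaves as described in the question: the display restarts at 0 and the next save reuses the old epoch number.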

KaiLv69 commented 8 months ago

The new version has already fixed this problem.