[Help] How can I avoid saving early checkpoints during fine-tuning?

THUDM / ChatGLM2-6B

ChatGLM2-6B: An Open Bilingual Chat LLM | 开源双语对话语言模型

[Help] How can I avoid saving early checkpoints during fine-tuning? #654

Open ybdesire opened 8 months ago

ybdesire commented 8 months ago

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

For example, suppose the fine-tuning configuration is as follows:

    --max_steps 3000 \
    --save_steps 5 \

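With these settings a checkpoint is saved at steps 5, 10, 15, 20, ..., 3000, which is far too many checkpoints.

I would like to skip everything before step 2000 and save checkpoints only at steps 2000, 2005, 2010, ..., 3000. How should I configure this?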
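To put numbers on the difference, here is a small sketch that counts the checkpoints in both cases. It assumes a checkpoint is written whenever the step is a multiple of save_steps; the name first_wanted_step is made up for illustration and is not a real training flag.

```python
max_steps = 3000
save_steps = 5
first_wanted_step = 2000  # hypothetical threshold taken from the question

# Steps at which a checkpoint is written with the current settings.
all_saves = list(range(save_steps, max_steps + 1, save_steps))

# Steps that should actually be kept.
wanted_saves = [step for step in all_saves if step >= first_wanted_step]

print(len(all_saves))     # 600
print(len(wanted_saves))  # 201
```

So roughly two thirds of the checkpoints written by the current settings are unwanted.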

Expected Behavior

No response

Steps To Reproduce

    --max_steps 3000 \
    --save_steps 5 \

Environment

OS: Ubuntu 20.04
Python: 3.8
Transformers: 4.26.1
PyTorch: 1.12
CUDA Support: True

Anything else?

No response

hhy150 commented 8 months ago

(1) First train for 2000 steps with --max_steps 2000 --save_steps 2000. (2) Then continue training from that checkpoint with --max_steps 3000 --save_steps 5.

ybdesire commented 8 months ago

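Thanks for the reply, that is one way to do it. Is there a way to achieve this directly in a single training run? On some platforms a submitted training job cannot be interrupted and then resumed.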

hhy150 commented 8 months ago

I'm not sure about that one, sorry.