Thanks for your excellent work and for sharing the code.
I have a question about training InternVL2:
In my experiment, I set --save_only_model to avoid saving the "global_step" checkpoint. However, the training loss did not converge after 1 epoch, and when I restored the checkpoint and resumed training, the loss increased (possibly because the AdamW optimizer states were not restored). Are there any training tips for this situation?
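
For reference, this is roughly my setup; a minimal sketch assuming the standard Hugging Face `TrainingArguments` API, with hypothetical paths and config names (not the actual InternVL2 launch script):

```python
from transformers import TrainingArguments

# Hypothetical output dir and DeepSpeed config, shown only to
# illustrate the flag in question.
args = TrainingArguments(
    output_dir="work_dirs/internvl2",
    num_train_epochs=2,
    save_only_model=True,  # saves weights only; the global_step directory
                           # with optimizer/scheduler states is not written
    deepspeed="zero_stage1_config.json",
)
```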
Can DeepSpeed resume training without the "global_step" checkpoint? In my understanding, it is required for resuming, because the optimizer states are stored in it.
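
As a quick sanity check, this is how I verified that the optimizer state is missing from my checkpoints; a minimal sketch with a hypothetical checkpoint path:

```python
import os

ckpt = "work_dirs/internvl2/checkpoint-1000"  # hypothetical checkpoint path

# A full DeepSpeed checkpoint contains a global_step*/ subdirectory holding
# the partitioned optimizer states; with --save_only_model it is absent,
# so a resume can restore the weights but not the AdamW moments.
has_optimizer_state = any(
    name.startswith("global_step") for name in os.listdir(ckpt)
)
print("optimizer state present:", has_optimizer_state)
```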