Questions about running the model with DeepSpeed

Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model. A low-resource Chinese llama+lora approach, with a structure modeled on alpaca

https://github.com/Facico/Chinese-Vicuna

Apache License 2.0

4.14k stars 425 forks

Open sunpenglv opened 1 year ago

sunpenglv commented 1 year ago

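Hello, when I run full fine-tuning with DeepSpeed, there is just barely enough memory. However, if another user starts using the GPU during training, DeepSpeed terminates because it runs out of GPU memory, and the training fails. How can I save the model when insufficient GPU memory is detected and then exit, so that training can be resumed later?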
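To make the request concrete, here is a minimal sketch of the behavior being asked for, not a confirmed solution from this repository. It assumes that a PyTorch CUDA out-of-memory failure surfaces as a `RuntimeError` whose message contains "out of memory". The names `train_with_checkpoints`, `is_oom_error`, `step_fn`, and `save_fn` are hypothetical and exist only for illustration. Because a save attempted after an out-of-memory error may itself fail, the sketch also saves periodically, so a recent checkpoint exists either way.

```python
from typing import Callable


def is_oom_error(exc: BaseException) -> bool:
    """Heuristic (assumption): treat a RuntimeError mentioning 'out of memory' as an OOM."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def train_with_checkpoints(
    step_fn: Callable[[int], None],   # hypothetical: runs one training step
    save_fn: Callable[[int], None],   # hypothetical: saves state after `n` completed steps
    start_step: int,
    num_steps: int,
    save_every: int,
) -> int:
    """Run steps start_step..num_steps-1 and return the step count of the last saved state.

    Returns start_step if nothing new was saved, so the caller knows where to resume.
    """
    last_saved = start_step
    step = start_step  # number of steps completed so far
    try:
        while step < num_steps:
            step_fn(step)
            step += 1
            # Periodic checkpoint: the reliable fallback if an emergency save fails.
            if step % save_every == 0:
                save_fn(step)
                last_saved = step
    except RuntimeError as exc:
        if not is_oom_error(exc):
            raise  # unrelated errors are not swallowed
        # Best-effort emergency save of the state after `step` completed steps.
        try:
            save_fn(step)
            last_saved = step
        except Exception:
            pass  # keep the last periodic checkpoint as the resume point
    return last_saved
```

With DeepSpeed, `save_fn` could wrap the engine's `save_checkpoint(save_dir)` and resuming could use `load_checkpoint(load_dir)`. Note that DeepSpeed expects every rank to call `save_checkpoint`, so an emergency save triggered by an out-of-memory error on a single rank may hang or fail; in that case the periodic checkpoint is what the run would resume from.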