fe1ixxu / ALMA

State-of-the-art LLM-based translation models.
MIT License
440 stars 35 forks

OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory #43

Open gongye19 opened 6 months ago

gongye19 commented 6 months ago

The model files saved after parallel-sft training are broken: the config files are missing.

OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory
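A quick way to narrow this down is to check which of the weight files named in the error actually exist in the output directory. The following is a minimal sketch, not part of ALMA: `find_weight_files` is a hypothetical helper, and the filename list is copied from the error message above (newer `transformers` versions may also accept other formats that are not listed there).

```python
from pathlib import Path

# Filenames copied from the OSError message above; nothing else is assumed.
WEIGHT_FILES = (
    "pytorch_model.bin",
    "tf_model.h5",
    "model.ckpt.index",
    "flax_model.msgpack",
)


def find_weight_files(output_dir):
    """Return the weight filenames from WEIGHT_FILES that exist in output_dir.

    Hypothetical helper for illustration: an empty list means the loader
    would have none of the files it reported as missing.
    """
    root = Path(output_dir)
    return [name for name in WEIGHT_FILES if (root / name).is_file()]
```

If the returned list is empty for the training output directory, the save step did not write consolidated weights, which matches the error.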

fe1ixxu commented 6 months ago

Could you please share your env config and training command? I re-ran the script and did not hit this issue.

gongye19 commented 6 months ago

Could you please share your env config and training command? I re-ran the script and did not hit this issue.

Using DeepSpeed with ZeRO-3 and CPU offload causes the problem at the final save. After I switched the CPO stage to ZeRO-2, the model saved normally.
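For anyone wanting to try the same workaround, here is a minimal sketch of a ZeRO-2 DeepSpeed config without CPU offload. It is not the config shipped with ALMA: the output filename `ds_zero2_example.json` is hypothetical, and the `"auto"` values assume the training script goes through the Hugging Face Trainer DeepSpeed integration, which fills them in from its own arguments.

```python
import json


def make_zero2_config():
    """Build a minimal ZeRO stage 2 config with no offload sections.

    Assumption: "auto" values are resolved by the Hugging Face Trainer
    integration; replace them with concrete numbers if DeepSpeed is
    used directly.
    """
    return {
        "zero_optimization": {"stage": 2},  # stage 2 instead of stage 3
        "bf16": {"enabled": "auto"},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }


def write_config(path="ds_zero2_example.json"):
    """Write the config to a JSON file (hypothetical filename) and return the path."""
    with open(path, "w") as f:
        json.dump(make_zero2_config(), f, indent=2)
    return path
```

The resulting file would then be passed wherever the training command currently points at its ZeRO-3 config.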