Tele-AI / Telechat


global_step problem when saving after fine-tuning the large model #50

Open Ricardo-Ping opened 4 days ago

Ricardo-Ping commented 4 days ago

How do I set a reasonable number of steps? Why does my run always keep going until the disk is full?

epoch:1, global_step:925, step:3700 cur_batch_loss: 1.515671968460083
saving the final model ...
convert lora to linear layer successfully!
Traceback (most recent call last):
  File "E:\Model\Telechat\deepspeed-telechat\sft\main.py", line 415, in <module>
    main()
  File "E:\Model\Telechat\deepspeed-telechat\sft\main.py", line 406, in main
    save_hf_format(model, tokenizer, args)
  File "E:\Model\Telechat\deepspeed-telechat\utils\utils.py", line 47, in save_hf_format
    model_to_save.save_pretrained(output_dir, state_dict=save_dict)
  File "E:\Anaconda\envs\Telechat\lib\site-packages\transformers\modeling_utils.py", line 2486, in save_pretrained
    safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
  File "E:\Anaconda\envs\Telechat\lib\site-packages\safetensors\torch.py", line 281, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
safetensors_rust.SafetensorError: Error while serializing: IoError(Os { code: 112, kind: StorageFull, message: "There is not enough space on the disk." })
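
The traceback above ends in a storage-full error (OS code 112, `StorageFull`) raised while the final model shards were being written. One way to catch this earlier is to compare free disk space against a rough size estimate before saving. The sketch below is only an illustration: `required_bytes`, `has_room_for_checkpoint`, the parameter count, the bytes-per-parameter value, and the safety factor are all hypothetical and not part of the Telechat scripts. It also does not explain what filled the disk in the first place.

```python
import shutil


def required_bytes(num_params: int, bytes_per_param: int = 2,
                   safety_factor: float = 1.2) -> int:
    """Rough size of a saved model: parameters x bytes each, plus headroom.

    bytes_per_param=2 assumes fp16/bf16 weights; use 4 for fp32.
    """
    return int(num_params * bytes_per_param * safety_factor)


def has_room_for_checkpoint(output_dir: str, num_params: int,
                            bytes_per_param: int = 2,
                            safety_factor: float = 1.2) -> bool:
    """Return True if the drive holding output_dir has enough free space.

    output_dir must already exist, because shutil.disk_usage needs a real path.
    """
    free = shutil.disk_usage(output_dir).free
    return free >= required_bytes(num_params, bytes_per_param, safety_factor)


if __name__ == "__main__":
    # Hypothetical parameter count, for illustration only.
    example_params = 7_000_000_000
    if not has_room_for_checkpoint(".", example_params):
        print("Not enough free disk space for the final save; free space first.")
```

Calling a check like this just before the save step (here `save_hf_format(model, tokenizer, args)` in the traceback) would turn a failure midway through writing shards into an early, readable warning.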