Closed liuzhipengchd closed 7 months ago
eval_loss: 5.4856109619140625, eval_accuracy: 0.23814019353316027, eval_runtime: 2.8708, eval_samples_per_second: 27.518, eval_steps_per_second: 3.483, eval_ppl: 241.19626072010593, epoch: 0.0056
2%|▏ | 50/3000 [00:53<41:20, 1.19it/s]
100%|██████████| 10/10 [00:02<00:00, 4.00it/s]
[2023-06-08 20:52:02,089] [ INFO] - Saving model checkpoint to ./checkpoints/chatglm-6b/checkpoint-50
[2023-06-08 20:52:02,100] [ INFO] - Configuration saved in ./checkpoints/chatglm-6b/checkpoint-50/config.json
[2023-06-08 20:52:09,430] [ INFO] - tokenizer config file saved in ./checkpoints/chatglm-6b/checkpoint-50/tokenizer_config.json
[2023-06-08 20:52:09,430] [ INFO] - Special tokens file saved in ./checkpoints/chatglm-6b/checkpoint-50/special_tokens_map.json
LAUNCH INFO 2023-06-08 20:52:30,807 Exit code -9
Take a look at log/workerlog.0 or log/workerlog.1. Is there a detailed error message in there?
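One way to do that check quickly is to scan every worker log for suspicious lines. The sketch below is only an illustration: the function name `find_error_lines` and the keyword list are hypothetical, and it assumes the logs sit in a `log/` directory next to where the job was launched, as the paths above suggest.

```python
from pathlib import Path

# Hypothetical keyword list; adjust it to whatever your logs actually contain.
KEYWORDS = ("Traceback", "Error", "Killed", "out of memory")


def find_error_lines(log_dir="log", keywords=KEYWORDS):
    """Return (file name, line number, text) for each matching line in workerlog.* files."""
    hits = []
    for path in sorted(Path(log_dir).glob("workerlog.*")):
        # errors="replace" keeps the scan going if a log contains broken bytes.
        with open(path, encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                lowered = line.lower()
                if any(k.lower() in lowered for k in keywords):
                    hits.append((path.name, lineno, line.rstrip()))
    return hits


if __name__ == "__main__":
    for name, lineno, text in find_error_lines():
        print(f"{name}:{lineno}: {text}")
```

If nothing matches, the last few lines of each worker log are usually still worth reading by hand.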
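The logs are already gone. In the end I set the model to save only once. The file that stores the gradients in each checkpoint is too large, 48 GB on a single node. Could that be the cause, with the file stream write failing?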
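Without the worker logs, two cheap checks can still narrow this down. The sketch below assumes the launcher reports the worker's return code using the Python `subprocess` convention, where a negative value means the process was terminated by that signal number. Under that assumption, `Exit code -9` would mean SIGKILL, which on Linux often comes from the out-of-memory killer, so host memory during the save is a candidate alongside disk space. The helper names and the 48 GiB figure are illustrative only, and the signal lookup assumes a POSIX system.

```python
import shutil
import signal


def describe_exit_code(code):
    """Interpret a return code using the Python subprocess convention."""
    if code < 0:
        # A negative code means the process was terminated by signal number -code.
        return f"killed by {signal.Signals(-code).name}"
    return f"exited with status {code}"


def has_room_for_checkpoint(output_dir, checkpoint_bytes):
    """True if the filesystem holding output_dir has at least checkpoint_bytes free."""
    return shutil.disk_usage(output_dir).free >= checkpoint_bytes


if __name__ == "__main__":
    print(describe_exit_code(-9))
    # 48 GiB is taken from the per-node checkpoint size mentioned above.
    print(has_room_for_checkpoint(".", 48 * 1024**3))
```

If the disk has room and the code does decode to SIGKILL, checking the kernel log for out-of-memory messages around the time of the save would be a reasonable next step.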