yangzhipeng1108 opened 1 year ago
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10646, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808898 milliseconds before timing out.
[2023-05-12 09:16:38,853] [INFO] [logging.py:93:log_dist] [Rank 0] [Torch] Checkpoint 24 is about to be saved!
Traceback (most recent call last):
File "finetune_moss.py", line 310, in
File "finetune_moss.py", line 272, in train
model.save_checkpoint(args.output_dir, global_step)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 3120, in save_checkpoint
self._checkpoint_tag_validation(tag)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 3072, in _checkpoint_tag_validation
dist.all_reduce(max_bhash, op=dist.ReduceOp.MAX)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py", line 123, in log_wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py", line 526, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/torch.py", line 53, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1534, in all_reduce
work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10646, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808898 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10646, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808898 milliseconds before timing out.
This problem is solved in this project: https://github.com/yangzhipeng1108/moss-finetune
Hello, could you share your training set and validation set?
In `finetune_moss.py`:

```python
if global_step % args.save_step == 0 and torch.cuda.current_device() == 0:
    model.save_checkpoint(args.output_dir, global_step)
```
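The traceback shows the hang happens inside `_checkpoint_tag_validation`, which calls `dist.all_reduce` — a collective that every rank must enter. The guard above only lets device 0 call `save_checkpoint`, so the other ranks never join the collective and rank 0's all-reduce eventually hits the 1800000 ms NCCL watchdog timeout. DeepSpeed's docs state that `save_checkpoint` must be called on all processes, not just rank 0 (it handles per-rank file writing internally). As a minimal, framework-free sketch of the broken vs. fixed guard (the helper names `should_save` and `broken_should_save` are illustrative, not from the repo):

```python
def should_save(global_step: int, save_step: int) -> bool:
    """Fixed guard: the decision depends only on the step counter, so every
    rank reaches save_checkpoint() and its internal all_reduce together."""
    return global_step % save_step == 0


def broken_should_save(global_step: int, save_step: int, rank: int) -> bool:
    """Broken guard from the snippet above: gating on rank/device 0 means
    the other ranks skip the collective, so rank 0's all_reduce hangs."""
    return global_step % save_step == 0 and rank == 0


# At a save step, the fixed guard admits all 4 ranks into the collective:
assert all(should_save(100, 100) for rank in range(4))
# ...while the broken guard admits only rank 0, which then times out waiting:
assert [broken_should_save(100, 100, r) for r in range(4)] == [True, False, False, False]
```

So the fix is to drop the `torch.cuda.current_device() == 0` check and call `model.save_checkpoint(args.output_dir, global_step)` on every rank.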