yangzhipeng1108 opened 1 year ago
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10646, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808898 milliseconds before timing out.
[2023-05-12 09:16:38,853] [INFO] [logging.py:93:log_dist] [Rank 0] [Torch] Checkpoint 24 is about to be saved!
Traceback (most recent call last):
File "finetune_moss.py", line 310, in
File "finetune_moss.py", line 272, in train
model.save_checkpoint(args.output_dir, global_step)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 3120, in save_checkpoint
self._checkpoint_tag_validation(tag)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 3072, in _checkpoint_tag_validation
dist.all_reduce(max_bhash, op=dist.ReduceOp.MAX)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py", line 123, in log_wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py", line 526, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/torch.py", line 53, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1534, in all_reduce
work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10646, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808898 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10646, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808898 milliseconds before timing out.
This problem is solved in this project: https://github.com/yangzhipeng1108/moss-finetune
Hello, could you share your training set and validation set?
In `finetune_moss.py`:

```python
if global_step % args.save_step == 0 and torch.cuda.current_device() == 0:
    model.save_checkpoint(args.output_dir, global_step)
```
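The traceback shows the hang happens inside `_checkpoint_tag_validation`, which calls `dist.all_reduce` — a collective that every rank must enter. The guard above only lets device 0 call `save_checkpoint`, so the other ranks never join the collective and rank 0's all-reduce eventually hits the 1800000 ms NCCL watchdog timeout. DeepSpeed's docs state that `save_checkpoint` must be called on all processes, not just rank 0 (it handles per-rank file writing internally). As a minimal, framework-free sketch of the broken vs. fixed guard (the helper names `should_save` and `broken_should_save` are illustrative, not from the repo):

```python
def should_save(global_step: int, save_step: int) -> bool:
    """Fixed guard: the decision depends only on the step counter, so every
    rank reaches save_checkpoint() and its internal all_reduce together."""
    return global_step % save_step == 0


def broken_should_save(global_step: int, save_step: int, rank: int) -> bool:
    """Broken guard from the snippet above: gating on rank/device 0 means
    the other ranks skip the collective, so rank 0's all_reduce hangs."""
    return global_step % save_step == 0 and rank == 0


# At a save step, the fixed guard admits all 4 ranks into the collective:
assert all(should_save(100, 100) for rank in range(4))
# ...while the broken guard admits only rank 0, which then times out waiting:
assert [broken_should_save(100, 100, r) for r in range(4)] == [True, False, False, False]
```

So the fix is to drop the `torch.cuda.current_device() == 0` check and call `model.save_checkpoint(args.output_dir, global_step)` on every rank.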