The following error occurred during single-machine multi-GPU training:
RuntimeError: NCCL communicator was aborted on rank 1. Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=204699, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800104 milliseconds before timing out.
The traceback is as follows:
Traceback (most recent call last):
File "train.py", line 610, in
fit_one_epoch(model_train, model, ema, yolo_loss, loss_history, eval_callback, optimizer, epoch, epoch_step, epoch_step_val, gen, gen_val, UnFreeze_Epoch, Cuda, fp16, scaler, save_period, save_dir, local_rank)
File "/apsara/TempRoot/Odps/ytrec_20231222071249521gmacm0sr1bm6_93d0ce33_26a4_47d3_889d_d09f09e82671_AlgoTask_0_0/PyTorchWorker@l80h15251.ea120#0/workspace/utils/utils_fit.py", line 54, in fit_one_epoch
scaler.scale(loss_value).backward()
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/init.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 130, in backward
torch.distributed.all_reduce(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
work = group.allreduce([tensor], opts)
Does anyone know how to resolve this?
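For context: the 1800000 ms in the error matches PyTorch's default 30-minute process-group timeout, and the failing all_reduce is issued from torch/nn/modules/_functions.py, i.e. SyncBatchNorm's backward, so one rank appears to be stalled or desynchronized during backward. Below is a minimal sketch, assuming the script is launched with torchrun, of how the timeout could be raised when initializing the process group (the helper name init_distributed and the 120-minute value are only illustrative); note this relaxes the watchdog rather than fixing a rank that is genuinely hung.

import os
from datetime import timedelta

import torch
import torch.distributed as dist

def init_distributed(timeout_minutes: int = 120):
    # LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR/PORT are set by torchrun.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    # Raise the collective timeout above the 30-minute default that appears
    # as Timeout(ms)=1800000 in the watchdog error.
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(minutes=timeout_minutes),
    )
    return local_rank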