bubbliiiing / yolov8-pytorch

This is a yolov8-pytorch repository that can be used to train on your own dataset.
GNU General Public License v3.0

YoloV8 single-machine multi-GPU training hangs partway through the first epoch #42

Open answerman1 opened 11 months ago

answerman1 commented 11 months ago

I trained on the same dataset with the author's multi-GPU code for both YoloV7 and YoloV8, using identical training parameters. Multi-GPU training runs normally on YoloV7, but on YoloV8 it gets stuck at loss.backward(), which eventually triggers an NCCL timeout and terminates the run. Comparing the training code of the two versions, only the loss part seems to have changed. Is that the cause, or could it be something else?
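As a general debugging sketch (the timeout value and environment variables here are assumptions, not settings taken from this repository): the Timeout(ms)=1800000 in the traceback below is the default 30-minute NCCL collective timeout, so shortening it and enabling the distributed debug logs makes the hanging collective fail quickly and with more context, instead of stalling the whole first epoch.

```python
import datetime
import os

import torch.distributed as dist

# Set before the process group is created so they take effect.
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # log NCCL communicator / collective activity
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra consistency checks on collectives
# The watchdog message in the traceback shows async error handling is already
# active in this setup; setting it explicitly is harmless.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# Hypothetical shorter timeout (default is 30 minutes, the 1800000 ms in the error),
# so a stuck ALLREDUCE aborts within minutes and the logs show which rank stalled.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(minutes=5))
```

If the resulting logs show only some ranks entering the ALLREDUCE, the problem is a rank-level desynchronization rather than a slow interconnect.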

scaler.scale(loss_value).backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 130, in backward
    torch.distributed.all_reduce(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=113679, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809876 milliseconds before timing out.

Traceback (most recent call last):
  File "train.py", line 608, in <module>
    fit_one_epoch(model_train, model, ema, yolo_loss, loss_history, eval_callback, optimizer, epoch, epoch_step, epoch_step_val, gen, gen_val, UnFreeze_Epoch, Cuda, fp16, scaler, save_period, save_dir, local_rank)
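For context (an interpretation, not information from this thread): the frame in torch/nn/modules/_functions.py that calls torch.distributed.all_reduce during backward belongs to SyncBatchNorm, so rank 2 is waiting in a SyncBatchNorm gradient all-reduce that the other ranks never reached. A classic cause is ranks issuing different numbers of backward passes, for example one rank conditionally skipping an iteration. The helper below is a hypothetical sketch, not code from this repo; the name check_ranks_in_sync and its placement are my assumptions. It all-reduces a step counter right before backward, so a divergence is reported immediately instead of surfacing as a 30-minute NCCL timeout.

```python
import torch
import torch.distributed as dist


def check_ranks_in_sync(step: int, device: torch.device) -> None:
    """Hypothetical debugging helper: verify all ranks reach backward the same number of times.

    Each rank contributes its local step counter; if the ranks are in sync, the
    all-reduced sum equals step * world_size. A rank that skipped an iteration
    (and with it the SyncBatchNorm collectives) produces a mismatch here.
    """
    counter = torch.tensor([step], dtype=torch.long, device=device)
    dist.all_reduce(counter, op=dist.ReduceOp.SUM)
    expected = step * dist.get_world_size()
    if counter.item() != expected:
        raise RuntimeError(
            f"rank {dist.get_rank()} desynchronized at step {step}: "
            f"sum of counters {counter.item()} != expected {expected}"
        )


# Sketch of usage inside the training loop, right before the line from the traceback:
#   check_ranks_in_sync(iteration, loss_value.device)
#   scaler.scale(loss_value).backward()
```

Combined with a shorter NCCL timeout, this narrows the problem down to whichever rank and iteration first diverges, which is usually enough to find the conditional branch in the new loss code that the other ranks did not take.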