WongKinYiu / yolov9

Implementation of paper - YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
GNU General Public License v3.0

WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801678 milliseconds before timing out. #539

Open letdivedeep opened 2 months ago

letdivedeep commented 2 months ago

Hi @WongKinYiu @ws6125, thanks for the wonderful work.

When training the model on a custom dataset in a multi-GPU setup, after a few epochs I get the error WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801678 milliseconds before timing out. I am using an AWS g5.24xlarge instance for the trace logs below, which is a 4-GPU instance (A10 GPUs).

I am using CUDA 11.4 with NCCL version 2.11.4-1. I also tried reducing the batch size and increasing the Docker shared memory, but the issue persists.

Below is the Docker command I use to start the container:

docker run -it --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 4446:8888 -p 4445:6006 --name dashcam-yolov9-de -it -v /home/ubuntu/dataset:/yolov9/dataset/ -v /home/ubuntu/saved_models:/yolov9/saved_models/ --shm-size=64g dashcam-aws-yolov9:v1

and the command I use to launch training:

python -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 segment/train.py --workers 88 --device 0,1,2,3 --sync-bn --batch 24 --data data/coco_road_signs_segment.yaml --img 1280 --cfg models/segment/gelan-c-seg.yaml --weights gelan-c-seg.pt --name gelan-c-seg --hyp hyp.scratch-high.yaml --epochs 60 --close-mosaic 15 --save-period 4 --label-smoothing 0.08 --noval --optimizer 'AdamW' --project saved_models/yolov9_seg_chk/

Below is a shortened version of the error stack:


Plotting labels to saved_models/yolov9_v4_13July/gelan-c-seg/labels.jpg...
Image sizes 1280 train, 1280 val
Using 40 dataloader workers
Logging results to saved_models/yolov9_v4_13July/gelan-c-seg
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       0/99      17.9G       0.89     0.4345      0.965     0.9191         29       1280: 100%|██████████| 6463/6463 1:17:53
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 4560/4560 20:33
                   all      45592     146403      0.734      0.573      0.626      0.482      0.728      0.544       0.59      0.375

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       1/99      16.8G     0.8084     0.3834      0.795     0.8901         21       1280: 100%|██████████| 6463/6463 1:16:39
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)     Mask(P          R      mAP50  mAP50-95): 100%|██████████| 4560/4560 20:21
                   all      45592     146403      0.778      0.637      0.693      0.551      0.775      0.605      0.658      0.425

      Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
       2/99      16.8G     0.7481     0.3544     0.7066     0.8709         13       1280: 100%|█████████▉| 6462/6463 1:14:15[E ProcessGroupNCCL.cpp:607] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800752 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:607] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800849 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:607] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800866 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800867 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:607] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800895 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800909 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800910 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801678 milliseconds before timing out.
       2/99      16.8G     0.7481     0.3544     0.7066     0.8709         13       1280: 100%|█████████▉| 6462/6463 1:44:52
Traceback (most recent call last):
  File "segment/train.py", line 646, in <module>
    main(opt)
  File "segment/train.py", line 542, in main
    train(opt.hyp, opt, device, callbacks)
  File "segment/train.py", line 304, in train
    scaler.scale(loss).backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 352, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 99, in backward
    torch.distributed.all_reduce(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1288, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 0.  Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801678 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:361] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801678 milliseconds before timing out.
Traceback (most recent call last):
  File "segment/train.py", line 646, in <module>
Traceback (most recent call last):
  File "segment/train.py", line 646, in <module>
    main(opt)
  File "segment/train.py", line 542, in main
    train(opt.hyp, opt, device, callbacks)
  File "segment/train.py", line 304, in train
    main(opt)
  File "segment/train.py", line 542, in main
    scaler.scale(loss).backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 352, in backward
    train(opt.hyp, opt, device, callbacks)
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)  File "segment/train.py", line 304, in train

  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
scaler.scale(loss).backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 352, in backward
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 99, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    torch.distributed.all_reduce(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1288, in all_reduce
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 99, in backward
    torch.distributed.all_reduce(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1288, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 2.  Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800909 milliseconds before timing out.
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 1.  Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800910 milliseconds before timing out.
Traceback (most recent call last):
  File "segment/train.py", line 646, in <module>
    main(opt)
  File "segment/train.py", line 542, in main
    train(opt.hyp, opt, device, callbacks)
  File "segment/train.py", line 304, in train
    scaler.scale(loss).backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 352, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 99, in backward
    torch.distributed.all_reduce(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1288, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 3.  Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800867 milliseconds before timing out.
Traceback (most recent call last):
  File "segment/train.py", line 646, in <module>
    main(opt)
  File "segment/train.py", line 542, in main
    train(opt.hyp, opt, device, callbacks)
  File "segment/train.py", line 304, in train
    scaler.scale(loss).backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 352, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 99, in backward
    torch.distributed.all_reduce(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1288, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 4.  Original reason for failure was: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800866 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:361] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800909 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:361] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800910 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:361] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800867 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:361] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800866 milliseconds before timing out.
Traceback (most recent call last):
  File "segment/train.py", line 646, in <module>
Traceback (most recent call last):
  File "segment/train.py", line 646, in <module>
    main(opt)
  File "segment/train.py", line 542, in main
    train(opt.hyp, opt, device, callbacks)
  File "segment/train.py", line 304, in train
        main(opt)scaler.scale(loss).backward()

  File "segment/train.py", line 542, in main
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 352, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)train(opt.hyp, opt, device, callbacks)

  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
  File "segment/train.py", line 304, in train
        scaler.scale(loss).backward()Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 352, in backward
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 99, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    torch.distributed.all_reduce(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1288, in all_reduce
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 99, in backward
    work = group.allreduce([tensor], opts)
RuntimeError    : torch.distributed.all_reduce(NCCL communicator was aborted on rank 6.  Original reason for failure was: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800752 milliseconds before timing out.

  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1288, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 7.  Original reason for failure was: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800895 milliseconds before timing out.
Traceback (most recent call last):
  File "segment/train.py", line 646, in <module>
    main(opt)
  File "segment/train.py", line 542, in main
    train(opt.hyp, opt, device, callbacks)
  File "segment/train.py", line 304, in train
    scaler.scale(loss).backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 352, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 99, in backward
    torch.distributed.all_reduce(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1288, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 5.  Original reason for failure was: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800849 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:361] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800895 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:361] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800752 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:361] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800849 milliseconds before timing out.
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
segment/train.py FAILED
-----------------------------------------------------
Failures:
[1]:
  time      : 2024-07-13_16:32:42
  host      : 2d8ca0f6bf54
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 1299)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1299
[2]:
  time      : 2024-07-13_16:32:42
  host      : 2d8ca0f6bf54
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 1300)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1300
[3]:
  time      : 2024-07-13_16:32:42
  host      : 2d8ca0f6bf54
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 1301)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1301
[4]:
  time      : 2024-07-13_16:32:42
  host      : 2d8ca0f6bf54
  rank      : 4 (local_rank: 4)
  exitcode  : -6 (pid: 1302)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1302
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-13_16:32:42
  host      : 2d8ca0f6bf54
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 1298)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1298
=====================================================
root@2d8ca0f6bf54:/yolov9# Connection to ec2-3-85-20-244.compute-1.amazonaws.com closed by remote host.
Connection to ec2-3-85-20-244.compute-1.amazonaws.com closed.

I also tried different instance types such as the AWS p4d.24xlarge, which is an 8-GPU (A100) instance, and still hit the same error. Any pointers on this would be very helpful.

optimisticabouthumanity commented 2 months ago

In your train.py:

  1. import timedelta:
    from datetime import timedelta
  2. increase the timeout for your distributed process group (at line 525):
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=18000))

Kindly let me know if the above solution works for you. (The default timeout for the NCCL backend is 30 minutes; the call above raises it to 5 hours.)
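For reference, a minimal sketch of what the modified initialization in segment/train.py might look like; the surrounding code, the import placement, and the 5-hour value are assumptions taken from the suggestion above rather than the repository's exact code:

    from datetime import timedelta

    import torch.distributed as dist

    # Raise the collective timeout from the 30-minute NCCL default so that a slow
    # step on one rank (e.g. a long validation pass) does not trip the watchdog
    # while the other ranks wait in all_reduce. 18000 s = 5 hours.
    dist.init_process_group(
        backend="nccl" if dist.is_nccl_available() else "gloo",
        timeout=timedelta(seconds=18000),
    )

Note that raising the timeout only gives slow collectives more headroom before the watchdog aborts the process group; it does not make the underlying step any faster.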

letdivedeep commented 2 months ago

@optimisticabouthumanity Thanks for the suggested changes; with them I no longer hit the timeout error.

However, when we resume model training from the previously saved checkpoints, the training speed remains the same as before, but the validation step now takes about 10x longer to complete than it did previously (it used to take only about 9.5 minutes). Do you have any idea why this might happen?

Screenshot 2024-07-23 at 10 17 42 PM

letdivedeep commented 2 months ago

@optimisticabouthumanity @WongKinYiu To add more to this issue: when we try to resume model training with noval set to true, training runs for 2 epochs after resuming and then gets stuck at the 3rd epoch and does not proceed, as seen in the screenshot below.

Screenshot 2024-07-24 at 11 32 01 AM

You can see that when training gets stuck, CPU utilization drops drastically.

Screenshot 2024-07-24 at 11 35 47 AM

Also, I tried using the checkpoints from the previously stopped run as pre-trained weights via --weights rather than --resume, but the issue persists.

I am using the AWS p4d.24xlarge instance for model training. This is my env:

CUDA version : 11.3
Pytorch version : 1.10.0
Python version : 3.7.11

@WongKinYiu @ws6125 @optimisticabouthumanity, is resuming from a checkpoint a known issue?

letdivedeep commented 2 months ago

@WongKinYiu @optimisticabouthumanity YOLOv9 DDP training getting stuck after a certain epoch is a bug in the training code base. Could you please look into this?