WongKinYiu / yolov9

Implementation of paper - YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
GNU General Public License v3.0

Multi-GPU DDP issue: Expected to have finished reduction in the prior iteration before starting a new one. #559

Open · letdivedeep opened this issue 3 weeks ago

letdivedeep commented 3 weeks ago

I am getting a "RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one." when running the YOLOv9 GELAN segmentation model on an AWS p4d.24xlarge instance.

Below are the error logs; the same traceback and RuntimeError were raised on every rank (0-7):

       5/99        32G     0.2012     0.0821     0.1693     0.7552         40       1024:  15%|█▍        | 527/3625 08:41
       5/99        32G     0.2011    0.08249     0.1696     0.7551         48       1024:  15%|█▍        | 531/3625 08:45
       5/99        32G     0.2011    0.08244     0.1698     0.7551         53       1024:  15%|█▍        | 534/3625 08:48
       5/99        32G      0.201     0.0825     0.1696     0.7551         36       1024:  15%|█▍        | 537/3625 08:51
       5/99        32G      0.201     0.0825     0.1696     0.7551         36       1024:  15%|█▍        | 538/3625 08:51
Traceback (most recent call last):
  File "segment/train.py", line 676, in <module>
    main(opt)
  File "segment/train.py", line 572, in main
    train(opt.hyp, opt, device, callbacks)
  File "segment/train.py", line 316, in train
    pred = model(imgs)  # forward
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad (identical on ranks 0-7): model.22.cv4.2.2.bias, model.22.cv4.2.2.weight, model.22.cv4.2.1.bn.bias, model.22.cv4.2.1.bn.weight, model.22.cv4.2.1.conv.weight, model.22.cv4.2.0.bn.bias, model.22.cv4.2.0.bn.weight, model.22.cv4.2.0.conv.weight, model.22.cv4.1.2.bias, model.22.cv4.1.2.weight, model.22.cv4.1.1.bn.bias, model.22.cv4.1.1.bn.weight, model.22.cv4.1.1.conv.weight, model.22.cv4.1.0.bn.bias, model.22.cv4.1.0.bn.weight, model.22.cv4.1.0.conv.weight, model.22.cv4.0.2.bias, model.22.cv4.0.2.weight, model.22.cv4.0.1.bn.bias, model.22.cv4.0.1.bn.weight, model.22.cv4.0.1.conv.weight, model.22.cv4.0.0.bn.bias, model.22.cv4.0.0.bn.weight, model.22.cv4.0.0.conv.weight, model.22.proto.cv3.bn.bias, model.22.proto.cv3.bn.weight, model.22.proto.cv3.conv.weight, model.22.proto.cv2.bn.bias, model.22.proto.cv2.bn.weight, model.22.proto.cv2.conv.weight, model.22.proto.cv1.bn.bias, model.22.proto.cv1.bn.weight, model.22.proto.cv1.conv.weight
Parameter indices which did not receive grad (identical on ranks 0-7): 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506

       5/99        32G     0.2007    0.08235        nan     0.7537         43       1024:  15%|█▍        | 538/3625 08:52
       5/99        32G     0.2007    0.08235        nan     0.7537         43       1024:  15%|█▍        | 539/3625 08:52
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3442406 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3442403) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
segment/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-08-15_11:44:45
  host      : d4f275498371
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3442404)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-08-15_11:44:45
  host      : d4f275498371
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 3442405)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-08-15_11:44:45
  host      : d4f275498371
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 3442407)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-08-15_11:44:45
  host      : d4f275498371
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 3442408)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-08-15_11:44:45
  host      : d4f275498371
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 3442409)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-08-15_11:44:45
  host      : d4f275498371
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 3442410)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-15_11:44:45
  host      : d4f275498371
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3442403)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

error_log.txt

I even tried using 'find_unused_parameters=True', but with that the model gets stuck.
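For reference, a minimal sketch (not the actual code in segment/train.py) of where I pass that flag when the model is wrapped for DDP; the stand-in setup and names below are only illustrative:

    import os
    import torch.distributed as dist
    from torch import nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def wrap_for_ddp(model: nn.Module) -> nn.Module:
        # The launcher (torch.distributed.launch / torchrun) sets LOCAL_RANK,
        # RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT in the environment.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        if not dist.is_initialized():
            dist.init_process_group(backend="nccl")
        model = model.cuda(local_rank)
        # find_unused_parameters=True lets the reducer tolerate parameters that
        # receive no gradient in an iteration (here: the model.22 proto / cv4
        # segmentation head), at the cost of an extra graph traversal per step.
        return DDP(model, device_ids=[local_rank], output_device=local_rank,
                   find_unused_parameters=True)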

Yerma-bit commented 3 weeks ago

Same issue here. Were you able to solve it?

letdivedeep commented 3 weeks ago

@Yerma-bit No, I have not been able to solve it yet. @WongKinYiu @ws6125 any feedback you can provide on this?

Yerma-bit commented 3 weeks ago

@letdivedeep I still have the same issue. I tried this yolov9 repo and also the Ultralytics YOLOv9, and the problem is similar.

letdivedeep commented 3 weeks ago

@Yerma-bit Have you raised this issue with Ultralytics? If so, can you point me to the thread?

Yerma-bit commented 3 weeks ago

@letdivedeep Not yet, but I found out that the problem could be with CUDA. It failed for me with CUDA 12.2, but then I tried on another machine with CUDA 11.6 and the multi-GPU training worked well.
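If it helps to compare setups, a quick way to print the versions that matter here, i.e. the CUDA version the installed torch wheel was built against (which can differ from the driver-level version nvidia-smi reports):

    import torch
    import torch.distributed as dist

    # Versions relevant to this kind of multi-GPU failure.
    print("torch          :", torch.__version__)
    print("built for CUDA :", torch.version.cuda)
    print("cuDNN          :", torch.backends.cudnn.version())
    print("NCCL available :", dist.is_nccl_available())
    print("GPUs visible   :", torch.cuda.device_count())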

letdivedeep commented 3 weeks ago

@Yerma-bit That is great to know. Can you share your environment details?

Yerma-bit commented 3 weeks ago

@letdivedeep Sure: NVIDIA-SMI 510.108.03, Driver Version: 510.108.03, CUDA Version: 11.6, and torch 1.13 (cu116). Also, do not forget to clean your dataset cache.
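For the cache, as far as I can tell the dataloader stores label metadata in *.cache files next to the dataset; a small helper along these lines (the dataset path is a placeholder) deletes them so they are rebuilt on the next run:

    from pathlib import Path

    # Placeholder path - point this at your own dataset root.
    dataset_root = Path("../datasets/my_dataset")

    # Removing the *.cache files forces a fresh scan of images/labels next time.
    for cache_file in dataset_root.rglob("*.cache"):
        print("removing", cache_file)
        cache_file.unlink()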