Running on a system with multiple GPUs fails (DDP failure)

koush commented 1 week ago

Describe the bug

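Training on a system with multiple GPUs fails during the sanity-check validation pass: collate_fn in yolo/tools/data_loader.py calls torch.stack on images of different sizes and raises a RuntimeError (full log below).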

To Reproduce

Set up a system with 2 GPUs.

Run the training command:

python yolo/lazy.py task=train dataset=coco use_wandb=True

(.venv) (base) koush@koushik-ubuntuvm:~/yolov9mit/YOLO$ python yolo/lazy.py task=train dataset=coco use_wandb=True
[11/07/24 07:18:44] INFO     📄 Created log folder: runs/train/v9-dev                                                                                     logging_utils.py:319
[11/07/24 07:18:44] INFO     ⚡ Using 16bit Automatic Mixed Precision (AMP)                                                                                    rank_zero.py:63
                    INFO     ⚡ Trainer already configured with model summary callbacks: [<class 'yolo.utils.logging_utils.YOLORichModelSummary'>]. Skipping   rank_zero.py:63
                             setting a default `ModelSummary` callback.                                                                                                       
                    INFO     ⚡ GPU available: True (cuda), used: True                                                                                         rank_zero.py:63
                    INFO     ⚡ TPU available: False, using: 0 TPU cores                                                                                       rank_zero.py:63
                    INFO     ⚡ HPU available: False, using: 0 HPUs                                                                                            rank_zero.py:63
                    INFO     🚜 Building YOLO                                                                                                                       yolo.py:35
                    INFO       🏗  Building backbone                                                                                                                 yolo.py:38
                    INFO       🏗  Building neck                                                                                                                     yolo.py:38
                    INFO       🏗  Building head                                                                                                                     yolo.py:38
                    INFO       🏗  Building detection                                                                                                                yolo.py:38
                    INFO       🏗  Building auxiliary                                                                                                                yolo.py:38
                    INFO     ✅ Success load model & weight                                                                                                        yolo.py:177
                    INFO     ✅ Dataset val2017      already verified.                                                                               dataset_preparation.py:74
                    INFO     ✅ Dataset annotations  already verified.                                                                               dataset_preparation.py:74
[11/07/24 07:18:45] INFO     📦 Loaded val2017 cache                                                                                                         data_loader.py:60
                    INFO     ✅ Dataset train2017    already verified.                                                                               dataset_preparation.py:74
                    INFO     ✅ Dataset annotations  already verified.                                                                               dataset_preparation.py:74
[11/07/24 07:18:51] INFO     📦 Loaded train2017 cache                                                                                                       data_loader.py:60
[11/07/24 07:18:53] INFO     ⚡ You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set        rank_zero.py:63
                             `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read                   
                             https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision                             
[11/07/24 07:18:53] INFO     ⚡ Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2                                                                       distributed.py:296
[11/07/24 07:19:05] INFO     ⚡ Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2                                                                       distributed.py:296
[11/07/24 07:19:05] INFO     ⚡ ----------------------------------------------------------------------------------------------------                           rank_zero.py:63
                             distributed_backend=nccl                                                                                                                         
                             All distributed processes registered. Starting with 2 processes                                                                                  
                             ----------------------------------------------------------------------------------------------------                                             

[11/07/24 07:19:05] INFO     🌐 Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.                   logging_utils.py:250
[11/07/24 07:19:06] INFO     🌐 Currently logged in as: koushd (koushd-scrypted). Use `wandb login --relogin` to force relogin                            logging_utils.py:250
                    INFO     🌐 Tracking run with wandb version 0.18.6                                                                                    logging_utils.py:250
                    INFO     🌐 Run data is saved locally in runs/train/v9-dev/wandb/run-20241107_071906-8qyi2ify                                         logging_utils.py:250
                    INFO     🌐 Run `wandb offline` to turn off syncing.                                                                                  logging_utils.py:250
                    INFO     🌐 Syncing run v9-dev                                                                                                        logging_utils.py:250
                    INFO     🌐 ⭐️ View project at https://wandb.ai/koushd-scrypted/YOLO                                                                  logging_utils.py:250
                    INFO     🌐 🚀 View run at https://wandb.ai/koushd-scrypted/YOLO/runs/8qyi2ify                                                        logging_utils.py:250
                    INFO     🈶 Found stride of model [8, 16, 32]                                                                                    bounding_box_utils.py:304
                    INFO     ✅ Success load loss function                                                                                               loss_functions.py:141
[11/07/24 07:19:06] INFO     ⚡ LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]                                                                                         cuda.py:61
[11/07/24 07:19:06] INFO     ⚡ LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]                                                                                         cuda.py:61
┏━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃   ┃ Name   ┃ Type                 ┃ Params ┃ Mode  ┃
┡━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ 0 │ model  │ YOLO                 │ 51.2 M │ train │
│ 1 │ metric │ MeanAveragePrecision │      0 │ train │
└───┴────────┴──────────────────────┴────────┴───────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Attributes                             ┃ Value      ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Trainable params                       │ 51.2 M     │
│ Non-trainable params                   │ 96         │
│ Total params                           │ 51.2 M     │
│ Total estimated model params size (MB) │ 204        │
│ Modules in train mode                  │ 1222       │
│ Modules in eval mode                   │ 0          │
└────────────────────────────────────────┴────────────┘

Error executing job with overrides: ['task=train', 'dataset=coco', 'use_wandb=True']
Traceback (most recent call last):
  File "/home/koush/yolov9mit/YOLO/yolo/lazy.py", line 35, in main
    trainer.fit(model)
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 1023, in _run_stage
    self._run_sanity_check()
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 1052, in _run_sanity_check
    val_loop.run()
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/lightning/pytorch/loops/utilities.py", line 178, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 128, in run
    batch, batch_idx, dataloader_idx = next(data_fetcher)
                                       ^^^^^^^^^^^^^^^^^^
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/lightning/pytorch/loops/fetchers.py", line 133, in __next__
    batch = super().__next__()
            ^^^^^^^^^^^^^^^^^^
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/lightning/pytorch/loops/fetchers.py", line 60, in __next__
    batch = next(self.iterator)
            ^^^^^^^^^^^^^^^^^^^
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/lightning/pytorch/utilities/combined_loader.py", line 142, in __next__
    out = next(self.iterators[0])
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1465, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1491, in _process_data
    data.reraise()
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/torch/_utils.py", line 715, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/koush/yolov9mit/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/koush/yolov9mit/YOLO/yolo/tools/data_loader.py", line 207, in collate_fn
    batch_images = torch.stack(batch_images)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: stack expects each tensor to be equal size, but got [3, 256, 1024] at entry 0 and [3, 416, 864] at entry 16

Expected behavior

Training proceeds.

System Info

OS: Ubuntu 22.04
Python Version: 3.10
PyTorch Version: 2.5.1
CUDA/cuDNN/MPS Version: CUDA 12.6
YOLO Model Version: 9c

Additional Notes

Modifying lazy.py to use num_devices=1, num_nodes=1 works as expected, so only the multi-GPU (DDP) path is failing.

henrytsui000 commented 1 week ago

Hi,

Could you please tell me your Git commit version? I recently fixed a very similar issue in commit 309271089a6f916197e4e7977f77738ba1521bfb.

Best regards, Henry Tsui

koush commented 1 week ago

commit 2522f723d0db5c72a6e49a7331b844290ef0af34 (HEAD -> main, origin/main, origin/TEST, origin/HEAD)
Author: henrytsui000 <henrytsui000@gmail.com>
Date:   Tue Nov 5 14:43:04 2024 +0800

    ✅ [Pass] test in multiclass label&dynamic shape

koush commented 6 days ago

Updated to 959b9b05667f6b9a1f349bc2c9843d039e405f60, issue persists.

henrytsui000 commented 6 days ago

Hi,

Can you try turning off the dynamic_shape setting in yolo/config/task/validation.yaml? You can do this by modifying the configuration as follows:

task: validation

data:
  ...
  dynamic_shape: False
...

Alternatively, you can disable it during training with the following command:

python yolo/lazy.py task=train ... task.validation.data.dynamic_shape=False
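
For readers unfamiliar with dotted command-line overrides, here is a minimal sketch of what `task.validation.data.dynamic_shape=False` does conceptually: it walks the nested config and replaces one leaf value. This is not Hydra's actual implementation; `apply_override` and the sample `cfg` dict are hypothetical, and the default of `True` is an assumption based on the advice to turn the setting off.

```python
def apply_override(cfg, override):
    """Set a nested dict value from an 'a.b.c=value' string.

    Hypothetical helper for illustration only; it handles booleans and
    leaves any other value as a plain string.
    """
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        node = node[name]  # raises KeyError if the path does not exist
    node[leaf] = {"True": True, "False": False}.get(raw_value, raw_value)
    return cfg


# Assumed shape of the relevant part of the config (not copied from the repo).
cfg = {"task": {"validation": {"data": {"dynamic_shape": True}}}}

apply_override(cfg, "task.validation.data.dynamic_shape=False")
print(cfg["task"]["validation"]["data"]["dynamic_shape"])
```

The override only changes the value for that run; the YAML file on disk stays as it is.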

I suspect the issue is caused by the sampler and the dynamic_shape setting. Turning it off will disable the auto-adjustment of the input shape in the validation phase. While this might result in a slightly lower mAP, it will enable multiple GPU validation.
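
To illustrate how a sampler and per-batch shapes could interact, here is a small pure-Python simulation. It rests on assumptions that are not confirmed in this thread: that with dynamic_shape on, consecutive samples are pre-grouped so each batch shares one shape, and that under DDP each rank receives a strided slice of the indices (rank, rank + world_size, ...), which is how a non-shuffling distributed sampler hands out indices. The helper names and the toy dataset are hypothetical.

```python
def sequential_batches(indices, batch_size):
    """Group indices into consecutive batches, like a non-shuffling batch sampler."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]


def strided_shard(num_samples, rank, world_size):
    """Indices one rank would see under strided sharding: rank, rank + world_size, ..."""
    return list(range(rank, num_samples, world_size))


# Toy dataset (assumption): each run of BATCH_SIZE consecutive samples shares a shape.
BATCH_SIZE = 4
shapes = [(256, 1024)] * BATCH_SIZE + [(416, 864)] * BATCH_SIZE


def shapes_per_batch(indices):
    """Return the distinct shapes found in each batch built from `indices`."""
    return [
        sorted(set(shapes[i] for i in batch))
        for batch in sequential_batches(indices, BATCH_SIZE)
    ]


single_gpu = shapes_per_batch(list(range(len(shapes))))
rank0 = shapes_per_batch(strided_shard(len(shapes), rank=0, world_size=2))

print("single GPU:", single_gpu)  # every batch contains exactly one shape
print("DDP rank 0:", rank0)       # the batch mixes both shapes
```

In this toy setup the single-GPU batches stay uniform, while rank 0's batch spans two shape groups, which is the kind of input that makes a stacking collate function fail. Whether this is what actually happens in the repository's data loader is only a guess.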

If you need a higher-performance model, you can run validation on a single GPU after training, or you can wait for me to find time to fix this properly.

Best regards,
Henry Tsui

koush commented 6 days ago

That extra command line parameter seems to have suppressed the issue.

koush commented 6 days ago

Training completes 1 epoch, performs the validation step seemingly with no error, and then hangs.
