NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[PyTorch/Segmentation/nnUNet] If multiple GPUs requested code will not run #1189

Open vijaypshah opened 2 years ago

vijaypshah commented 2 years ago

Related to Model/Framework(s)
PyTorch/Segmentation/nnUNet

Describe the bug
I am trying to run the example provided for nnUNet. The code works fine when I use a single GPU, but if I request 2 GPUs it does not run. The following command works:

python scripts/benchmark.py --mode train --gpus 1 --dim 3 --batch_size 2 --amp

The following command gets stuck:

python scripts/benchmark.py --mode train --gpus 2 --dim 3 --batch_size 2 --amp

387 training, 97 validation, 484 test examples
Filters: [32, 64, 128, 256, 320, 320],
Kernels: [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]
Strides: [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]]
/opt/conda/lib/python3.8/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Torchmetrics v0.9 introduced a new argument class property called full_state_update that has not been set for this class (Dice). The property determines if update by default needs access to the full metric state. If this is not the case, significant speedups can be achieved and we recommend setting this to False. We provide an checking function from torchmetrics.utilities import check_forward_full_state_property that can be used to check if the full_state_update=True (old and potential slower behaviour, default for now) or if full_state_update=False can be used safely.
  warnings.warn(*args, **kwargs)
Using 16bit native Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default ModelSummary callback.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:133: UserWarning: You defined a validation_step but have no val_dataloader. Skipping val loop.
  rank_zero_warn("You defined a validation_step but have no val_dataloader. Skipping val loop.")
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
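
For reference, a quick sanity check that both GPUs and the NCCL backend are visible inside the Singularity container (hypothetical diagnostic commands, not part of the nnUNet scripts):

# Should list both GPUs and print "2 True" if CUDA devices and NCCL are usable in the container
nvidia-smi
python -c "import torch; print(torch.cuda.device_count(), torch.distributed.is_nccl_available())"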

To Reproduce
Steps to reproduce the behavior:

  1. Install '...':
     git clone https://github.com/NVIDIA/DeepLearningExamples
     cd DeepLearningExamples/PyTorch/Segmentation/nnUNet
     docker build -t nnunet .
     mkdir data results
     sudo singularity build nnunetMultiGPU.sif docker-daemon://nnunet:latest

  2. Launch:
     singularity shell --nv -B ${PWD}/data:/data -B ${PWD}/results:/results -B ${PWD}:/workspace nnunetMultiGPU.sif

Expected behavior
Training starts as shown in the example.

Environment
Please provide at least:

michal2409 commented 2 years ago

Hi,

I've run the command for 2 GPUs and it works fine for me:

root@6e38dc6f86a4:/workspace/nnunet_pyt# python scripts/benchmark.py --mode train --gpus 2 --dim 3 --batch_size 2 --amp
387 training, 97 validation, 484 test examples
Filters: [32, 64, 128, 256, 320, 320],
Kernels: [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]
Strides: [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]]
/opt/conda/lib/python3.8/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Torchmetrics v0.9 introduced a new argument class property called `full_state_update` that has
                not been set for this class (Dice). The property determines if `update` by
                default needs access to the full metric state. If this is not the case, significant speedups can be
                achieved and we recommend setting this to `False`.
                We provide an checking function
                `from torchmetrics.utilities import check_forward_full_state_property`
                that can be used to check if the `full_state_update=True` (old and potential slower behaviour,
                default for now) or if `full_state_update=False` can be used safely.

  warnings.warn(*args, **kwargs)
Using 16bit native Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:133: UserWarning: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
  rank_zero_warn("You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.")
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name               | Type             | Params
--------------------------------------------------------
0 | model              | DynUNet          | 31.2 M
1 | model.input_block  | UnetBasicBlock   | 31.2 K
2 | model.downsamples  | ModuleList       | 8.5 M 
3 | model.bottleneck   | UnetBasicBlock   | 5.5 M 
4 | model.upsamples    | ModuleList       | 17.2 M
5 | model.output_block | UnetOutBlock     | 132   
6 | model.skip_layers  | DynUNetSkipLayer | 31.2 M
7 | loss               | Loss             | 0     
8 | loss.loss_fn       | DiceCELoss       | 0     
9 | dice               | Dice             | 0     
--------------------------------------------------------
31.2 M    Trainable params
0         Non-trainable params
31.2 M    Total params
62.386    Total estimated model params size (MB)
Epoch 0    ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/150 0:00:23 • 0:01:14 2.00it/s loss: 2.81

I've found that this might be a PyTorch Lightning issue on some systems; please check https://github.com/Lightning-AI/lightning/issues/4612
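
If it keeps hanging at the "Initializing distributed" step, a generic first step for debugging NCCL/DDP hangs (a suggestion, not something the nnUNet scripts configure themselves) is to enable NCCL logging and, on machines with faulty GPU peer-to-peer paths, disable P2P before relaunching:

# Make NCCL print its transport/setup details so the hang point shows up in the log
export NCCL_DEBUG=INFO
# On some systems broken GPU peer-to-peer links cause DDP to stall at init; disabling P2P is a common workaround
export NCCL_P2P_DISABLE=1
python scripts/benchmark.py --mode train --gpus 2 --dim 3 --batch_size 2 --amp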