OpenDriveLab / Vista

A Generalizable World Model for Autonomous Driving
https://vista-demo.github.io
Apache License 2.0

Code stuck on "initializing distributed" when using more than one GPU #22

Closed by jmonas 1 month ago

jmonas commented 1 month ago

I'm able to run phase 1 training on a single GPU with nproc_per_node=1. However, distributed training gets stuck with nproc_per_node>1.

I'm running the command below (with CUDA 11.8):

      torchrun \
          --nnodes=1 \
          --nproc_per_node=2 \
          train.py \
          --base configs/training/vista_phase1.yaml \
          --num_nodes 1 \
          --n_devices 2

Code is hanging on:

[rank: 0] Global seed set to 23
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[rank: 1] Global seed set to 23
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2

Any chance you've encountered this?

Little-Podi commented 1 month ago

Hi. Is there an error message? Did you take a look at the GPU usage to check whether the program is still running? If it has not been interrupted, maybe you can wait for a while to see how it goes.
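
If it keeps hanging, it may also help to turn on the standard PyTorch/NCCL debug logging to see where the rendezvous stalls. A minimal sketch, reusing your exact command (NCCL_DEBUG and TORCH_DISTRIBUTED_DEBUG are existing PyTorch/NCCL environment variables, not Vista options):

      NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL \
      torchrun \
          --nnodes=1 \
          --nproc_per_node=2 \
          train.py \
          --base configs/training/vista_phase1.yaml \
          --num_nodes 1 \
          --n_devices 2

Watching nvidia-smi in another terminal also shows whether both ranks ever allocate GPU memory.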

jmonas commented 1 month ago

Here's a code snippet with the output for a distributed run: https://codefile.io/f/O4diSKJuuN.

The run doesn't error out or get interrupted. The program is still "running"; however, after waiting quite some time, I'm fairly confident it is frozen.
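
To double-check that the workers are stuck rather than just slow, one option is to dump their Python stacks with py-spy (a sketch, assuming py-spy is installed; <worker_pid> stands for whatever ps reports for each torchrun worker):

      pip install py-spy
      # print the current Python stack of a hung worker to see where it is blocked
      py-spy dump --pid <worker_pid>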

With a single-GPU run, it progresses to printing the model config almost instantaneously:

[rank: 0] Global seed set to 23
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/1
You are using a CUDA device ('NVIDIA H100 80GB HBM3') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Setting up LambdaLR scheduler...
Rank: 0 partition count [1, 1] and sizes[(558941760, False), (967731882, False)] 
Project config
model:
  base_learning_rate: 1.0e-05
  target: vwm.models.diffusion.DiffusionEngine
  params:
    use_ema: true
    input_key: img_seq
    scale_factor: 0.18215
    disable_first_stage_autocast: true
    en_and_decode_n_samples_a_time: 1
    num_frames: 25
    slow_spatial_layers: true
    train_peft_adapters: false
    replace_cond_frames: true
    fixed_cond_frames:
    - - 0
      - 1
      - 2
    denoiser_config:
      target: vwm.modules.diffusionmodules.denoiser.Denoiser
      params:
        num_frames: 25
...
jmonas commented 1 month ago

Resolved now. I needed to specify the --standalone flag for a single-node run. Thank you for the help.

      torchrun \
          --standalone \
          --nnodes=1 \
          --nproc_per_node=2 \
          train.py \
          --base configs/training/vista_phase1.yaml \
          --num_nodes 1 \
          --n_devices 2
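
For reference, my understanding is that --standalone simply makes torchrun start its own local c10d rendezvous on a free port, so the workers no longer wait on the default master address/port, which apparently never connected in my environment. As far as I can tell, spelling the rendezvous out explicitly should be roughly equivalent:

      torchrun \
          --rdzv_backend=c10d \
          --rdzv_endpoint=localhost:0 \
          --nnodes=1 \
          --nproc_per_node=2 \
          train.py \
          --base configs/training/vista_phase1.yaml \
          --num_nodes 1 \
          --n_devices 2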