Closed · jmonas closed this 1 month ago
Hi. Is there an error message? Did you take a look at the GPU usage to check whether the program is still running? If it has not been interrupted, maybe you can wait for a while and see how it goes.
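If it helps, a quick way to check is something like the following (assuming `nvidia-smi` is on the path and, optionally, that `py-spy` is installed; `<PID>` is a placeholder for one of the worker process ids):

# Watch GPU utilization; a truly hung run usually sits at 0% on all ranks
watch -n 1 nvidia-smi

# Optional: dump the Python stack of a worker to see exactly where it is blocked
py-spy dump --pid <PID>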
Here's a code snippet with the output for a distributed run: https://codefile.io/f/O4diSKJuuN.
The run doesn't error out or get interrupted. The program is still "running"; however, after waiting quite some time I'm fairly confident it is frozen.
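To narrow down where it stalls, one thing I can try is relaunching with PyTorch's distributed/NCCL debug logging turned on (a rough sketch; same flags as my launch command, just with the documented debug environment variables set):

# Verbose rendezvous/NCCL logs; the last line printed before the freeze
# usually points at the init step or collective that is blocking
NCCL_DEBUG=INFO \
TORCH_DISTRIBUTED_DEBUG=DETAIL \
torchrun --nnodes=1 --nproc_per_node=2 train.py \
    --base configs/training/vista_phase1.yaml --num_nodes 1 --n_devices 2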
With a single-GPU run, it progresses to printing the model config almost instantaneously:
[rank: 0] Global seed set to 23
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/1
You are using a CUDA device ('NVIDIA H100 80GB HBM3') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Setting up LambdaLR scheduler...
Rank: 0 partition count [1, 1] and sizes[(558941760, False), (967731882, False)]
Project config
model:
  base_learning_rate: 1.0e-05
  target: vwm.models.diffusion.DiffusionEngine
  params:
    use_ema: true
    input_key: img_seq
    scale_factor: 0.18215
    disable_first_stage_autocast: true
    en_and_decode_n_samples_a_time: 1
    num_frames: 25
    slow_spatial_layers: true
    train_peft_adapters: false
    replace_cond_frames: true
    fixed_cond_frames:
    - - 0
      - 1
      - 2
    denoiser_config:
      target: vwm.modules.diffusionmodules.denoiser.Denoiser
      params:
        num_frames: 25
...
Resolved now. I needed to specify the `--standalone` flag for a single-node run. Thank you for the help.
torchrun \
--standalone \
--nnodes=1 \
--nproc_per_node=2 \
train.py \
--base configs/training/vista_phase1.yaml \
--num_nodes 1 \
--n_devices 2
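For context: `--standalone` makes torchrun start its own local c10d rendezvous, so single-node workers don't sit waiting on an external master address. As far as I understand, roughly the same thing can be spelled out with the explicit rendezvous flags (the port here is just an example and may differ by PyTorch version):

torchrun \
    --rdzv_backend=c10d \
    --rdzv_endpoint=localhost:29400 \
    --nnodes=1 \
    --nproc_per_node=2 \
    train.py \
    --base configs/training/vista_phase1.yaml \
    --num_nodes 1 \
    --n_devices 2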
I'm able to run phase 1 training on a single GPU with `nproc_per_node=1`. However, distributed training gets stuck with `nproc_per_node>1`.
I'm running the command below (with CUDA 11.8):
Code is hanging on:
Any chance you've encountered this?