Code stuck at All DDP processes registered

vatsalananthula commented 1 year ago

My code hangs indefinitely after printing:

distributed_backend=nccl
All DDP processes registered. Starting ddp with 2 processes

vatsalananthula commented 1 year ago

I was using 2 A6000s; switching to 2 A100s fixes the issue, but it's double the cost. With the A100s, the model still takes a while at this point:

libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/datamodule.py:423: LightningDeprecationWarning: DataModule.setup has already been called, so it will not be called again. In v1.6 this behavior will change to always call DataModule.setup.
  rank_zero_deprecation(
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Training the full unet
Training the full unet
Setting up LambdaLR scheduler...
Setting up LambdaLR scheduler...

As for GPU usage, only one GPU is at 100% utilization while the other sits at 0%, in case that factors into it.

This same setup ran fine yesterday, so I'm not sure what changed between yesterday and today.
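
For anyone trying to narrow down a hang like this, here is a minimal sketch of how one might collect more information from NCCL before the run stalls. It relies on two standard NCCL environment variables, NCCL_DEBUG and NCCL_P2P_DISABLE. Whether disabling peer-to-peer transfers has any effect on this particular hang is an untested assumption, not a confirmed fix. The helper name `nccl_diagnostic_env` and the script name `train.py` are hypothetical and not part of this repo.

```python
import os


def nccl_diagnostic_env(base_env=None, disable_p2p=False):
    """Return a copy of the environment with NCCL diagnostics switched on.

    Hypothetical helper for illustration only.
    """
    # Copy so the caller's environment (or os.environ) is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    # NCCL_DEBUG=INFO makes NCCL log its initialization and transport choices.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        # Assumption: GPU peer-to-peer transfers are a suspect worth ruling
        # out. This setting routes traffic around direct GPU-to-GPU copies.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


# Possible usage (train.py is a placeholder for the actual training script):
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"],
#                  env=nccl_diagnostic_env(disable_p2p=True))
```

Running once with `disable_p2p=False` and once with `disable_p2p=True` would show whether the hang depends on the transport NCCL picks, and the extra log output should indicate how far initialization gets on each rank.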

TrickyJustice commented 10 months ago

Were you able to solve this problem on A6000s?

LambdaLabsML / examples

Code stuck at All DDP processes registered #59