Open yejr0229 opened 1 month ago
Can you share a minimal repro?
Bug description
DDP training hangs at the first iteration. The launching process keeps waiting on a child PID: os.waitpid() always returns pid == 0.
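For context on the symptom: when called with os.WNOHANG, os.waitpid() returns (0, 0) as long as the child has not exited, so a pid of 0 usually means the worker process is still alive (possibly blocked) rather than that the wait itself failed. The sketch below only illustrates that standard-library behavior on a Unix system. Whether the Lightning launcher polls its workers in exactly this way is an assumption, and the names poll_child and demo are hypothetical.

```python
import os
import time


def poll_child(pid: int) -> str:
    """Check a child process without blocking (Unix only)."""
    # With WNOHANG, waitpid returns (0, 0) if the child has not exited yet.
    waited_pid, _status = os.waitpid(pid, os.WNOHANG)
    return "running" if waited_pid == 0 else "exited"


def demo(child_sleep_s: float = 0.3) -> list:
    """Fork a short-lived child and record what each poll reports."""
    pid = os.fork()
    if pid == 0:
        # Child: stand-in for a worker that is busy (or stuck) for a while.
        time.sleep(child_sleep_s)
        os._exit(0)

    # Parent: the first poll happens while the child is still sleeping.
    states = [poll_child(pid)]
    while states[-1] == "running":
        time.sleep(0.05)
        states.append(poll_child(pid))
    return states


if __name__ == "__main__":
    print(demo())
```

If the worker never exits or never makes progress, every poll keeps reporting pid == 0, which matches the behavior described above. A stack dump of the worker processes would show where they are actually blocked.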
What version are you seeing the problem on?
v1.x
How to reproduce the bug
No response
Error messages and logs
Environment
torch 2.1.0 + CUDA 12.1 / 11.8; pytorch-lightning 1.9.0 / 1.9.2; 8x H100
More info
I tried setting limit_train_batches=0.1 and limit_val_batches=1 in Trainer(), but that did not resolve the hang.