Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Training stuck at the first iteration, can't get corresponding pid #20367

Open yejr0229 opened 1 month ago

yejr0229 commented 1 month ago

Bug description

DDP training gets stuck at the first iteration and keeps waiting for a pid: [image] os.waitpid() always returns pid == 0.
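For context on the `pid == 0` result: on POSIX, `os.waitpid(pid, os.WNOHANG)` returns `(0, 0)` whenever the child has not exited yet, so a poll that keeps returning 0 means the child process is still alive (busy or blocked), not that the pid is missing. The sketch below only illustrates that behavior with a forked stand-in child. Whether the call in the screenshot actually uses `os.WNOHANG` is an assumption, and `poll_child` is a hypothetical helper, not part of Lightning.

```python
import os
import time


def poll_child(pid):
    """Non-blocking check: returns (0, 0) while the child is still running."""
    return os.waitpid(pid, os.WNOHANG)


pid = os.fork()
if pid == 0:
    # Child: stand-in for a worker process that is busy (or stuck) for a while.
    time.sleep(0.5)
    os._exit(0)

# Parent: the child is still sleeping, so no exit status is available yet.
running_result = poll_child(pid)

# A blocking wait returns only once the child has exited.
finished_pid, status = os.waitpid(pid, 0)
exit_code = os.WEXITSTATUS(status)

print("while running:", running_result)
print("after exit:", finished_pid == pid, exit_code)
```

If the real worker never exits or never makes progress, a polling loop of this kind will spin indefinitely, which would match the hang described above.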

What version are you seeing the problem on?

v1.x

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

torch 2.1.0 + CUDA 12.1/11.8, pytorch-lightning==1.9.0/1.9.2, 8x H100

More info

I tried setting limit_train_batches=0.1 and limit_val_batches=1 in Trainer(), but it didn't help.

lantiga commented 1 week ago

Can you share a minimal repro?
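When a hang is hard to reduce to a small script, a stack dump from the stuck process can help narrow it down. A minimal sketch using only the standard library `faulthandler` module follows. `run_with_hang_dump` and `simulated_hang` are hypothetical names for illustration, and the short sleep stands in for the call that hangs (in the report, presumably the call that starts training).

```python
import faulthandler
import tempfile
import time


def run_with_hang_dump(fn, timeout_s, log_file):
    """Run fn(); if it is still running after timeout_s seconds,
    dump the stack of every thread to log_file."""
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log_file)
    try:
        return fn()
    finally:
        # Disarm the watchdog so nothing is written after fn() returns.
        faulthandler.cancel_dump_traceback_later()


def simulated_hang():
    # Stand-in for the call that gets stuck.
    time.sleep(0.5)
    return "done"


with tempfile.TemporaryFile(mode="w+") as log:
    result = run_with_hang_dump(simulated_hang, timeout_s=0.1, log_file=log)
    log.seek(0)
    dump = log.read()

print(dump)
```

In a multi-process DDP run, each rank would need its own log file, and a much longer timeout than the 0.1 s used here.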