Closed — irowberryFS closed this issue 3 months ago
I have been worried about corner cases where this optimizer may break behavior during multi-GPU training. I haven't seen this on my own runs, but it's hard to cover all the cases. Are you using batch-norm, and the correction I mentioned in the docs for BN? Are the train/eval calls to the optimizer applied on all GPUs?
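For reference, a minimal sketch of the pattern I mean, not anyone's actual training script. It assumes the process group is already initialized (e.g. launched with torchrun), that the optimizer is `schedulefree.AdamWScheduleFree` (which provides the `.train()`/`.eval()` switches), and that `build_model()`, `loss_fn`, `train_loader`, and `num_epochs` are placeholders for your own code:

```python
# Sketch only. Assumptions: process group already initialized (torchrun),
# schedulefree.AdamWScheduleFree as the optimizer, and build_model(),
# loss_fn, train_loader, num_epochs as placeholders for the user's code.
import os
import torch
import schedulefree
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ.get("LOCAL_RANK", 0))
device = torch.device("cuda", local_rank)
model = DDP(build_model().to(device), device_ids=[local_rank])
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    model.train()
    optimizer.train()            # must run on every rank, not just rank 0
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs.to(device)), targets.to(device))
        loss.backward()
        optimizer.step()

    model.eval()
    optimizer.eval()             # likewise on every rank before evaluation

    # BatchNorm correction: refresh the running statistics under the
    # eval-mode weights by doing forward passes over a handful of training
    # batches (50 here is just an illustrative number).
    with torch.no_grad():
        model.train()
        for _, (inputs, _) in zip(range(50), train_loader):
            model(inputs.to(device))
        model.eval()
```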
So I've done more debugging in my code. I don't believe it was the optimizer. The GPUs were getting different numbers of batches, which was actually causing the timeouts. It just happened that my data loading process distributed the batches evenly on the same run where I switched back to regular AdamW, which led me to believe the optimizer was at fault.
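For anyone who hits the same thing, here's a rough sketch of one way to guarantee equal per-rank batch counts (not necessarily how my loader is set up). It assumes a map-style dataset fed through `DistributedSampler`; `train_dataset` and `num_epochs` are placeholders:

```python
# Rough sketch: drop_last on both the sampler and the loader trims the
# tail so every rank sees the same number of batches per epoch, avoiding
# the mismatched collective calls that hang DDP.
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(train_dataset, shuffle=True, drop_last=True)
train_loader = DataLoader(train_dataset, batch_size=32,
                          sampler=sampler, drop_last=True)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)     # reshuffle with a different seed each epoch
    for batch in train_loader:
        ...                      # forward / backward / step as usual
```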
@adefazio has this been tested with the various FSDP modes?
I haven't personally tested it, but I don't see any reason why it shouldn't work. Nobody has reported any issues.
Hi all, love this optimizer, it works great. However, I believe it may be breaking my multi-GPU training setup. I'm using PyTorch DDP. I don't have an error message, since some GPUs move past the end of a training loop and some never finish, causing a timeout. Maybe something to do with gradients not syncing properly?
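A hypothetical diagnostic sketch for narrowing this down: gather every rank's batch count and check they agree, since a mismatch by itself is enough to hang DDP's gradient syncs. It assumes the process group is already initialized and that `train_loader` is the DataLoader driving the loop that stalls:

```python
# Diagnostic sketch: compare how many batches each rank will run.
import torch.distributed as dist

counts = [None] * dist.get_world_size()
dist.all_gather_object(counts, len(train_loader))

if dist.get_rank() == 0:
    print("batches per rank:", counts)
    if len(set(counts)) > 1:
        print("ranks disagree on batch count -- some will wait forever")
```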