Closed — irowberryFS closed this issue 3 months ago
I have been worried about corner cases where this optimizer may break behavior during multi-GPU training. I haven't seen this on my own runs, but it's hard to cover all the cases. Are you using batch-norm, and the correction I mentioned in the docs for BN? Are the train/eval calls to the optimizer applied on all GPUs?
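For reference, a minimal sketch of the pattern I mean, not anyone's actual training script. It assumes the process group is already initialized (e.g. launched with torchrun), that the optimizer is `schedulefree.AdamWScheduleFree` (which provides the `.train()`/`.eval()` switches), and that `build_model()`, `loss_fn`, `train_loader`, and `num_epochs` are placeholders for your own code:

```python
# Sketch only. Assumptions: process group already initialized (torchrun),
# schedulefree.AdamWScheduleFree as the optimizer, and build_model(),
# loss_fn, train_loader, num_epochs as placeholders for the user's code.
import os
import torch
import schedulefree
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ.get("LOCAL_RANK", 0))
device = torch.device("cuda", local_rank)
model = DDP(build_model().to(device), device_ids=[local_rank])
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    model.train()
    optimizer.train()            # must run on every rank, not just rank 0
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs.to(device)), targets.to(device))
        loss.backward()
        optimizer.step()

    model.eval()
    optimizer.eval()             # likewise on every rank before evaluation

    # BatchNorm correction: refresh the running statistics under the
    # eval-mode weights by doing forward passes over a handful of training
    # batches (50 here is just an illustrative number).
    with torch.no_grad():
        model.train()
        for _, (inputs, _) in zip(range(50), train_loader):
            model(inputs.to(device))
        model.eval()
```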
So I've done more debugging in my code. I don't believe it was the optimizer. The GPUs were getting different numbers of batches, which was actually causing the timeouts. It just happened that my data loading process distributed the batches evenly on the same run where I switched back to regular AdamW, which led me to believe the optimizer was at fault.
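For anyone who hits the same thing, here's a rough sketch of one way to guarantee equal per-rank batch counts (not necessarily how my loader is set up). It assumes a map-style dataset fed through `DistributedSampler`; `train_dataset` and `num_epochs` are placeholders:

```python
# Rough sketch: drop_last on both the sampler and the loader trims the
# tail so every rank sees the same number of batches per epoch, avoiding
# the mismatched collective calls that hang DDP.
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(train_dataset, shuffle=True, drop_last=True)
train_loader = DataLoader(train_dataset, batch_size=32,
                          sampler=sampler, drop_last=True)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)     # reshuffle with a different seed each epoch
    for batch in train_loader:
        ...                      # forward / backward / step as usual
```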
@adefazio has this been tested with the various FSDP modes?
I haven't personally tested it, but I don't see any reason why it shouldn't work. Nobody has reported any issues.
Hi all, love this optimizer, it works great. However, I believe it may be breaking my multi-GPU training setup. I'm using PyTorch DDP. I don't have an error message, since some GPUs move past the end of a training loop and some never finish, causing a timeout. Maybe something to do with gradients not syncing properly?
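A hypothetical diagnostic sketch for narrowing this down: gather every rank's batch count and check they agree, since a mismatch by itself is enough to hang DDP's gradient syncs. It assumes the process group is already initialized and that `train_loader` is the DataLoader driving the loop that stalls:

```python
# Diagnostic sketch: compare how many batches each rank will run.
import torch.distributed as dist

counts = [None] * dist.get_world_size()
dist.all_gather_object(counts, len(train_loader))

if dist.get_rank() == 0:
    print("batches per rank:", counts)
    if len(set(counts)) > 1:
        print("ranks disagree on batch count -- some will wait forever")
```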