facebookresearch / fastMRI

A large-scale dataset of both raw MRI measurements and clinical MRI images.
https://fastmri.org
MIT License

Incorrect training loss when running across multiple GPUs (with ddp) #112

Closed · vivekiyer closed this issue 3 years ago

vivekiyer commented 3 years ago

We noticed that if we run training across multiple GPUs (with ddp enabled) the training loss that is printed seems to be incorrect, and does not decrease monotonically with each epoch. The same model when run on a single GPU shows monotonically decreasing loss. I have attached sample losses from a multiple GPU run and a single GPU run below. Any suggestions on where we should look to fix this?

results_multiplegpus.txt results_singlegpu.txt
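(For context on the symptom: under DDP each process typically computes and prints the loss of its own data shard, so the per-rank numbers are not directly comparable to a single-GPU run unless they are reduced across processes first. Below is a minimal sketch of averaging the value over ranks before printing, assuming a plain PyTorch DDP loop; the helper and variable names are illustrative, not fastMRI's actual code.)

```python
import torch
import torch.distributed as dist


def average_across_ranks(loss: torch.Tensor) -> torch.Tensor:
    """Average a scalar loss over all DDP processes, for logging only.

    Assumes torch.distributed has already been initialized by the DDP
    launcher; falls back to the local value otherwise.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return loss
    reduced = loss.detach().clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    reduced /= dist.get_world_size()
    return reduced


# Inside the training loop, log the averaged value but backpropagate the
# local one (DDP already averages gradients across ranks):
#
#   loss = criterion(output, target)
#   loss.backward()
#   if rank == 0:
#       print(f"epoch {epoch}: train loss {average_across_ranks(loss).item():.4f}")
```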

mmuckley commented 3 years ago

I don't think there are any monotonicity guarantees for any of the stochastic gradient algorithms that we use. In this case my assumption for the single-GPU U-Net is that the monotonic decrease happened by chance. I doubt you would see monotonicity with any GPU arrangement for the VarNet model, which uses an SSIM-based loss.
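(For anyone who simply wants the multi-GPU log to be comparable to the single-GPU one, the logged value can be synchronized across processes at logging time. A hedged sketch using PyTorch Lightning's logging API, assuming a Lightning module along the lines of fastMRI's training modules; the module, network, and metric names here are illustrative placeholders.)

```python
import pytorch_lightning as pl
import torch
from torch import nn


class LoggingExample(pl.LightningModule):
    """Illustrative module showing DDP-aware loss logging, not fastMRI's code."""

    def __init__(self):
        super().__init__()
        self.model = nn.Linear(320, 320)  # placeholder network
        self.criterion = nn.L1Loss()

    def training_step(self, batch, batch_idx):
        image, target = batch
        output = self.model(image)
        loss = self.criterion(output, target)
        # sync_dist=True averages the logged value over all DDP ranks, so the
        # reported curve reflects the whole batch rather than one GPU's shard.
        self.log("train_loss", loss, sync_dist=True, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```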

vivekiyer commented 3 years ago

Thanks for the response and the comment. Appreciate it.