he-y opened this issue 4 years ago
Recently I have been using distributed training more often. You need to make sure a single GPU has the same batch size as mine; you should get the same result, but it may take more time if you have fewer GPUs.
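To make the time difference concrete, here is a rough sketch (the dataset size and GPU counts below are example numbers I picked, not values from this repo): with the per-GPU batch size fixed, fewer GPUs only means more iterations per epoch.

```python
# Rough sketch: how the GPU count affects iterations per epoch when the
# per-GPU batch size is fixed. The dataset size is an example value.
dataset_size = 1_281_167       # e.g. ImageNet-1k training set
per_gpu_batch = 128            # keep this identical to the reference setup

for num_gpus in (8, 4, 1):
    global_batch = per_gpu_batch * num_gpus
    iters_per_epoch = dataset_size // global_batch
    print(f"{num_gpus} GPU(s): global batch {global_batch}, "
          f"{iters_per_epoch} iterations per epoch")
```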
Thanks for your reply. I understand that a single GPU should have the same batch size (128) as yours. I have a question about the learning rate: does it need to be changed?
Based on the Linear Scaling Rule in the paper (Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour), the learning rate should be changed according to the batch size.
Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.
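If that rule were applied here, it would look something like the sketch below; `base_lr` and `base_batch` are placeholder values I assumed, not taken from this repo's config.

```python
# Sketch of the Linear Scaling Rule (assumed reference values, not this repo's).
base_lr = 0.1        # learning rate tuned for the reference global batch size
base_batch = 1024    # reference global batch size

per_gpu_batch = 128
num_gpus = 4
global_batch = per_gpu_batch * num_gpus          # k = global_batch / base_batch

scaled_lr = base_lr * global_batch / base_batch  # multiply the lr by k
print(f"global batch {global_batch} -> learning rate {scaled_lr:.4f}")
```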
Thank you very much!
No need to change it, I think. The paper probably means the batch size on one device; normally the batch size reported in a paper is the per-device batch size. Take care of the difference between DistributedDataParallel and DataParallel.
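Not this repo's code, but a minimal sketch of that difference: with DataParallel the DataLoader's `batch_size` is the global batch that gets split across GPUs, while with DistributedDataParallel each process builds its own loader, so `batch_size` is per GPU.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset, just for illustration.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
num_gpus = 4

# DataParallel: a single process; batch_size is the GLOBAL batch,
# split internally across GPUs (128 samples end up on each GPU).
dp_loader = DataLoader(dataset, batch_size=128 * num_gpus)

# DistributedDataParallel: one process per GPU; batch_size is PER GPU.
# DistributedSampler gives each process its own shard of the data
# (rank=0 shown here; in practice each process passes its own rank).
sampler = DistributedSampler(dataset, num_replicas=num_gpus, rank=0)
ddp_loader = DataLoader(dataset, batch_size=128, sampler=sampler)
```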
Thanks for your great work! Could you please share how the batch size and the number of GPUs influence the results? Also, how should one choose a suitable learning rate and batch size when not enough GPUs are available? Thank you!