Closed sritee closed 4 years ago
@sritee Yes, it only results in different learning rates. I have tried both sum and average, and I found that sum achieves better results. In my opinion (which may not be correct), when we sum the gradients from all MPI workers we get a "stronger" update direction (you can also think of it as a denoising process). In that case, we can use a "large" learning rate to accelerate training.
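To make the learning-rate point concrete, here is a minimal sketch, assuming plain SGD and simulating the workers with ordinary Python lists instead of real MPI processes. The helpers `combine` and `sgd_step` are hypothetical names for illustration only and are not the repository's `sync_grads`. With an adaptive optimizer the relationship may not be this exact.

```python
import math
import random


def combine(worker_grads, mode):
    """Combine per-worker gradients element-wise by summing or averaging."""
    summed = [sum(components) for components in zip(*worker_grads)]
    if mode == "sum":
        return summed
    if mode == "average":
        return [g / len(worker_grads) for g in summed]
    raise ValueError(f"unknown mode: {mode}")


def sgd_step(params, grad, lr):
    """One plain SGD update (assumption: no momentum, no adaptive scaling)."""
    return [p - lr * g for p, g in zip(params, grad)]


random.seed(0)
n_workers = 4  # hypothetical number of MPI workers
params = [0.0, 0.0, 0.0]
worker_grads = [[random.gauss(0.0, 1.0) for _ in params] for _ in range(n_workers)]

lr = 0.01
# Summing with learning rate lr ...
step_sum = sgd_step(params, combine(worker_grads, "sum"), lr)
# ... matches averaging with learning rate lr * n_workers.
step_avg_scaled = sgd_step(params, combine(worker_grads, "average"), lr * n_workers)

same_update = all(math.isclose(a, b) for a, b in zip(step_sum, step_avg_scaled))
print(same_update)
```

Under these assumptions, summing with a fixed learning rate behaves like averaging with a learning rate that grows linearly with the number of workers, so the step size may need retuning when the process count changes.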
Why do you sum rather than average the gradients in sync_grads? Won't this result in different learning rates when you run different number of processes?