Hey, common strategies are to increase the BS either linearly or quadratically relative to the LR, i.e. BS * x -> LR * x or BS * x -> LR * sqrt(x).
Besides that I don't have any clear recommendation. You can always keep the LR from the default config despite using more devices, but I encourage you to try a few more LRs, maybe LR2 for BS8.
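A minimal sketch of the two rules, assuming a made-up base LR (the 1e-3 below is a placeholder, not a value from this repo's config):

```python
import math

def scale_lr(base_lr: float, bs_factor: float, rule: str = "sqrt") -> float:
    """Scale a learning rate for a batch size increased by `bs_factor`.

    rule="linear": LR * x        (linear scaling)
    rule="sqrt":   LR * sqrt(x)  (square-root scaling)
    """
    if rule == "linear":
        return base_lr * bs_factor
    if rule == "sqrt":
        return base_lr * math.sqrt(bs_factor)
    raise ValueError(f"unknown rule: {rule}")

# Hypothetical base config: LR 1e-3, batch size grown 8x.
print(scale_lr(1e-3, 8, rule="linear"))  # 8e-3
print(scale_lr(1e-3, 8, rule="sqrt"))    # ~2.83e-3, close to the LR2-for-BS8 suggestion
```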
Hi, thank you for the quick response.
By BS * x -> LR * sqrt(x), do you mean that if the batch size increases by x times, then the learning rate increases by sqrt(x) times?
And by "LR2 for BS8", do you mean doubling the LR when the batch size increases by 8 times?
Yeah, that's what I meant in both cases.
Hi, if you keep batch_size = 128 and train with multiple GPUs, e.g. 8 GPUs, the effective batch size is 128 * 8 = 1024. Do you have any idea how to set the learning rate in that case?
I have tried changing the learning rate, for example scaling it linearly with the number of GPUs (lr = single-GPU lr * 8), but only got a trained model with a worse negative log-likelihood.
So far I have only considered optim = adafactor and scheduler = legacy, though.
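For concreteness, applying the two scaling rules above to this setup with a placeholder base LR (the 1e-3 is assumed, not from the config; batch size and GPU count are from the question):

```python
import math

per_gpu_bs, num_gpus = 128, 8
base_lr = 1e-3                            # placeholder single-GPU LR
effective_bs = per_gpu_bs * num_gpus      # 1024

lr_linear = base_lr * num_gpus            # 8e-3, the linear rule tried above
lr_sqrt = base_lr * math.sqrt(num_gpus)   # ~2.83e-3, the gentler square-root rule
print(effective_bs, lr_linear, lr_sqrt)
```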