PiotrNawrot / nanoT5

Fast & Simple repository for pre-training and fine-tuning T5-style models
Apache License 2.0

Learning rate for multi-GPU training #34

Closed phucdoitoan closed 7 months ago

phucdoitoan commented 7 months ago

Hi, if you keep batch_size = 128 and train with multiple GPUs, e.g. 8 GPUs, the effective batch size becomes 128 * 8 = 1024. Do you have any idea how to set the learning rate in that case?

I have tried changing the learning rate, for example scaling it linearly with the number of GPUs (lr = single-GPU lr * 8), but I only got a trained model with a worse negative log-likelihood.

So far I have only considered optim = adafactor and scheduler = legacy, though.

PiotrNawrot commented 7 months ago

Hey, common strategies are to scale the LR either linearly with the batch size, or to increase the BS quadratically relative to the LR, i.e. BS * x -> LR * sqrt(x). Besides that I don't have any clear recommendation. You can always keep the LR at the default config's value despite using more devices, but I encourage you to try a few more LRs, maybe LR * 2 for BS * 8.
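
As a minimal sketch of the two rules (the function and parameter names below are mine, not part of nanoT5, and the base LR is a placeholder for whatever your single-GPU config uses):

```python
import math

def scaled_lr(base_lr: float, base_bs: int, new_bs: int, rule: str = "sqrt") -> float:
    """Scale a learning rate when the effective batch size changes.

    rule="linear": BS * x -> LR * x
    rule="sqrt":   BS * x -> LR * sqrt(x)
    """
    x = new_bs / base_bs
    if rule == "linear":
        return base_lr * x
    if rule == "sqrt":
        return base_lr * math.sqrt(x)
    raise ValueError(f"unknown rule: {rule}")

# Example: single-GPU batch size 128, 8 GPUs -> effective batch size 1024.
base_lr = 2e-2  # placeholder; substitute your single-GPU learning rate
print(scaled_lr(base_lr, base_bs=128, new_bs=1024, rule="linear"))  # base_lr * 8
print(scaled_lr(base_lr, base_bs=128, new_bs=1024, rule="sqrt"))    # base_lr * sqrt(8)
```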

phucdoitoan commented 7 months ago

Hi, thank you for the quick response. By BS * x -> LR * sqrt(x), do you mean that if the batch size increases by x times, the learning rate should increase by sqrt(x) times? And by "LR * 2 for BS * 8", do you mean to double the LR when the batch size increases 8 times?

PiotrNawrot commented 7 months ago

Yeah, that's what I meant in both cases.
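
(Quick arithmetic check: for x = 8 the square-root rule gives a factor of sqrt(8) ≈ 2.83, so LR * 2 is a slightly conservative instance of BS * x -> LR * sqrt(x).)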