airsplay / lxmert

PyTorch code for EMNLP 2019 paper "LXMERT: Learning Cross-Modality Encoder Representations from Transformers".
MIT License

relationship between BatchSize and LR? #77

Open XChuanLee opened 4 years ago

XChuanLee commented 4 years ago

Thanks for this great code, and sorry to bother you.

During the pre-training stage, I want to use a much larger batch size for faster convergence, but I found that simply multiplying the learning rate by the same factor does not reach the same performance. Have you met the same problem? How do you handle the relationship between the learning rate and the batch size?

Thanks again!

airsplay commented 4 years ago

I used up to a batch size of 256 (due to resource limitations TAT). Here are some related references that come to mind:

  1. ImageNet in 1 hour suggests the "Linear Scaling Rule": when the minibatch size is multiplied by k, multiply the learning rate by k. See Sec 2.1 for details (a rough sketch follows this list).

  2. The more recent paper BERT in 76 mins suggests a new optimizer called LAMB, which might be worth trying. An implementation is available here, but I have not had a chance to use it yet.
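
Below is a minimal sketch of the Linear Scaling Rule in PyTorch, purely for illustration. The base batch size of 256, base learning rate of 1e-4, warmup fraction, total step count, and the toy model are all assumptions made for the example, not the repo's actual pre-training configuration.

```python
import torch

# Hypothetical baseline (assumed values, not necessarily this repo's defaults):
# the reference run uses batch size 256 with a peak learning rate of 1e-4.
BASE_BATCH_SIZE = 256
BASE_LR = 1e-4

def scaled_lr(batch_size, base_batch_size=BASE_BATCH_SIZE, base_lr=BASE_LR):
    """Linear Scaling Rule: if the batch size is multiplied by k, multiply the LR by k."""
    return base_lr * batch_size / base_batch_size

# A toy module stands in for the real encoder in this sketch.
model = torch.nn.Linear(768, 768)

batch_size = 1024  # 4x the baseline batch, so 4x the baseline LR
optimizer = torch.optim.AdamW(model.parameters(), lr=scaled_lr(batch_size))

# The ImageNet-in-1-hour paper pairs the rule with a warmup phase, which also
# matters for large-batch Transformer pre-training: linear warmup, then linear decay.
total_steps = 100_000
warmup_steps = int(0.05 * total_steps)

def lr_lambda(step):
    # Returns a multiplier applied to the optimizer's initial learning rate.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In a training loop, call optimizer.step() and then scheduler.step() once per batch.
```

If LAMB turns out to be a better fit for very large batches, a third-party implementation could be swapped in for AdamW in the same setup; the scaling and warmup pieces stay unchanged.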

XChuanLee commented 4 years ago

Thanks so much.

BTW, it was resolved once I reduced the learning rate, but more experiments are probably needed to reach the best performance.

Thanks~

XChuanLee commented 4 years ago

Sorry for bothering you again. I reran the code with the same learning rate and batch size, but the model fine-tuned on VQA does not reach the performance of your proposed model: I only got 69.5, even though my pre-training loss is below 4.6. Any idea what could be going wrong?