facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Difference between increasing batch size and using update frequency for NMT #3633

Open joeforan76 opened 3 years ago

joeforan76 commented 3 years ago

What is your question?

Assuming GPU memory is not an issue, should there be a difference between the following scenarios?

Scenario   #GPUs   Batch size   Update frequency
A          4       6400         1
B          1       6400         4
C          1       25600        1
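
For reference, if "Batch size" here means tokens per GPU (i.e. fairseq's --max-tokens, which seems consistent with the wpb values in the logs further down), the effective tokens per optimizer step work out the same in all three cases. A quick arithmetic sketch using the values from the table above:

```python
# Effective tokens per optimizer step = tokens_per_gpu * num_gpus * update_freq
# (assuming "Batch size" maps to --max-tokens per GPU; values from the table above)
scenarios = {
    "A": dict(num_gpus=4, tokens_per_gpu=6400, update_freq=1),
    "B": dict(num_gpus=1, tokens_per_gpu=6400, update_freq=4),
    "C": dict(num_gpus=1, tokens_per_gpu=25600, update_freq=1),
}

for name, s in scenarios.items():
    effective = s["tokens_per_gpu"] * s["num_gpus"] * s["update_freq"]
    print(f"Scenario {name}: ~{effective} tokens per optimizer step")
# All three target ~25600 tokens per step, which roughly matches the
# wpb of ~24.5k reported in the training logs (the gap is presumably
# padding/bucketing).
```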

My understanding is that all three should be equivalent, but I found that while I am able to train to convergence with similar results under Scenarios A and B, when I try Scenario C I get exploding losses on the third epoch. I tried varying the random seed to see if it was down to chance, but I consistently got the same result. All other hyperparameters are kept the same. As Scenario C gives me better utilisation of GPU memory, I would like to be able to use it. Perhaps I can get it to converge by adjusting the learning rate and other hyperparameters, but first I'd like to understand why it behaves differently from the other two setups.
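
For what it's worth, here is a minimal PyTorch sketch (toy model and data, not fairseq's trainer) of why accumulating gradients over several micro-batches should match a single large batch, provided the loss normalization is kept consistent:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two identical toy models (hypothetical stand-ins for the real NMT model).
model_big = nn.Linear(16, 4)
model_accum = nn.Linear(16, 4)
model_accum.load_state_dict(model_big.state_dict())

data = torch.randn(32, 16)
target = torch.randn(32, 4)
# Sum-reduced loss so the accumulated micro-batch losses add up exactly.
loss_fn = nn.MSELoss(reduction="sum")

# "Scenario C": one large batch, one backward pass.
loss_fn(model_big(data), target).div(len(data)).backward()

# "Scenario B": update_freq=4, gradients accumulate in .grad across backwards.
for chunk, tgt in zip(data.chunk(4), target.chunk(4)):
    loss_fn(model_accum(chunk), tgt).div(len(data)).backward()

# The accumulated gradients match the big-batch gradients up to rounding.
for p_big, p_acc in zip(model_big.parameters(), model_accum.parameters()):
    assert torch.allclose(p_big.grad, p_acc.grad, atol=1e-6)
```

In fp32 the two paths agree to rounding error, so any systematic divergence presumably comes from things that differ per backward pass, such as fp16 loss scaling or batching/bucketing order.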

The log output just before failure for Scenario C is

2021-06-17 19:31:55 | INFO | train_inner | epoch 003:   4337 / 76693 loss=3.484, nll_loss=1.628, ppl=3.09, wps=91261.6, ups=3.72, wpb=24516.2, bsz=1127, num_updates=157700, lr=9.00926e-05, gnorm=3.132, loss_scale=0.0001, train_wall=27, wall=43731
2021-06-17 19:32:23 | INFO | train_inner | epoch 003:   4437 / 76693 loss=6.173, nll_loss=4.325, ppl=20.05, wps=87453.8, ups=3.59, wpb=24393.4, bsz=1174, num_updates=157800, lr=9.00641e-05, gnorm=27.876, loss_scale=0.0001, train_wall=28, wall=43759
2021-06-17 19:32:43 | INFO | fairseq.nan_detector | Detected nan/inf grad norm, dumping norms...

whereas the output at approximately the equivalent step under Scenario B is

2021-06-18 21:47:23 | INFO | train_inner | epoch 003:   4357 / 78289 loss=2.765, nll_loss=1.23, ppl=2.35, wps=57589, ups=2.4, wpb=24035.5, bsz=1079.6, num_updates=160900, lr=8.91922e-05, gnorm=0.198, loss_scale=16, train_wall=41, wall=67489
2021-06-18 21:48:05 | INFO | train_inner | epoch 003:   4457 / 78289 loss=2.752, nll_loss=1.216, ppl=2.32, wps=57766.5, ups=2.4, wpb=24020.4, bsz=1135.1, num_updates=161000, lr=8.91645e-05, gnorm=0.194, loss_scale=16, train_wall=41, wall=67531

Note that the loss_scale and gnorm values are vastly different. Under Scenario C the loss scale drops to a very low value very quickly, whereas in the other scenarios it oscillates between 8.0 and 16.0.
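
For context on those loss_scale values: a loss scale of 0.0001 means the fp16 dynamic scaler has backed off many times in a row. A rough sketch of the usual dynamic loss-scaling loop (a simplified illustration, not fairseq's exact DynamicLossScaler):

```python
# Simplified dynamic fp16 loss scaling (illustration only, not fairseq's code).
class ToyLossScaler:
    def __init__(self, init_scale=128.0, scale_factor=2.0, scale_window=2000):
        self.scale = init_scale          # the loss is multiplied by this before backward
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self._steps_since_overflow = 0

    def update(self, grads_overflowed: bool):
        if grads_overflowed:
            # inf/nan in the gradients: skip the optimizer step, back off the scale.
            self.scale /= self.scale_factor
            self._steps_since_overflow = 0
        else:
            self._steps_since_overflow += 1
            if self._steps_since_overflow % self.scale_window == 0:
                # A long run of clean steps: try a larger scale again.
                self.scale *= self.scale_factor

# Going from an initial scale like 128 down to ~0.0001 takes roughly twenty
# consecutive overflow/backoff events, i.e. under Scenario C the fp16 gradients
# appear to be overflowing on almost every update, while Scenarios A/B stay at 8-16.
```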

What's your environment?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

joeforan76 commented 2 years ago

bump