facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Difference between increasing batch size and using update frequency for NMT #3633

Open joeforan76 opened 3 years ago

joeforan76 commented 3 years ago

What is your question?

Assuming GPU memory is not an issue, should there be a difference between the following scenarios?

Scenario   #GPUs   Batch size   Update frequency
A          4       6400         1
B          1       6400         4
C          1       25600        1
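
For reference, if "Batch size" here means tokens per GPU (i.e. fairseq's --max-tokens, which seems consistent with the wpb values in the logs further down), the effective tokens per optimizer step work out the same in all three cases. A quick arithmetic sketch using the values from the table above:

```python
# Effective tokens per optimizer step = tokens_per_gpu * num_gpus * update_freq
# (assuming "Batch size" maps to --max-tokens per GPU; values from the table above)
scenarios = {
    "A": dict(num_gpus=4, tokens_per_gpu=6400, update_freq=1),
    "B": dict(num_gpus=1, tokens_per_gpu=6400, update_freq=4),
    "C": dict(num_gpus=1, tokens_per_gpu=25600, update_freq=1),
}

for name, s in scenarios.items():
    effective = s["tokens_per_gpu"] * s["num_gpus"] * s["update_freq"]
    print(f"Scenario {name}: ~{effective} tokens per optimizer step")
# All three target ~25600 tokens per step, which roughly matches the
# wpb of ~24.5k reported in the training logs (the gap is presumably
# padding/bucketing).
```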

My understanding is that all three should be equivalent, but I found that while I am able to train to convergence with similar results under Scenarios A and B, when I try Scenario C I get exploding losses on the third epoch. I tried varying the random seed to see if it was down to chance, but I consistently got the same result. All other hyperparameters are kept the same. As Scenario C gives me better utilisation of GPU memory, I would like to be able to use it. Perhaps I can get it to converge by adjusting the learning rate and other hyperparameters, but first I'd like to understand why it behaves differently from the other two setups.
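
For what it's worth, here is a minimal PyTorch sketch (toy model and data, not fairseq's trainer) of why accumulating gradients over several micro-batches should match a single large batch, provided the loss normalization is kept consistent:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two identical toy models (hypothetical stand-ins for the real NMT model).
model_big = nn.Linear(16, 4)
model_accum = nn.Linear(16, 4)
model_accum.load_state_dict(model_big.state_dict())

data = torch.randn(32, 16)
target = torch.randn(32, 4)
# Sum-reduced loss so the accumulated micro-batch losses add up exactly.
loss_fn = nn.MSELoss(reduction="sum")

# "Scenario C": one large batch, one backward pass.
loss_fn(model_big(data), target).div(len(data)).backward()

# "Scenario B": update_freq=4, gradients accumulate in .grad across backwards.
for chunk, tgt in zip(data.chunk(4), target.chunk(4)):
    loss_fn(model_accum(chunk), tgt).div(len(data)).backward()

# The accumulated gradients match the big-batch gradients up to rounding.
for p_big, p_acc in zip(model_big.parameters(), model_accum.parameters()):
    assert torch.allclose(p_big.grad, p_acc.grad, atol=1e-6)
```

In fp32 the two paths agree to rounding error, so any systematic divergence presumably comes from things that differ per backward pass, such as fp16 loss scaling or batching/bucketing order.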

The log output just before failure for Scenario C is

2021-06-17 19:31:55 | INFO | train_inner | epoch 003:   4337 / 76693 loss=3.484, nll_loss=1.628, ppl=3.09, wps=91261.6, ups=3.72, wpb=24516.2, bsz=1127, num_updates=157700, lr=9.00926e-05, gnorm=3.132, loss_scale=0.0001, train_wall=27, wall=43731
2021-06-17 19:32:23 | INFO | train_inner | epoch 003:   4437 / 76693 loss=6.173, nll_loss=4.325, ppl=20.05, wps=87453.8, ups=3.59, wpb=24393.4, bsz=1174, num_updates=157800, lr=9.00641e-05, gnorm=27.876, loss_scale=0.0001, train_wall=28, wall=43759
2021-06-17 19:32:43 | INFO | fairseq.nan_detector | Detected nan/inf grad norm, dumping norms...

whereas the output at approximately the equivalent step under Scenario B is

2021-06-18 21:47:23 | INFO | train_inner | epoch 003:   4357 / 78289 loss=2.765, nll_loss=1.23, ppl=2.35, wps=57589, ups=2.4, wpb=24035.5, bsz=1079.6, num_updates=160900, lr=8.91922e-05, gnorm=0.198, loss_scale=16, train_wall=41, wall=67489
2021-06-18 21:48:05 | INFO | train_inner | epoch 003:   4457 / 78289 loss=2.752, nll_loss=1.216, ppl=2.32, wps=57766.5, ups=2.4, wpb=24020.4, bsz=1135.1, num_updates=161000, lr=8.91645e-05, gnorm=0.194, loss_scale=16, train_wall=41, wall=67531

Note that the loss_scale and gnorm values are vastly different. Under Scenario C the loss scale drops to a very low value very quickly, whereas in the other scenarios it oscillates between 8.0 and 16.0.
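
For context on those loss_scale values: a loss scale of 0.0001 means the fp16 dynamic scaler has backed off many times in a row. A rough sketch of the usual dynamic loss-scaling loop (a simplified illustration, not fairseq's exact DynamicLossScaler):

```python
# Simplified dynamic fp16 loss scaling (illustration only, not fairseq's code).
class ToyLossScaler:
    def __init__(self, init_scale=128.0, scale_factor=2.0, scale_window=2000):
        self.scale = init_scale          # the loss is multiplied by this before backward
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self._steps_since_overflow = 0

    def update(self, grads_overflowed: bool):
        if grads_overflowed:
            # inf/nan in the gradients: skip the optimizer step, back off the scale.
            self.scale /= self.scale_factor
            self._steps_since_overflow = 0
        else:
            self._steps_since_overflow += 1
            if self._steps_since_overflow % self.scale_window == 0:
                # A long run of clean steps: try a larger scale again.
                self.scale *= self.scale_factor

# Going from an initial scale like 128 down to ~0.0001 takes roughly twenty
# consecutive overflow/backoff events, i.e. under Scenario C the fp16 gradients
# appear to be overflowing on almost every update, while Scenarios A/B stay at 8-16.
```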

What's your environment?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

joeforan76 commented 2 years ago

bump