facebookresearch / fairseq-lua

Facebook AI Research Sequence-to-Sequence Toolkit

nan loss / ppl when training blstm model #38

Closed: nuance closed this issue 7 years ago

nuance commented 7 years ago

I'm training a set of translation models using the suggested fconv parameters (but with the model switched to blstm):

fairseq train -sourcelang en -targetlang fr -datadir data/fairseq/en-fr -model blstm -nenclayer 4 -nlayer 3 -dropout 0.2 -optim nag -lr 0.25 -clip 0.1 -momentum 0.99 -timeavg -bptt 0 -savedir data/fairseq/en-fr.blstm -batchsize 16 | tee train..blstm.log

I'm seeing loss and perplexity become nan after a few epochs:

| epoch 000 | 0001000 updates | words/s    4328| trainloss     8.72 | train ppl   420.34
| epoch 000 | 0002000 updates | words/s    4559| trainloss     6.91 | train ppl   120.29
| checkpoint 001 | epoch 001 | 0002645 updates | s/checkpnt     767 | words/s    4461 | lr 0.250000
| checkpoint 001 | epoch 001 | 0002645 updates | trainloss     7.40 | train ppl   169.38
| checkpoint 001 | epoch 001 | 0002645 updates | validloss     5.87 | valid ppl    58.37 | testloss     5.82 | test ppl    56.55
| epoch 001 | 0003645 updates | words/s    4371| trainloss     5.85 | train ppl    57.84
| epoch 001 | 0004645 updates | words/s    4373| trainloss     5.58 | train ppl    47.91
| checkpoint 002 | epoch 002 | 0005290 updates | s/checkpnt     783 | words/s    4373 | lr 0.250000
| checkpoint 002 | epoch 002 | 0005290 updates | trainloss     5.65 | train ppl    50.15
| checkpoint 002 | epoch 002 | 0005290 updates | validloss     5.25 | valid ppl    38.13 | testloss     5.21 | test ppl    36.96
| epoch 002 | 0006290 updates | words/s    4327| trainloss     5.33 | train ppl    40.15
| epoch 002 | 0007290 updates | words/s    4274| trainloss     5.24 | train ppl    37.82
| checkpoint 003 | epoch 003 | 0007935 updates | s/checkpnt     800 | words/s    4281 | lr 0.250000
| checkpoint 003 | epoch 003 | 0007935 updates | trainloss     5.25 | train ppl    38.07
| checkpoint 003 | epoch 003 | 0007935 updates | validloss     4.99 | valid ppl    31.81 | testloss     4.95 | test ppl    30.86
| epoch 003 | 0008935 updates | words/s    4235| trainloss      nan | train ppl      nan
| epoch 003 | 0009935 updates | words/s    4341| trainloss      nan | train ppl      nan
| checkpoint 004 | epoch 004 | 0010580 updates | s/checkpnt     791 | words/s    4325 | lr 0.250000
| checkpoint 004 | epoch 004 | 0010580 updates | trainloss      nan | train ppl      nan
| checkpoint 004 | epoch 004 | 0010580 updates | validloss      nan | valid ppl      nan | testloss      nan | test ppl      nan
| epoch 004 | 0011580 updates | words/s    4341| trainloss      nan | train ppl      nan
| epoch 004 | 0012580 updates | words/s    4347| trainloss      nan | train ppl      nan
| checkpoint 005 | epoch 005 | 0013225 updates | s/checkpnt     791 | words/s    4328 | lr 0.250000
| checkpoint 005 | epoch 005 | 0013225 updates | trainloss      nan | train ppl      nan
| checkpoint 005 | epoch 005 | 0013225 updates | validloss      nan | valid ppl      nan | testloss      nan | test ppl      nan

Is this something I should expect? Would you guess this is a parameter configuration issue (e.g., the optimizer being too aggressive and overflowing), or does it suggest a bug (e.g., an overflow in the loss or perplexity code)?
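(For reference, the ppl column appears to be just the exponentiated loss, base 2 in these logs: 2^8.72 ≈ 421, which matches the logged 420.34. So once the training loss itself turns nan, a nan perplexity follows automatically rather than pointing to a separate bug in the perplexity code.)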

denisyarats commented 7 years ago

Try decreasing the logging interval (-log_interval); that will help you see whether the gradients are in fact exploding. If so, try a smaller learning rate and/or stricter clipping.
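For example (illustrative values only, not tuned, and assuming -log_interval takes a number of updates between log lines), a re-run with more frequent logging, a lower learning rate, and tighter clipping might look like:

fairseq train -sourcelang en -targetlang fr -datadir data/fairseq/en-fr -model blstm -nenclayer 4 -nlayer 3 -dropout 0.2 -optim nag -lr 0.05 -clip 0.05 -momentum 0.99 -timeavg -bptt 0 -log_interval 100 -savedir data/fairseq/en-fr.blstm -batchsize 16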

nuance commented 7 years ago

I switched to the adam optimizer (following the README example) and things appear to be progressing much more reasonably.

jgehring commented 7 years ago

The README contains good basic hyper-parameter settings for all supported models. Here, you're training a BLSTM model with Nesterov accelerated gradient and hyper-parameters that were tuned for the fconv model. I recommend starting from the hyper-parameters mentioned in the README (-optim adam -lr 0.0003125 -bptt 25 -clip 25) or doing a grid search to find good hyper-parameters for the nag optimizer.
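As a sketch (not the exact README command), swapping those settings into the original invocation and dropping the -momentum and -timeavg flags used with nag would give something like:

fairseq train -sourcelang en -targetlang fr -datadir data/fairseq/en-fr -model blstm -nenclayer 4 -nlayer 3 -dropout 0.2 -optim adam -lr 0.0003125 -bptt 25 -clip 25 -savedir data/fairseq/en-fr.blstm -batchsize 16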