Try decreasing the logging interval (-log_interval); that will help you see whether gradients are in fact exploding. If so, try a smaller learning rate and/or stricter clipping.
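A minimal sketch of what that could look like on the command line is below. Only -log_interval, -lr, and -clip are mentioned in this thread, the values are illustrative, and the data/model flags are assumed placeholders rather than the exact arguments from your run:

```sh
# Sketch only: flag values are illustrative, and the data/model flags
# (-datadir, -model) are assumed placeholders — check them against
# `fairseq train -help` and your own training command.
# Logging every 50 updates makes the gradient norm visible before the loss
# turns nan; if it is exploding, lower -lr and tighten -clip.
fairseq train -datadir data-bin/yourdata -model blstm \
  -log_interval 50 -lr 0.05 -clip 10
```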
I switched to the adam optimizer (following the README example) and things appear to be progressing much more reasonably.
The README contains good basic hyper-parameter settings for all supported models. Here, you're training a blstm model with Nesterov accelerated gradient and hyper-parameters that were tuned for the fconv model. I recommend starting from the hyper-parameters mentioned in the README (-optim adam -lr 0.0003125 -bptt 25 -clip 25) or doing a grid search to find good hyper-parameters for the nag optimizer.
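For concreteness, a training invocation along those lines might look like the sketch below. Only the adam settings (-optim adam -lr 0.0003125 -bptt 25 -clip 25) come from this thread; the model, language, data, and save flags are assumed placeholders, so check them against the README before running:

```sh
# Hedged sketch of training the blstm model with the README's adam settings.
# Only -optim adam -lr 0.0003125 -bptt 25 -clip 25 are taken from this thread;
# -sourcelang/-targetlang/-datadir/-model/-savedir are assumed placeholders.
fairseq train -sourcelang de -targetlang en -datadir data-bin/iwslt14.tokenized.de-en \
  -model blstm -optim adam -lr 0.0003125 -bptt 25 -clip 25 -savedir trainings/blstm
```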
I'm training a set of translation models using the suggested fconv parameters (but with the model switched to blstm). I'm seeing loss and perplexity become nan after a few epochs. Is this something I should expect? Would you guess this is a parameter configuration issue (e.g. the optimizer being too aggressive and overflowing), or does this suggest a bug (e.g. an overflow in the loss or perplexity code)?