Open shaifugpt opened 6 years ago
I am having the same problem. When NaN is returned as an output, the loss is reported as ~0, so you are really not training anything. Playing with it, I'm pretty sure I've isolated the issue to the gradients during backprop. With an optimizer like Adam or RMSprop, I always got NaN outputs within 100 training steps. With SGD, my training has been numerically stable for ~2,000 training steps and counting.
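One common mitigation, regardless of optimizer, is to clip the global gradient norm and skip any update whose gradients contain NaN/Inf. A minimal numpy sketch (the function name `clip_and_check` and the `max_norm` threshold are illustrative, not from any library here):

```python
import numpy as np

def clip_and_check(grads, max_norm=1.0):
    """Clip gradients by global norm; refuse updates with non-finite values."""
    flat = np.concatenate([g.ravel() for g in grads])
    if not np.all(np.isfinite(flat)):
        # A single NaN/Inf gradient poisons all parameters on the next step,
        # so it is safer to skip this update entirely.
        raise FloatingPointError("non-finite gradient detected; skip this update")
    norm = np.linalg.norm(flat)
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]

# Example: a gradient with global norm 5.0 gets rescaled to norm ~1.0.
grads = [np.array([3.0, 4.0])]
clipped = clip_and_check(grads, max_norm=1.0)
print(np.linalg.norm(clipped[0]))
```

Most frameworks ship an equivalent (e.g. `tf.clip_by_global_norm` in TensorFlow or `torch.nn.utils.clip_grad_norm_` in PyTorch), which is usually preferable to hand-rolling it in a real training loop.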
I ran a simple seq2seq model for time series prediction. The in-sample predictions are good, and the training loss went down to the order of 10^-4. But the out-of-sample predictions are very poor and nowhere close to the original series. In fact, after some number of prediction steps, the predicted values become NaN. Any suggestions on why this happens?