Closed — accosmin closed this issue 8 years ago
Adadelta produces NaNs when training convolutional networks on MNIST.

This also appears for other stochastic methods (e.g. Adam or AdaGrad) and seems related to the "epsilon" parameter being too high when normalizing the weighted gradient. However, these configurations are correctly pruned during tuning.
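For reference, a minimal sketch of the standard Adadelta update (per Zeiler's formulation), showing where epsilon enters the normalization and a finiteness guard like the pruning mentioned above. Function and variable names are illustrative, not this repository's API:

```python
import numpy as np

def adadelta_step(grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One Adadelta update. Returns the parameter delta and the updated
    running averages. `eps` is the normalization constant discussed in
    the issue: if the gradient history is near zero, the ratio of the
    two square roots is sensitive to its value."""
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2
    return delta, avg_sq_grad, avg_sq_delta

def is_finite_update(delta):
    # Guard that rejects NaN/inf updates, mirroring the tuning-time
    # pruning of bad epsilon configurations mentioned above.
    return bool(np.all(np.isfinite(delta)))
```

A quick usage check on a small gradient vector confirms the update stays finite for a reasonable epsilon:

```python
grad = np.array([0.1, -0.2])
delta, sq_g, sq_d = adadelta_step(grad, np.zeros(2), np.zeros(2))
print(is_finite_update(delta))
```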