Open yzbx opened 5 years ago
Based on our experiments, we can establish that the initial learning rate is in fact the most important hyperparameter in neural network training. We agree with the recommendation provided in [2] that one should pick the largest possible learning rate that does not cause the model to diverge. Batch normalization enables us to sample possible values for the initial learning rate from a wider range.

After the initial learning rate is chosen, the next crucial hyperparameter is the learning rate decay. In our experiments, we found that adaptively decaying the learning rate based on the validation accuracy measured after each epoch performs strictly better than exponential or power decay. Naturally, one can find optimal parameters for power and exponential decay with cross-validation, but decaying the learning rate based on the validation accuracy is an intuitive heuristic that works very well in practice.

Regarding weight initialization, we recommend the use of variance-preserving initialization schemes such as the ones discussed in chapter 2, whether batch normalization is used or not. Specifically, we recommend using Kaiming initialization for rectified activations such as ReLU and ELU. Although saturating nonlinearities are not recommended, if they are used one should favor Xavier initialization for sigmoids and hyperbolic tangents.
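Two of the heuristics above can be sketched in plain Python. This is a minimal illustration, not the exact procedure from the text: the class name `ValidationLRScheduler` and the parameters `factor`, `patience`, and `min_lr` are my own assumptions, and the init helpers just compute the standard deviations used by Kaiming (`sqrt(2 / fan_in)`) and Xavier (`sqrt(2 / (fan_in + fan_out))`) initialization.

```python
import math
import random


class ValidationLRScheduler:
    """Halve the learning rate when validation accuracy stops improving.

    A sketch of the adaptive decay heuristic: after each epoch, pass in
    the measured validation accuracy; if it has not improved for
    `patience` consecutive epochs, multiply the learning rate by `factor`.
    """

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best_acc = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc):
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


def kaiming_std(fan_in):
    # He initialization: preserves variance through rectified
    # activations such as ReLU and ELU.
    return math.sqrt(2.0 / fan_in)


def xavier_std(fan_in, fan_out):
    # Glorot initialization: suited to saturating activations
    # such as sigmoid and tanh.
    return math.sqrt(2.0 / (fan_in + fan_out))


def init_weights(fan_in, fan_out, std, seed=0):
    # Sample a fan_in x fan_out weight matrix from N(0, std^2).
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]
```

For example, feeding the scheduler the accuracies `0.5, 0.6, 0.6, 0.6, 0.7` with `patience=2` triggers one halving after the two stagnant epochs and then leaves the rate unchanged once accuracy improves again.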
Training