Open yzbx opened 5 years ago
Based on our experiments, we can establish that the initial learning rate is in fact the most important hyperparameter in neural network training. We agree with the recommendation provided in [2] that one should pick the largest possible learning rate that does not cause the model to diverge. Batch normalization enables us to sample possible values for the initial learning rate from a wider range.

After the initial learning rate is chosen, the next crucial hyperparameter is the learning rate decay. In our experiments, we found that adaptively decaying the learning rate based on the validation accuracy measured after each epoch performs strictly better than exponential or power decay. Naturally, one can find optimal parameters for power and exponential decay with cross-validation, but decaying the learning rate based on the validation accuracy is an intuitive heuristic that works very well in practice.

Regarding weight initialization, we recommend the use of variance-preserving initialization schemes such as the ones discussed in chapter 2, whether batch normalization is used or not. Specifically, we recommend using Kaiming initialization for rectified activations such as ReLU and ELU. Although saturating nonlinearities are not recommended, if they are used one should favor Xavier initialization for sigmoids and hyperbolic tangents.
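Two of the heuristics above can be sketched in plain Python. This is a minimal illustration, not the exact procedure from the text: the class name `ValidationLRScheduler` and the parameters `factor`, `patience`, and `min_lr` are my own assumptions, and the init helpers just compute the standard deviations used by Kaiming (`sqrt(2 / fan_in)`) and Xavier (`sqrt(2 / (fan_in + fan_out))`) initialization.

```python
import math
import random


class ValidationLRScheduler:
    """Halve the learning rate when validation accuracy stops improving.

    A sketch of the adaptive decay heuristic: after each epoch, pass in
    the measured validation accuracy; if it has not improved for
    `patience` consecutive epochs, multiply the learning rate by `factor`.
    """

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best_acc = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc):
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


def kaiming_std(fan_in):
    # He initialization: preserves variance through rectified
    # activations such as ReLU and ELU.
    return math.sqrt(2.0 / fan_in)


def xavier_std(fan_in, fan_out):
    # Glorot initialization: suited to saturating activations
    # such as sigmoid and tanh.
    return math.sqrt(2.0 / (fan_in + fan_out))


def init_weights(fan_in, fan_out, std, seed=0):
    # Sample a fan_in x fan_out weight matrix from N(0, std^2).
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]
```

For example, feeding the scheduler the accuracies `0.5, 0.6, 0.6, 0.6, 0.7` with `patience=2` triggers one halving after the two stagnant epochs and then leaves the rate unchanged once accuracy improves again.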
Training