datalass1 / fastai

This repo will show code and notes covered during the fastai course.

lesson 2 - Read cyclical learning rates for training neural networks #9

Closed datalass1 closed 5 years ago

datalass1 commented 5 years ago

https://arxiv.org/pdf/1506.01186.pdf

datalass1 commented 5 years ago

The essence of this learning rate policy comes from the observation that increasing the learning rate might have a short term negative effect and yet achieve a longer term beneficial effect.

This observation leads to the idea of letting the learning rate vary within a range of values rather than adopting a stepwise fixed or exponentially decreasing value. That is, one sets minimum and maximum boundaries and the learning rate cyclically varies between these bounds.

The paper focuses on the triangular learning rate policy; the other policies tested performed equivalently, and the triangular policy is the simplest.

The length of a cycle and the input parameter stepsize can easily be computed from the number of iterations in an epoch, which is found by dividing the number of training images by the batchsize. For example, CIFAR-10 has 50,000 training images and the batchsize is 100, so an epoch = 50,000/100 = 500 iterations.
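A rough sketch of the triangular policy is below: the update rule follows the formulation in the paper, while the base_lr/max_lr values and the helper name `triangular_lr` are just illustrative.

```python
import numpy as np

def triangular_lr(iteration, stepsize, base_lr, max_lr):
    """Triangular CLR policy: the learning rate ramps linearly from base_lr
    to max_lr over `stepsize` iterations, then back down, so one full cycle
    is 2 * stepsize iterations."""
    cycle = np.floor(1 + iteration / (2 * stepsize))
    x = np.abs(iteration / stepsize - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# CIFAR-10 numbers from the paper: 50,000 images / batchsize 100 = 500 iterations per epoch.
iterations_per_epoch = 50_000 // 100
stepsize = 4 * iterations_per_epoch            # paper suggests 2-10x the iterations per epoch
schedule = [triangular_lr(i, stepsize, base_lr=0.001, max_lr=0.006)
            for i in range(2 * stepsize)]      # learning rates for one full cycle
```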

How can one estimate reasonable minimum and maximum boundary values? With an "LR range test": run your model for several epochs while letting the learning rate increase linearly between low and high LR values. This test is enormously valuable whenever you are facing a new architecture or dataset. Note the learning rate value when the accuracy starts to increase and the value when the accuracy slows, becomes ragged, or starts to fall. These two learning rates are good choices for the bounds; that is, set base_lr to the first value and max_lr to the latter value.
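A minimal sketch of the LR range test, using a hypothetical toy model and random data as stand-ins; in practice you would run your real architecture on your real dataset for a few epochs and plot accuracy (or loss) against the learning rate.

```python
import torch
import torch.nn as nn

# Toy stand-ins for a real model and dataset.
torch.manual_seed(0)
X, y = torch.randn(1000, 20), torch.randint(0, 2, (1000,))
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

low_lr, high_lr, n_steps = 1e-5, 1.0, 200
lrs, losses = [], []
for step in range(n_steps):
    # Linearly increase the learning rate from low_lr to high_lr over the run.
    lr = low_lr + (high_lr - low_lr) * step / (n_steps - 1)
    for group in optimizer.param_groups:
        group["lr"] = lr
    idx = torch.randint(0, X.size(0), (100,))   # mini-batch of 100
    loss = loss_fn(model(X[idx]), y[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    lrs.append(lr)
    losses.append(loss.item())

# Plot losses against lrs: pick base_lr where the metric starts improving
# and max_lr just before it becomes ragged or diverges.
```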

Using the Caffe framework: there is a clear performance improvement when using CLR with an architecture containing sigmoids and batch normalization.

When using adaptive learning rate methods, the benefits from CLR are sometimes reduced, but CLR can still be valuable as it sometimes provides a benefit at essentially no cost.

ResNets, Stochastic Depth, and DenseNets experiments: residual networks and the family of variations that have subsequently emerged achieve state-of-the-art results on a variety of tasks.

AlexNet: since the batchsize in the architecture file is 256, an epoch is equal to 1,281,167/256 = 5,005 iterations. Hence, a reasonable setting for stepsize is 6 epochs, or 30,000 iterations.

Deep Learning Frameworks (a Medium article lists them): TensorFlow, PyTorch, Keras, Caffe, etc. Deep Learning Architectures: ResNet, AlexNet, GoogLeNet, etc.

GoogLeNet/Inception Architecture: the first step is to estimate the stepsize setting. Since the architecture uses a batchsize of 128, an epoch is equal to 1,281,167/128 = 10,009 iterations. Hence, good settings for stepsize would be 20,000, 30,000, or possibly 40,000.

The next step is to estimate the bounds for the learning rate, which is found with the LR range test by making a run for 4 epochs where the learning rate linearly increases from 0.001 to 0.065.
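Reproducing the stepsize arithmetic for the two ImageNet architectures above (just the divisions from the paper; the helper name is my own):

```python
IMAGENET_TRAIN_IMAGES = 1_281_167

def iterations_per_epoch(num_images, batchsize):
    # Iterations per epoch = training images / batchsize (rounded).
    return round(num_images / batchsize)

# AlexNet: batchsize 256 -> ~5,005 iterations per epoch; stepsize of 6 epochs = ~30,000 iterations.
alexnet_epoch = iterations_per_epoch(IMAGENET_TRAIN_IMAGES, 256)
alexnet_stepsize = 6 * alexnet_epoch

# GoogLeNet: batchsize 128 -> ~10,009 iterations per epoch; stepsizes of 2-4 epochs
# give roughly the 20,000-40,000 range mentioned above.
googlenet_epoch = iterations_per_epoch(IMAGENET_TRAIN_IMAGES, 128)
googlenet_stepsize = 3 * googlenet_epoch
```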

Conclusion: this policy is easy to implement and, unlike adaptive learning rate methods, incurs essentially no additional computational expense.

Improved performance for a range of architectures.

Measured approach to setting the learning rate rather than guessing.