datalass1 / fastai

This repo will show code and notes covered during the fastai course.

lesson 3 articles and research to read #11

Closed datalass1 closed 5 years ago

datalass1 commented 5 years ago

learning rates:

https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0
https://miguel-data-sc.github.io/2017-11-05-first/

pytorch:

https://towardsdatascience.com/a-practitioners-guide-to-pytorch-1d0f6a238040

advanced:

http://teleported.in/posts/decoding-resnet-architecture/

datalass1 commented 5 years ago

Estimating an Optimal Learning Rate For a Deep Neural Network

Pavel Surmenok

How does learning rate impact training? Deep learning models are typically trained by a stochastic gradient descent optimizer. There are many variations of stochastic gradient descent: Adam, RMSProp, Adagrad, etc. All of them let you set the learning rate. This parameter tells the optimizer how far to move the weights in the direction opposite of the gradient for a mini-batch.
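As a toy illustration (not from the article), a single SGD step in PyTorch just moves a weight opposite to its gradient, scaled by the learning rate:

import torch

w = torch.tensor([2.0], requires_grad=True)   # a single weight
lr = 0.1                                      # the learning rate

loss = (w - 5) ** 2          # toy loss with its minimum at w = 5
loss.backward()              # d(loss)/dw = 2 * (w - 5) = -6 here
with torch.no_grad():
    w -= lr * w.grad         # step opposite the gradient: 2.0 - 0.1 * (-6) = 2.6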

If the learning rate is low, then training is more reliable, but optimization will take a lot of time because steps towards the minimum of the loss function are tiny.

If the learning rate is high, then training may not converge or even diverge. Weight changes can be so big that the optimizer overshoots the minimum and makes the loss worse.

[Image: GradientDescent]

The trick is to train a network starting from a low learning rate and increase the learning rate exponentially for every batch.
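A minimal sketch of that range test in plain PyTorch, assuming a model, loss_fn and train_loader already exist (the helper name is made up; fastai's learning rate finder, shown further down, does all of this for you):

import torch

def lr_range_test(model, loss_fn, train_loader, lr_start=1e-7, lr_end=10):
    n = max(len(train_loader) - 1, 1)
    mult = (lr_end / lr_start) ** (1 / n)      # exponential growth factor per batch
    opt = torch.optim.SGD(model.parameters(), lr=lr_start)
    lr, lrs, losses = lr_start, [], []
    for xb, yb in train_loader:
        opt.param_groups[0]["lr"] = lr         # set this batch's learning rate
        loss = loss_fn(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()
        lrs.append(lr)
        losses.append(loss.item())
        if loss.item() > 4 * min(losses):      # stop once the loss blows up
            break
        lr *= mult                             # increase the learning rate exponentially
    return lrs, losses                         # plot losses against lrs on a log x-axis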

[Image: TrainingLearningRate]

Record the learning rate and training loss for every batch. Then, plot the loss and the learning rate. Typically, it looks like this:

[Image: LRvsLoss]

First, with low learning rates, the loss improves slowly, then training accelerates until the learning rate becomes too large and loss goes up: the training process diverges. We need to select a point on the graph with the fastest decrease in the loss. In this example, the loss function decreases fast when the learning rate is between 0.001 and 0.01.

Another way to look at these numbers is to calculate the rate of change of the loss (a derivative of the loss with respect to iteration number), then plot that rate of change on the y-axis against the learning rate on the x-axis, using a simple moving average to smooth the graph and reduce noise.

[Image: d/loss]

The fastai library provides an implementation of the learning rate finder. You need just two lines of code to plot the loss over learning rates for your model:

learn.lr_find()       # run the learning rate range test
learn.sched.plot()    # plot loss vs. learning rate (log scale)

The code to plot the rate of change of the loss function is as follows:

import matplotlib.pyplot as plt

def plot_loss_change(sched, sma=1, n_skip=20, y_lim=(-0.01, 0.01)):
    """
    Plots the rate of change of the loss function.
    Parameters:
        sched - learning rate scheduler, an instance of the LR_Finder class.
        sma - number of batches for the simple moving average that smooths the curve.
        n_skip - number of batches to skip on the left of the plot.
        y_lim - limits for the y axis.
    """
    # approximate d(loss)/d(iteration) with a finite difference over `sma` batches
    derivatives = [0] * (sma + 1)
    for i in range(1 + sma, len(sched.lrs)):
        derivative = (sched.losses[i] - sched.losses[i - sma]) / sma
        derivatives.append(derivative)

    plt.ylabel("d/loss")
    plt.xlabel("learning rate (log scale)")
    plt.plot(sched.lrs[n_skip:], derivatives[n_skip:])
    plt.xscale('log')
    plt.ylim(y_lim)

plot_loss_change(learn.sched, sma=20)

Note that selecting a learning rate once, before training, is not enough. The optimal learning rate decreases while training.

There is more to it

The conventional wisdom is that the learning rate should decrease over time, and there are multiple ways to set this up: step-wise learning rate annealing when the loss stops improving, exponential learning rate decay, cosine annealing, etc.

Cyclical Learning Rates for Training Neural Networks by Leslie N. Smith describes a novel way to change the learning rate cyclically.
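For reference, the schedules mentioned above all have ready-made implementations in torch.optim.lr_scheduler. A minimal sketch, picking one scheduler per training run:

import torch
from torch import nn, optim

model = nn.Linear(10, 2)                    # stand-in model
opt = optim.SGD(model.parameters(), lr=0.1)

sched = optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.1)             # step-wise decay
# sched = optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.1, patience=3) # anneal when the loss stops improving
# sched = optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)                 # exponential decay
# sched = optim.lr_scheduler.CosineAnnealingLR(opt, T_max=50)               # cosine annealing
# sched = optim.lr_scheduler.CyclicLR(opt, base_lr=1e-4, max_lr=1e-2)       # cyclical LR (Leslie Smith)

for epoch in range(3):
    # ... one epoch of training goes here ...
    sched.step()   # ReduceLROnPlateau wants sched.step(val_loss); CyclicLR is usually stepped per batch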

datalass1 commented 5 years ago

Visualizing Learning rate vs Batch size, Neural Nets basics using Fast.ai tools

Miguel Perez Michaus

[Image: BatchSize]

A bigger batch size shows a bigger optimal learning rate, but the picture gives more subtle and complete information. I find it interesting to see how those curves relate to each other; it is also worth noting how the noise in the relationship increases as the batch size gets smaller.

datalass1 commented 5 years ago

A practitioner's guide to PyTorch

Radek Osmulski

Tensor — (like) a numpy.ndarray but can live on the GPU.

Variable — wraps a tensor so it can take part in a computation graph. If created with requires_grad = True, it will have gradients calculated during the backward phase.
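A minimal sketch of both (the Variable wrapper is the pre-0.4 API the article is written against; from PyTorch 0.4 onwards it is merged into Tensor):

import torch
from torch.autograd import Variable   # pre-0.4 wrapper the article refers to

# Tensor: (like) a numpy.ndarray, but it can live on the GPU
x = torch.ones(3, 2)
if torch.cuda.is_available():
    x = x.cuda().cpu()                 # round trip to the GPU and back

# Variable: wraps a tensor so autograd can track it
w = Variable(torch.randn(2, 1), requires_grad=True)
loss = x.mm(w).pow(2).sum()            # a small computation graph
loss.backward()                        # fills w.grad with d(loss)/d(w)
print(w.grad.shape)                    # torch.Size([2, 1])

# modern equivalent: w = torch.randn(2, 1, requires_grad=True)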

Things to keep in mind (or else risk going crazy)

  1. Datatypes matter!
  2. If it can overflow or underflow, it will. See the AI With The Best Oct 2017 talk for more on numerical stability.
  3. Gradients accumulate by default! Run a computation once, run it backwards, and everything is fine. But on the second run, the new gradients get added to the gradients from the first operation (see the sketch below this list).
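A minimal sketch of point 3, using the modern tensor API:

import torch

w = torch.ones(2, requires_grad=True)

(w * 3).sum().backward()
print(w.grad)          # tensor([3., 3.])

(w * 3).sum().backward()
print(w.grad)          # tensor([6., 6.]) -- added to the first gradient, not replaced

w.grad.zero_()         # the usual fix; an optimizer would call optimizer.zero_grad()
(w * 3).sum().backward()
print(w.grad)          # tensor([3., 3.]) again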

Visit the official PyTorch tutorials.

datalass1 commented 5 years ago

Decoding the ResNet architecture

Anand Saha

The takeaway is that you should not be using smaller networks because you are afraid of overfitting. Instead, you should use as big of a neural network as your computational budget allows, and use other regularization techniques to control overfitting - cs231n

It is well known that increasing the depth leads to the exploding or vanishing gradients problem if weights are not properly initialized. However, that can be countered by techniques like batch normalization.

Make it deep, but remain shallow

Given a shallower network, how can we take it, add extra layers and make it deeper, without losing accuracy or increasing error? It's tricky to do, but one insight is that if the extra layers added to the deeper network are identity mappings, the deeper network becomes equivalent to the shallower one. Hence, it should produce no higher training error than its shallower counterpart. This is called a solution by construction.

[Image: shallowvsdeepnetwork]

Paper to read: Deep Residual Learning for Image Recognition.

Understanding residual

A residual is the error in a result. Say you are asked to predict the age of a person just by looking at her. If her actual age is 20 and you predict 18, you are off by 2; 2 is the residual here. If you had predicted 21, you would have been off by -1, the residual in that case. In essence, the residual is what you should have added to your prediction to match the actual value.
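Putting the two ideas together, a minimal PyTorch sketch of a residual block, as an illustration of y = F(x) + x rather than the exact block from the paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """Computes y = F(x) + x, so the convolutions only have to learn the residual F(x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # identity shortcut: if the convolutions learn ~0, the block is an identity mapping
        return F.relu(out + x)

x = torch.randn(1, 64, 56, 56)            # a batch of one 64-channel feature map
print(BasicResidualBlock(64)(x).shape)    # torch.Size([1, 64, 56, 56])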