datalass1 / fastai

This repo will show code and notes covered during the fastai course.

2 - Convolutional Neural Networks #7

Closed. datalass1 closed this issue 5 years ago.

datalass1 commented 5 years ago

This lesson covers image classification and several core deep learning concepts needed to get good performance: what a learning rate is and how to choose a good one, how to vary the learning rate over time, and how to improve a model with data augmentation (including test-time augmentation). It also shares practical tips (such as training on smaller images) and an 8-step process to train a world-class image classifier.

datalass1 commented 5 years ago

sz = 224  # image size 224 x 224, the input size used by ResNet

Learning Rate: the basic concept of the LR is how quickly we zoom in on the solution, i.e. how quickly or slowly the weights (parameters) are updated. Paper: Cyclical Learning Rates for Training Neural Networks, https://arxiv.org/pdf/1506.01186.pdf

If the LR is too low, the model takes too long to train; if it is too high, it overshoots and oscillates. The solution is the learning rate finder. This technique keeps increasing the learning rate from a very small value until the loss stops decreasing. We can plot the learning rate across BATCHES. In fastai lesson 2, 360 is the number of iterations (minibatches) in one pass over the training set (~23,000 images / 64 batch size), i.e. one epoch of stochastic gradient descent.

The LR finder is like a sophisticated form of grid search, where grid search tries to find the best value for a particular hyperparameter.

The next step is to plot loss vs. learning rate to see where our loss stops decreasing.
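A minimal sketch of the learning rate finder using the fastai v0.7 API from the course notebooks (the dataset path and architecture below are illustrative, not from these notes):

```python
from fastai.conv_learner import *  # fastai v0.7, as used in the course

PATH = 'data/dogscats/'   # hypothetical dataset path
sz = 224                  # image size expected by ResNet
arch = resnet34

data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)

learn.lr_find()        # increase the LR each minibatch, recording the loss
learn.sched.plot_lr()  # learning rate vs. iteration (batch)
learn.sched.plot()     # loss vs. learning rate: pick an LR where the loss is still clearly falling
```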

Over-fitting: the model learns patterns specific to the training data rather than more general patterns that transfer to unseen data.

The best way to overcome overfitting is to get more data. Another technique is data augmentation: flipping, zooming, rotating.

Precompute: with precompute=True the activations of the frozen pre-trained layers are computed once and cached, and only the final layers train on top of them. To use data augmentation, precompute needs to be set to False; the convolutional layers stay frozen, but activations are recomputed on each pass so the augmented images actually take effect.
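Continuing the sketch above, turning on augmentation in fastai v0.7 (transform names as in the course notebooks; learning rate and epoch counts are illustrative):

```python
tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)  # flips/zooms suited to side-on photos
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(1e-2, 1)        # augmentation has no effect yet: activations are precomputed from the original images
learn.precompute = False  # recompute activations each pass so the augmented images are actually used
learn.fit(1e-2, 3, cycle_len=1)
```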

Cycle length uses stochastic gradient descent with restarts (SGDR), a variant of learning rate annealing, which gradually decreases the learning rate as training progresses. This is helpful because as we get closer to the optimal weights we want to take smaller steps.

It is important to find places in the weight space that are both accurate and stable, so from time to time we increase the learning rate back up (these are the 'restarts' in SGDR).
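A sketch of SGDR via cycle_len and cycle_mult in fastai v0.7, continuing from the learner above (values illustrative):

```python
learn.fit(1e-2, 3, cycle_len=1)                # 3 cycles of 1 epoch: the LR is annealed within each cycle, then reset (a restart)
learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2)  # cycles of 1, 2 and 4 epochs
learn.sched.plot_lr()                          # shows the annealing-and-restart pattern
```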

Differential learning rates: unfreeze the learner so the pre-trained ImageNet layers can also be fine-tuned. It is important that we don't destroy those layers, so the earlier layers get much smaller learning rates than the head.

Use a numpy array: lr = np.array([1e-4, 1e-3, 1e-2])
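In fastai v0.7 this array gives one learning rate per layer group (early conv layers, later conv layers, the new head). A sketch, continuing from above:

```python
learn.unfreeze()                   # make the pre-trained layers trainable again
lr = np.array([1e-4, 1e-3, 1e-2])  # earliest layers learn 100x slower than the head, so their weights are not destroyed
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
```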

Test Time Augmentation (TTA): makes predictions on a number of randomly augmented versions of the images and averages them.
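A sketch of TTA in fastai v0.7, continuing from the learner above; predictions are averaged over the validation images plus several augmented copies:

```python
log_preds, y = learn.TTA()                  # log-probabilities for each augmented version, plus true labels
probs = np.mean(np.exp(log_preds), axis=0)  # average the probabilities over the augmented versions
accuracy = (np.argmax(probs, axis=1) == y).mean()
```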

Analysing the results

Confusion Matrix: get the preds (predicted classes) and the probs (predicted probabilities), then compare the true labels y against the preds.
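A sketch of this step with scikit-learn, using the TTA probabilities from above (the course notebooks also plot the matrix with a fastai plotting helper):

```python
from sklearn.metrics import confusion_matrix

preds = np.argmax(probs, axis=1)  # predicted class per image
cm = confusion_matrix(y, preds)   # rows: true class, columns: predicted class
print(cm)
```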

Steps for a world-class deep learning model (sketched in code after the list)

  1. Enable data augmentation, and precompute=True
  2. Use lr_find() to find highest learning rate where loss is still clearly improving
  3. Train last layer from precomputed activations for 1-2 epochs
  4. Train last layer with data augmentation (i.e. precompute=False) for 2-3 epochs with cycle_len=1
  5. Unfreeze all layers
  6. Set earlier layers to 3x-10x lower learning rate than next higher layer
  7. Use lr_find() again
  8. Train full network with cycle_mult=2 until over-fitting
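Putting the eight steps together, a compact sketch with the fastai v0.7 API, reusing the names from the sketches above (learning rates and epoch counts are illustrative):

```python
tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)  # 1. enable augmentation
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(arch, data, precompute=True)                  # 1. precompute=True
learn.lr_find()                                                              # 2. highest LR where loss still clearly improves
learn.fit(1e-2, 2)                                                           # 3. last layer on precomputed activations, 1-2 epochs
learn.precompute = False
learn.fit(1e-2, 3, cycle_len=1)                                              # 4. last layer with augmentation
learn.unfreeze()                                                             # 5. unfreeze all layers
lr = np.array([1e-4, 1e-3, 1e-2])                                            # 6. earlier layer groups at lower learning rates
learn.lr_find()                                                              # 7. re-run the LR finder
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)                                  # 8. train the full network until over-fitting
```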

Questions?

What are hyper parameters? Learning rate and the number of epochs, for example; basically, high-level properties of the model such as its complexity or how fast it should learn.

What is an epoch? A single pass through the ENTIRE training set; it consists of multiple iterations of SGD.

What is a batch or mini-batch? A subset of training samples used in one iteration of SGD.

What is accuracy? The ratio of correct predictions to the total number of predictions.

What is loss? In ML the loss function (or cost function) represents the price paid for inaccuracy of predictions.