3-Improving your Image Classifier

Student work (0-10mins)

Starts off sharing some interesting student work and blogs. Links added to issue #11 to read and understand.

How to download data from Kaggle (10-15mins)

Discusses using kaggle-cli or CurlWget to download the data. Symbolic links (symlinks) are also covered and are very handy.

Quick Cats v Dogs (15-20mins)

Really great example of how short the code is using the fastai library.

Using Keras instead of fastai (20-30mins)

Keras with TensorFlow as the backend requires much MUCH more code and more settings (copying and pasting from the web is the best way to do a lot of this), and the model takes longer to train. For the model, i.e. ResNet50, you have to set all the layer settings yourself.
To test out this code, install the dependencies: $ pip install tensorflow-gpu keras

SGDR, differential Learning Rates, Batch Norm freezing for example would all have to be implemented in Keras to get as good results as fastai gets.

Performance is completely different. Better accuracy with both training and validation sets using fastai.

Google may import fastai into Tensorflow.

Dog Breeds and Kaggle (30-40mins)

Works through the Dog Breeds competition and shows how to upload results to Kaggle. To do: insert this into the next model notebook and upload the first dataset to Kaggle! #13

Also how to predict a single image.

Theory - what is going on behind the scenes with EXCEL (40-1hr20mins)

http://setosa.io/ev/image-kernels/
https://www.youtube.com/watch?v=Oqm9vsf_hvU - Otavio Good's visualisation of the MNIST dataset

CNNs with MNIST dataset explained in an excel spreadsheet:

conv-example.xlsx
every pixel of the input is a number between 0 and 1
The result of each calculation is also a number, aka an activation.
The activation is calculated by taking some numbers of the input and applying some kind of linear operation, in this case a convolution kernel (aka filter), to calculate an output.
Max(0, SUM(input3x3*filter3x3)) - a rectified linear unit, aka ReLU
depending on the convolutional filters, horizontal edges will be activated and vertical ones will not be.
the convolutional filters are 3x3 matrices whose values are learned during training of the model.
A second convolutional filter can be added (array, filter and kernel all mean the same thing here). Having more than one is described as a Tensor, which means an array with more dimensions: an additional axis stacks all the different 3x3 filters together.
we can start with 8 randomly generated filters; that is, 8 3x3 matrices with random elements. Given labeled inputs, we can then use stochastic gradient descent to determine the optimal values of these filters, and so allow the neural network to learn what is most important to detect when classifying images.
Zero-Padding: a 3x3 filter operates on the premise that every pixel has 8 surrounding pixels. All zero-padding does is add an extra border of zero pixels around the image before it passes through a filter, so that the output shape from the filter is the same as the input shape.
The result is a Hidden layer, size 2 because it has two filters.
An architecture means how big your kernel is at layer one, how many filters are in your kernel at layer one, and so on for the following layers.
Maxpooling: Put simply, a max pooling layer reduces the dimensionality of images (resolution) by reducing the number of pixels in the image. It does so by replacing an entire NxN area of the input image with the maximum pixel value in that area.
The other reason we use pooling is simply to reduce the number of parameters and the computational load. It is also helpful in controlling overfitting.
in the spreadsheet, maxpooling replaces every non-overlapping 2x2 block with its maximum value.
fully connected layer means giving every activation its own weight, which creates a really big weight matrix, as big as the input. The result is the sum product of every activation with its weight. The example used comes after maxpooling.
Weight-Matrices: a fully connected layer consists of a matrix of weights that acts upon an input through matrix multiplication and produces an output vector that, subject to some bias, is then passed through a non-linearity of some sort (our activation layer).
VGG was one of the first successful deep architectures. It contains fully connected layers, and these fully connected layers can be really big.
multichannels (bands): architectures usually expect 3 channels for RGB, so if an image classification problem has 2 channels (such as the iceberg challenge on Kaggle), a third could be created, for example by averaging the 2 channels or by repeating one of them. What about 4 channels? Could add another level to the convolutional kernels.
The filters start as random numbers, and SGD is used during training to improve these numbers so that the convolutions become less random and more useful.
entropy_example.xlsx: once we have a fully connected layer we want to predict the probability of each class n. Softmax is used: exp(n)/sum(exp(n's)) = probability; each is a number between 0 and 1 AND they total 1.
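The spreadsheet steps above (zero-padding, a 3x3 convolution with ReLU, 2x2 max pooling, then a fully connected sum product) can be sketched in plain Python. This is only an illustration under assumptions: the 4x4 image, the hand-written edge filter and the fully connected weights are made up for the example, whereas in a real network the filter and weights would be learned with SGD.

```python
def zero_pad(image, pad=1):
    """Surround a 2D list with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0.0] * width for _ in range(pad)]
    for row in image:
        padded.append([0.0] * pad + list(row) + [0.0] * pad)
    padded += [[0.0] * width for _ in range(pad)]
    return padded


def conv3x3_relu(image, kernel):
    """Compute max(0, sum(input3x3 * filter3x3)) at every pixel of a zero-padded image."""
    padded = zero_pad(image, 1)
    out = []
    for r in range(len(image)):
        out_row = []
        for c in range(len(image[0])):
            total = sum(padded[r + i][c + j] * kernel[i][j]
                        for i in range(3) for j in range(3))
            out_row.append(max(0.0, total))
        out.append(out_row)
    return out


def max_pool_2x2(activations):
    """Replace every non-overlapping 2x2 block with its maximum value."""
    return [[max(activations[r][c], activations[r][c + 1],
                 activations[r + 1][c], activations[r + 1][c + 1])
             for c in range(0, len(activations[0]) - 1, 2)]
            for r in range(0, len(activations) - 1, 2)]


def fully_connected(activations, weights):
    """Sum product of every activation with its own weight."""
    return sum(a * w
               for act_row, w_row in zip(activations, weights)
               for a, w in zip(act_row, w_row))


# Made-up 4x4 image: bright top half, dark bottom half.
image = [[1.0, 1.0, 1.0, 1.0],
         [1.0, 1.0, 1.0, 1.0],
         [0.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]

# Hand-written filter (an assumption): fires where bright pixels sit above dark ones.
kernel = [[1.0, 1.0, 1.0],
          [0.0, 0.0, 0.0],
          [-1.0, -1.0, -1.0]]

# Made-up fully connected weights, one per pooled activation.
fc_weights = [[0.5, -0.25],
              [0.25, 1.0]]

conv = conv3x3_relu(image, kernel)      # same 4x4 shape as the input, thanks to padding
pooled = max_pool_2x2(conv)             # 2x2 after pooling
output = fully_connected(pooled, fc_weights)
print(conv, pooled, output)
```

Only the rows next to the bright-to-dark boundary get non-zero activations, which is the "horizontal edges will be activated" behaviour described in the notes.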

Softmax is typically used as our final activation layer as output. The softmax function is defined as:

exp(x)/sum(exp(x))

where x is an array of activations.
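A minimal sketch of this definition in plain Python follows. The input activations are made up, and subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the notes; it does not change the result.

```python
import math


def softmax(activations):
    """exp(x) / sum(exp(x)) for a list of activations."""
    # Shifting by the max avoids overflow in exp and leaves the ratios unchanged.
    biggest = max(activations)
    exps = [math.exp(a - biggest) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])   # made-up activations
print(probs, sum(probs))
```

Each output lies between 0 and 1, the outputs total 1, and the largest activation gets the largest probability.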

Hints: always need to know logarithms and exponentials:
ln(x*y) = ln(x) + ln(y)
ln(x/y) = ln(x) - ln(y)
ln(x) = y and exp(y) = x, i.e. log and exp are inverses of each other
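These rules are easy to check numerically. The helper below is a hypothetical name introduced just for this sketch, and the values 3 and 7 are arbitrary positive numbers.

```python
import math


def log_rules_hold(x, y):
    """Return True if the three hint identities hold numerically for positive x and y."""
    product_rule = math.isclose(math.log(x * y), math.log(x) + math.log(y))
    quotient_rule = math.isclose(math.log(x / y), math.log(x) - math.log(y))
    inverse_rule = math.isclose(math.exp(math.log(x)), x)
    return product_rule and quotient_rule and inverse_rule


print(log_rules_hold(3.0, 7.0))
```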

Architecture (added from the fastai wiki): We know now from the Universal Approximation Theorem that any large enough neural network can approximate any arbitrarily complex function.....some of them can learn to solve these problems much faster than others (and will likely generalize better) by having far fewer parameters. This is why we care about understanding architectures such as convolutional neural networks as opposed to trying to solve every problem with deep fully connected neural networks.

Multi-label classification (1hr20mins - 2hr)

satellite imagery competition from Kaggle

"anthropomorphise functions" - the softmax function wants to pick a thing. Understanding the personality of the activation function.

fastai will look at the labels in a csv and, if there is more than one label per image, it will automatically switch into multi-label mode.

HINT: The folder approach will not work, because an image cannot be in multiple folders at the same time. So we have to use the csv approach.
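To see why softmax "wants to pick a thing", here is a small sketch comparing it with a per-class sigmoid (the activation the notes further down mention for multi-label classification). The three activations are made up and stand for an image where the first two labels are both clearly present.

```python
import math


def softmax(activations):
    """exp(x) / sum(exp(x)): the outputs compete and must total 1."""
    exps = [math.exp(a) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """1 / (1 + exp(-x)): each output is judged on its own."""
    return 1.0 / (1.0 + math.exp(-x))


# Made-up activations: two labels strongly present, one absent.
activations = [4.0, 3.5, -2.0]

soft = softmax(activations)
sig = [sigmoid(a) for a in activations]
print(soft)
print(sig)
```

With softmax the two present labels have to share a total of 1, so the second one falls below 0.5. With sigmoid both can be close to 1 at the same time, which is what a multi-label problem needs.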

The images are size 256.

PyTorch will really leverage existing python functionality.

x,y = next(iter(data.val_dl))

A data loader will give you back a mini batch (a data set will give you back a single image or single object). To turn a data loader into an iterator we use the standard Python function iter. To fetch the next mini batch, pass that iterator to next. Iterators and generators are similar. PyTorch is a good reason to learn Python well.

With bs=64 (passed as a keyword argument) we get back a mini batch of labels of size 64x17: 64 images, each with 17 possible classes.
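The iter/next pattern works on any Python iterable, not just PyTorch data loaders. In this sketch a plain list of batches stands in for data.val_dl; make_batches is a hypothetical helper and the data set is just the numbers 0 to 9.

```python
def make_batches(dataset, bs):
    """Hypothetical stand-in for a data loader: split a data set into mini batches of size bs."""
    return [dataset[i:i + bs] for i in range(0, len(dataset), bs)]


dataset = list(range(10))            # a "data set" gives back single items
val_dl = make_batches(dataset, bs=4)

batch_iter = iter(val_dl)            # turn the loader into an iterator
first = next(batch_iter)             # fetch the first mini batch
second = next(batch_iter)            # fetch the one after it
print(first, second)
```

Calling next(iter(val_dl)) in one go, as in the notebook line above, always gives the first mini batch because a fresh iterator is created each time.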

zip is a great way of taking two lists and zipping them together. Let's look at the labels of the first image with this code:

list(zip(data.classes, y[0]))
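A self-contained sketch of the same idea, with made-up stand-ins for data.classes and y[0] (the class names and label values here are assumptions for illustration only):

```python
# Hypothetical class names and one image's multi-label target vector.
classes = ["agriculture", "clear", "water"]
y0 = [1.0, 1.0, 0.0]

# Pair each class name with its label for this image.
pairs = list(zip(classes, y0))

# Keep only the classes that are switched on.
present = [name for name, flag in pairs if flag == 1.0]
print(pairs)
print(present)
```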

images are just matrices of numbers, so to display one more clearly just brighten it by multiplying the values:

plt.imshow(data.val_ds.denorm(to_np(x))[0]*1.4);

Image size: if we use a pretrained model (e.g. from ImageNet) on a problem like cats vs dogs, it starts off nearly perfect, and changing the image size (sz) would effectively kill the pretrained layers, which were trained on an image size of 224 or 256. But there is nothing in ImageNet that looks like satellite imagery; the exception is the layers that find edges and gradients, or textures and repeating patterns. So starting with small images, like sz=64, works quite well for satellite data.

Find out what learning rate to use. Because the data is so unlike ImageNet, lots of fitting is required, and the learning rate for the earlier layers is set quite high. Iterate this a few times with different image sizes and then use TTA at the end.

Questions at the end, and things to note:

what is data.resize? It goes through the data and resizes the images up front, because if the images are 1000x1000, resizing them to sz=64 on the fly actually takes longer than running the CNN. So use resize to speed things up.
f2 is a way of weighting false negatives vs false positives. The Kaggle competition notes that this is the metric used to score the model: F-beta. You can find it in courses/dl1/planet.py: the function f2 simply calls the F-beta score from scikit-learn.
activation function is sigmoid for multilabel classification. It's what we do in logistic regression!
differential learning rates mean you don't need to unfreeze subsets of layers one at a time. The idea is that once you have trained the last layer, you can go on to unfreeze all layers and apply differential learning rates.
training means setting weights
size 64 means taking the smallest edge, resizing it to 64 and taking the centre area of the image, although if you use data augmentation it takes a randomly generated crop instead. Without data augmentation, a square input is no problem, but with a rectangle you will miss some features on the periphery. That's why it is important to use the randomly generated crops.
learn.summary() to view the model and more details.
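To make the f2 note above concrete, here is a from-scratch sketch of the F-beta score for a single flat vector of 0/1 labels. It is not the code from planet.py, which calls the library version, and the label vectors are made up. With beta=2, recall counts for more than precision, so a missed label (false negative) hurts more than an extra one (false positive).

```python
def f_beta(y_true, y_pred, beta=2.0):
    """F-beta score for two equal-length lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


y_true = [1, 1, 1, 1, 0, 0]
missed_one = [1, 1, 1, 0, 0, 0]   # one false negative
extra_one = [1, 1, 1, 1, 1, 0]    # one false positive

f2_missed = f_beta(y_true, missed_one, beta=2.0)
f2_extra = f_beta(y_true, extra_one, beta=2.0)
print(f2_missed, f2_extra)
```

Both predictions make exactly one mistake, yet the one that misses a true label scores lower under f2.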

Structured and Time Series Data - looking at grocery data (2hr - end)

Two types of data in ML:

unstructured: audio, video, image, NL text.
structured: profit and loss statement, facebook user info - each row represents an observation. Structured data is often not considered important in the research world, but it's the stuff that makes the world go round. So we will consider it in fastai, because it is practical deep learning.

lesson3-rossman: predicting what, and how much of it, will be sold in a store on a particular day.

Get the data: go to the folder you want the data to be in and run this on the command line (it's not behind a login): $ wget
Look at the ML course to understand feature engineering.
Lots of code [a,b,c etc] within columns. A data dictionary will help decipher what this means but it is not important at the beginning, start by seeing what does the data say.
Relational dataset with lots of tables you will want to merge together.
The feather format (to_feather) is really useful for saving dataframes.
Some data is categorical (day of week) and some is continuous (distance). The categorical data will be one-hot encoded. The continuous data will be fed into fully connected layers just as it is.
create a validation set.
taking the data from a dataframe.
use lr_find to find a good learning rate.
use m.fit to train the model.
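The categorical vs continuous split in the list above can be sketched as follows. The column names and values are hypothetical and this is not the notebook's own code: it only shows one-hot encoding a day of the week and passing a distance through unchanged.

```python
def one_hot(value, categories):
    """Return a list with 1.0 in the position of `value` and 0.0 everywhere else."""
    return [1.0 if value == category else 0.0 for category in categories]


days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical row of structured data: one categorical and one continuous column.
row = {"day_of_week": "Wed", "distance": 12.5}

# Categorical column becomes 7 numbers; the continuous one is appended as it is.
features = one_hot(row["day_of_week"], days) + [row["distance"]]
print(features)
```

The resulting list of 8 numbers is the kind of input a fully connected layer can take.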

HOMEWORK: Enter lots of Kaggle competitions, test out the fastai techniques on lots of image datasets. This will help understand the content in lesson 4.
