
Lesson 6 - Regularization; Convolutions; Data ethics #36

Closed datalass1 closed 5 years ago

datalass1 commented 5 years ago

Today we discuss some powerful techniques for improving training and avoiding over-fitting:

- Dropout: remove activations at random during training in order to regularize the model.
- Data augmentation: modify model inputs during training in order to effectively increase data size.
- Batch normalization: adjust the parameterization of a model in order to make the loss surface smoother.

Next up, we'll learn all about convolutions, which can be thought of as a variant of matrix multiplication with tied weights, and are the operation at the heart of modern computer vision models (and, increasingly, other types of models too).

We'll use this knowledge to create a class activation map, which is a heat map that shows which parts of an image were most important in making a prediction.

Finally, we'll cover a topic that many students have told us is the most interesting and surprising part of the course: data ethics. We'll learn about some of the ways in which models can go wrong, with a particular focus on feedback loops, why they cause problems, and how to avoid them. We'll also look at ways in which bias in data can lead to biased algorithms, and discuss questions that data scientists can and should be asking to help ensure that their work doesn't lead to unexpected negative outcomes.

datalass1 commented 5 years ago

Platform.ai

The focus is on batch labeling dozens, maybe hundreds, of examples at a time and learning from every aspect of the human interaction (not just the explicit label). Platform.ai leverages transfer learning with an ImageNet-trained network and various dimensionality reduction techniques. It's a bit like clustering: a human labels a few examples and machine learning then helps with the rest.

Regularization for Tabular Learner

The Rossmann data set covers about 3,000 drug stores in Europe, and you're trying to predict how many products they're going to sell in the next couple of weeks.

It's a time series: we're predicting future values.

Another interesting thing about it is the evaluation metric they provided: root mean squared percentage error.
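Written out (a reconstruction of the formula shown in the lesson, using the standard Kaggle definition):

$$\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}$$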

add_datepart(train, "Date", drop=False)
add_datepart(test, "Date", drop=False)

add_datepart takes a date like 2015-07-31 00:00:00 and expands it into a set of columns (day of week, day of month, month start/end, and so on), so the model can figure out things like "the fifteenth of the month is when interesting things happen."


The key thing here is that, for a particular date and a particular store ID, we want to predict the number of sales. Sales is the dependent variable.

Preprocessors

Transforms are bits of code that run every time something is grabbed from a data set, so they're really good for data augmentation.

Preprocessors are like transforms, but a little bit different: they run once, before you do any training. Really importantly, they run once on the training set, and then any state or metadata that's created is shared with the validation and test sets.

Let me give you an example. When we've been doing image recognition, we've had a set of classes for all the different pet breeds, and they've been turned into numbers. The thing that's actually doing that for us is a preprocessor that's created in the background. It makes sure that the classes for the training set are the same as the classes for the validation set and the test set.

We create a small subset of the data to play with:

# take a random sample of 2,000 row indices (n = len(train_df) in the lesson notebook)
idx = np.random.permutation(range(n))[:2000]
idx.sort()
small_train_df = train_df.iloc[idx[:1000]]
small_test_df = train_df.iloc[idx[1000:]]
small_cont_vars = ['CompetitionDistance', 'Mean_Humidity']
small_cat_vars = ['Store', 'DayOfWeek', 'PromoInterval']
small_train_df = small_train_df[small_cat_vars + small_cont_vars + ['Sales']]
small_test_df = small_test_df[small_cat_vars + small_cont_vars + ['Sales']]
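In the lesson notebook the Categorify preprocessor is then applied to these frames, roughly like this (a sketch of the fastai v1 API): it is fitted on the training subset and then reused, unchanged, on the test subset.

categorify = Categorify(small_cat_vars, small_cont_vars)
categorify(small_train_df)              # learns the category levels from the training subset
categorify(small_test_df, test=True)    # reuses those same levels on the test subset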

Normalise/standardise: subtract the mean and divide by the standard deviation.
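A minimal pandas sketch of what that means for the continuous columns here, reusing the training-set statistics on the test set (exactly the shared-state behaviour described in the preprocessor section above):

means = small_train_df[small_cont_vars].mean()
stds = small_train_df[small_cont_vars].std()
small_train_df[small_cont_vars] = (small_train_df[small_cont_vars] - means) / stds
small_test_df[small_cont_vars] = (small_test_df[small_cont_vars] - means) / stds   # same stats, not recomputed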

You've actually got to think carefully about which things should be categorical variables. On the whole, if in doubt and there are not too many levels in your category (the number of levels is called the cardinality), I would make it a categorical variable, e.g. day of week or day of month.

Float = regression; int = classification. So we set the label class with label_cls=FloatList: when we label it, we have to tell it that the class of the labels we want is a list of floats, not a list of categories (which would otherwise be the default). This is the thing that automatically turns this into a regression problem for us.

log=True means it's going to take the logarithm of my dependent variable. Why am I doing that? Because the evaluation metric is root mean squared percentage error: percentage errors are ratios, and taking the log turns ratios into differences, so minimizing RMSE on log(y) lines up with minimizing the percentage error.
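Putting those pieces together, the labelling step looks roughly like this in the fastai v1 data block API (a sketch; df, path, cat_vars, cont_vars, procs, valid_idx and dep_var are assumed to be defined as in the lesson notebook):

data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
        .split_by_idx(valid_idx)                                     # hold out a validation set
        .label_from_df(cols=dep_var, label_cls=FloatList, log=True)  # float labels + log -> regression on log(Sales)
        .databunch())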

For a tabular model, our architecture is literally the most basic fully connected network: It's an input, matrix multiply, non-linearity, matrix multiply, non-linearity, matrix multiply, non-linearity, done.

We use regularization (e.g. weight decay, dropout and embedding dropout) rather than reducing the number of parameters. The intermediate weight matrix has to go from a 1,000-activation input to a 500-activation output, which means there are 500,000 elements in that weight matrix. That's an awful lot for a data set with only a few hundred thousand rows, so this is going to overfit, and we need to make sure it doesn't.

What is dropout? See Dropout: A Simple Way to Prevent Neural Networks from Overfitting (N. Srivastava, G. Hinton et al.). We throw away some activations at random. In a normal fully connected net, each activation is the sum of all of the inputs times the corresponding weights; with dropout, during training we throw some of those activations away at random.
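A minimal PyTorch sketch of what dropout does (not from the lesson notebook): during training each activation is zeroed with probability p and the survivors are scaled up by 1/(1-p), so the expected activation stays the same; at inference time it does nothing.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 10)
print(drop(x))    # roughly half the values are 0, the rest are 2.0 (scaled by 1/(1-p))
drop.eval()       # switch to evaluation mode
print(drop(x))    # dropout is a no-op at inference time: all ones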

In this case, we're going to use a tiny bit of dropout on the first layer (0.001) and a little bit of dropout on the next layer (0.01), and then we're going to use special dropout on the embedding layer.
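In fastai v1 those choices map onto the learner call roughly like this (a sketch; data and y_range are assumed to be defined earlier in the notebook, and the emb_drop value is illustrative):

learn = tabular_learner(data, layers=[1000, 500], ps=[0.001, 0.01], emb_drop=0.04,
                        y_range=y_range, metrics=exp_rmspe)

ps gives the dropout probability for each fully connected layer, and emb_drop is the dropout applied to the embedding outputs.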

G.Hinton noticed every time he went to his bank that all the tellers and staff moved around, and he realized the reason for this must be that they're trying to avoid fraud. "When you actually ask people where did your idea for some algorithm come from, it basically never comes from math; it always comes from intuition and thinking about physical analogies and stuff like that."

Embedding: a method used to represent discrete variables as continuous vectors. Great blog

An embedding dropout is simply just a dropout, i.e. an instance of a dropout module. For continuous variables, a continuous variable is just in one column; you wouldn't want to do dropout on that, because you'd literally be deleting the existence of that whole input, which is almost certainly not what you want. But an embedding is effectively just a matrix multiply by a one-hot-encoded matrix, so it's just another layer. So it makes perfect sense to have dropout on the output of the embedding, because you're putting dropout on the activations of that layer: you're saying let's delete at random some of the results of that embedding (i.e. some of those activations).
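A minimal PyTorch sketch of that idea (stand-in sizes, not the lesson's): the dropout is applied to the embedding's output activations, not to the raw categorical input.

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=8, embedding_dim=4)   # 8 category levels -> 4-d vectors
emb_drop = nn.Dropout(p=0.04)
codes = torch.tensor([1, 5, 2])           # a batch of three category codes
out = emb_drop(emb(codes))                # some embedding activations get zeroed at random
print(out.shape)                          # torch.Size([3, 4])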

Batch Normalisation: see the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe & Szegedy).

In the last two months, there have been two papers (so it took three years for people to really figure this out) showing that batch normalization doesn't reduce internal covariate shift at all, and that even if it did, that has nothing to do with why it works. The paper behind the result described below is How Does Batch Normalization Help Optimization? (Santurkar et al.).


The x-axis is steps (batches) and the y-axis is loss. The red line is what happens when you train without batch norm: very, very bumpy. The blue line is what happens when you train with batch norm: not very bumpy at all. What that means is that you can increase your learning rate with batch norm, because the bumps represent the risk of getting thrown off into an awful part of the weight space.

Remember, in a neural net there are only two kinds of numbers: activations and parameters.

Why is batch norm able to achieve this fantastic result?

  1. The value of our predictions $\hat{y}$ is some function of our various weights (there could be millions of them) and, of course, of the inputs to our layer: $\hat{y} = f(w_1, w_2, \ldots, w_{1{,}000{,}000}, \vec{x})$.
  2. This function $f$ is our neural net function, whatever is going on inside the net. Then our loss, let's say mean squared error, is just actuals minus predicted, squared: $L = \sum (y - \hat{y})^2$.
  3. What if we multiplied the output by $g$ and added $b$, i.e. $\hat{y} = f(w_1, \ldots, w_{1{,}000{,}000}, \vec{x}) \times g + b$? We've added two more parameter vectors. To increase the scale, the number $g$ has a direct gradient to increase the scale; to change the mean, the number $b$ has a direct gradient to change the mean (see the sketch after this list).
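Here is a minimal sketch of a batch-norm layer along those lines (training-time behaviour only; the running statistics used at inference time are left out): normalize the activations, then let the learned g and b put the scale and mean wherever the network wants them.

import torch

def batch_norm(x, g, b, eps=1e-5):
    # x: (batch, features) activations; g, b: learned per-feature scale and shift
    mean = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)    # zero mean, unit variance per feature
    return g * x_hat + b                          # g and b are direct handles on scale and mean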

Data Augmentation

- A kind of regularization.
- There are lots of great transformations for the data.
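In fastai v1 the augmentation pipeline comes from get_transforms; the lesson uses fairly aggressive settings along these lines (a sketch, parameter values approximate):

tfms = get_transforms(max_rotate=20, max_zoom=1.3, max_lighting=0.4, max_warp=0.4,
                      p_affine=1., p_lighting=1.)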

Convolutional Neural Network

This is a heat map: a picture which shows what part of the image the CNN focused on when it was trying to decide what the picture is. We're going to make this heat map from scratch.

A convolution is just a kind of matrix multiply which has some interesting properties; the Setosa image-kernels visualisation is a great way to see one in action.

A fantastic post from Matt Kleinsmith shows convolution represented as a matrix multiplication.

But if you think about it, we actually don't have a 2D input anymore, we have a 3D input (i.e. a rank 3 tensor). So we probably don't want to use the same kernel values for each of red and green and blue, we need to create a 3 by 3 by 3 kernel. But rather than doing an element-wise multiplication of 9 things, we're going to do an element-wise multiplication of 27 things (3 by 3 by 3) and we're still going to then add them up into a single number.

We started with 5 by 5, so we're going to end up with an output that's also 5 by 5. But our input was 3 channels and our output is only one channel. We're not going to be able to do very much with just one channel, because all we've done so far is find the top edge. How are we going to find a side edge? And a gradient? And an area of constant white? Well, we're going to have to create another kernel and convolve it over the input, and that's going to create another 5x5. Then we can stack those together along another axis, and we can do that lots and lots of times, and that's going to give us another rank 3 tensor output.

In order to avoid our memory going out of control, from time to time we create a convolution where we don't step over every single 3x3 patch but instead jump two pixels at a time. We would start with a 3x3 centred at (2, 2), then jump to (2, 4), (2, 6), (2, 8), and so forth. That's called a stride-2 convolution.
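A minimal PyTorch sketch (not from the lesson notebook) of a stride-2 convolution layer; with stride 2 the output height and width are halved, and in practice we usually double the number of channels to compensate:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 3, 352, 352)   # a batch of one RGB image, 352 x 352
print(conv(x).shape)              # torch.Size([1, 8, 176, 176]): half the height and width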

Manual Convolutions

# a hand-made edge-detecting kernel, expanded to shape (1, 3, 3, 3):
# one output channel, three input channels (red, green, blue), 3x3 spatial
k = tensor([
    [0.  ,-5/3,1],
    [-5/3,-5/3,1],
    [1.  ,1   ,1],
]).expand(1,3,3,3)/6

I've created a convolutional kernel. As you can see, this one has a right edge and a bottom edge with positive numbers, and just inside that, it's got negative numbers. So I'm thinking this should show me bottom-right edges.

One complexity is that a plain 3x3 kernel can't be used for this purpose, because I need two more dimensions. The first is a third dimension that says how to combine the red, green and blue channels. So I say .expand: this is my 3x3, and I pop another three on the start; .expand creates a 3 by 3 by 3 tensor by simply copying this one 3 times. (The other extra dimension, the leading 1, is the number of kernels, as the shape below shows.) Printing k:

tensor([[[[ 0.0000, -0.2778,  0.1667],
          [-0.2778, -0.2778,  0.1667],
          [ 0.1667,  0.1667,  0.1667]],

         [[ 0.0000, -0.2778,  0.1667],
          [-0.2778, -0.2778,  0.1667],
          [ 0.1667,  0.1667,  0.1667]],

         [[ 0.0000, -0.2778,  0.1667],
          [-0.2778, -0.2778,  0.1667],
          [ 0.1667,  0.1667,  0.1667]]]])

The 4D tensor is just a bunch of 3D tensors sitting on top of each other:

k.shape
torch.Size([1, 3, 3, 3])

PyTorch orders image tensors as channels by height by width: our image is 3 channels by 352 by 352.
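Applying the hand-made kernel k to one image then looks roughly like this (a sketch; t is assumed to be a single (3, 352, 352) image tensor as described above):

import torch.nn.functional as F

edge = F.conv2d(t[None], k)   # t[None] adds a batch dimension -> input shape (1, 3, 352, 352)
edge.shape                    # torch.Size([1, 1, 350, 350]): one output channel, no padding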

Creating the Heat Map

I basically have my input (red, green, blue). It goes through a bunch of convolutional layers to create activations which have more and more channels and smaller and smaller heights and widths.

Eventually, we ended up with something which was 11 by 11 by 512.

Now there are 37 classes, so the output we need is a vector of length 37.

Somehow we need to get from this 11 by 11 by 512 to this 37. The way we do it is to take the average of each of the 512 11 by 11 faces. We take the mean of the first face, which gives us one value; then the mean of the second face, which gives us another; and so on for every one of the 512 faces. That gives us a 512-long vector.

Now all we need to do is pop that through a single matrix multiply of 512 by 37 and that's going to give us an output vector of length 37. This step here where we take the average of each face is called average pooling.
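A minimal sketch of that head in PyTorch (stand-in activations): average pooling over each 11 by 11 face, then a single 512 by 37 matrix multiply.

import torch
import torch.nn as nn

acts = torch.randn(1, 512, 11, 11)   # stand-in for the final convolutional activations
pooled = acts.mean(dim=[2, 3])       # average pooling: one mean per face -> shape (1, 512)
head = nn.Linear(512, 37)            # the 512 x 37 matrix multiply (plus a bias)
preds = head(pooled)                 # shape (1, 37): one output per pet breed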

For the heat map, we care about the hooked activations rather than the predictions: a hook lets us grab the output of an intermediate layer, here the 11 by 11 by 512 activations.
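The heat map itself is then just those hooked activations averaged across their 512 channels (a sketch with a stand-in tensor; the lesson grabs the real activations with a fastai hook):

import torch

acts = torch.randn(512, 11, 11)   # stand-in for the hooked activations of one image
heatmap = acts.mean(dim=0)        # average over the 512 channels -> an 11 by 11 map
heatmap.shape                     # torch.Size([11, 11]); stretched over the image when displayed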

The most important thing to keep checking as you work through this is .shape.

Ethics and Data Science


There's a lot of bias in the content we're creating because of bias in the people creating that content, even when, in theory, it's being created in a neutral, "you can't argue with the data" way. It's obviously not neutral at all.

To summarize, we are part of the 0.3 to 0.5% of the world that knows how to code. We have a skill that very few other people have. Not only that, we now know how to code deep learning algorithms, which is the most powerful kind of code I know. So I'm hoping that we can explicitly think about at least not making the world worse, and perhaps explicitly making it better.