dvgodoy / deepreplay

Deep Replay - Generate visualizations as in my "Hyper-parameters in Action!" series!
https://towardsdatascience.com/hyper-parameters-in-action-a524bf5bf1c
MIT License

Linear or non-linear #16

Closed lubu-Alex closed 5 years ago

lubu-Alex commented 5 years ago

Sorry, there's a problem that has troubled me for a long time. Why is the decision boundary in the transformed feature space linear, while the boundary in the original feature space is nonlinear? Is it your code that forces the boundary in the transformed feature space to be linear? Thank you so much.

dvgodoy commented 5 years ago

Hi,

To fully understand the reason behind this behavior, I suggest you read my post that gave birth to this package :-) https://towardsdatascience.com/hyper-parameters-in-action-a524bf5bf1c

In short, it is not the code but the activation function that is responsible for this pattern. You see a linear boundary at the end because the output has a linear activation, and it operates on a transformed space (deformed by a sigmoid/tanh/ReLU). If we "convert" this straight line back to the original, undeformed space, it becomes non-linear.
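To make this concrete, here is a minimal NumPy sketch of the idea. The weights below are hand-picked, hypothetical values (not taken from the package or the article); the point is only that the output neuron's z-value is linear in the hidden activations, so its zero level set is a straight line in the transformed space, yet the same boundary pulled back to the input space is curved:

```python
import numpy as np

# Hypothetical weights for a 2 -> 2 sigmoid hidden layer
# and a 2 -> 1 output neuron (its z-value, before the sigmoid).
W_h = np.array([[2.0, -1.0],
                [1.0,  3.0]])
b_h = np.array([0.5, -0.5])
w_o = np.array([1.0, -2.0])
b_o = 0.3

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Grid of points in the ORIGINAL input space
xx, yy = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
X = np.c_[xx.ravel(), yy.ravel()]

# Hidden activations: coordinates in the TRANSFORMED feature space
A = sigmoid(X @ W_h + b_h)

# z-value of the output neuron: a linear function of (a1, a2),
# so z = 0 is a straight line in the transformed space...
z = A @ w_o + b_o

# ...but the same boundary, seen in the input space, is curved:
near_boundary = np.abs(z.reshape(xx.shape)) < 1e-2
```

Plotting the contour `z = 0` over `(A[:, 0], A[:, 1])` gives a straight line, while plotting it over `(xx, yy)` gives a curve, which is exactly the effect the animations show.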

lubu-Alex commented 5 years ago

Believe it or not, I read your article half a year ago and felt that the visualization was fantastic. Then I used your code to reproduce it myself. Just yesterday I saw your article again, read it several times, and kept wondering: why is the decision boundary a straight line?

```python
model = Sequential()
model.add(Dense(input_dim=2, units=2, activation='sigmoid',
                kernel_initializer=glorot_normal(seed=42), name='hidden'))
model.add(Dense(units=1, activation='sigmoid',
                kernel_initializer=normal(seed=42), name='output'))
```

In your code, the output is not a linear activation but a sigmoid activation. If the output layer has a sigmoid activation, the decision boundary cannot be a straight line.

> the actual role of the non-linearity is to twist and turn the feature space so much so that the boundary turns out to be… LINEAR!

Why can the non-linearity transform the boundary into a linear one? Is it a coincidence? Thank you very much for your answer.

dvgodoy commented 5 years ago

You're absolutely right, the final activation is indeed a sigmoid. But its only purpose is to "squash" the z-value (which would be the linear activation) into the [0, 1] interval, so it can be read as a probability.

Now, consider the z-value itself: it is a linear combination of the outputs of the two neurons in the previous layer, right? So the z-value of the output neuron is something like a1*w1 + a2*w2, where a1 and a2 are the values coming from the two neurons in the previous layer, and w1 and w2 are the weights.

Also, remember that sigmoid(0) = 0.5, meaning that for a z-value of zero the sigmoid gives a probability of 50%. Then, if we want to plot the boundary, we are saying: for a probability greater than 0.5 we predict the positive class, and for a probability smaller than 0.5 we predict the negative class. Well, this is exactly the same as saying: if the z-value is greater than zero, it is the positive class; if the z-value is less than zero, it is the negative class. But what is the z-value here? It is a1*w1 + a2*w2, right? So we are saying: if a1*w1 + a2*w2 > 0, positive class; if a1*w1 + a2*w2 < 0, negative class.

So the boundary is given by a1*w1 + a2*w2 = 0, right? This is a straight line! That's why the decision boundary is linear: the sigmoid is just converting it into probabilities at the output.

BUT what the sigmoid (or tanh, or ReLU) does in the other layers has a different effect... As you can see from my animations, its non-linearity is "twisting" the points from their original positions to some other location in the grid. At some point, all the points will be placed in such a way that the straight line splits them nicely. You could also think of it as having a straight line somewhere in the grid (as if we knew beforehand that the linear boundary is there), and then we play with the other weights in the network in such a way that, eventually, the twisting effect of the activations places the points on either side of that boundary. It is hard to explain all this in text, but I hope it helps...
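A quick numerical sketch of that equivalence, with made-up output-layer weights w1, w2 and a bias b (added here for completeness, even though the explanation above leaves it out), not taken from any trained model:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Made-up output-layer weights and bias, just for illustration
w1, w2, b = 1.5, -0.8, 0.2

# Random "activations" (a1, a2), as if they came from the hidden layer
rng = np.random.default_rng(42)
a = rng.uniform(0, 1, size=(1000, 2))

z = a[:, 0] * w1 + a[:, 1] * w2 + b   # linear combination (z-value)
p = sigmoid(z)                        # output probability

# Thresholding the probability at 0.5 classifies every point
# exactly the same way as thresholding z at 0:
assert np.array_equal(p > 0.5, z > 0)

# So the boundary p = 0.5 is z = 0, i.e. a1*w1 + a2*w2 + b = 0,
# which is a straight line in the (a1, a2) plane.
```

The sigmoid only relabels the two half-planes as "probability above 0.5" and "probability below 0.5"; it never bends the boundary itself.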

lubu-Alex commented 5 years ago

Thank you so much. Now I understand. By the way, your blog is great. You can add this explanation to it :)