ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.

Chapter 4: Ridge regression using Gradient Descent [QUESTION] #510

Open · lelezanardo opened 2 years ago

lelezanardo commented 2 years ago

Ridge regression using Gradient Descent

Hi! I was trying to implement Ridge regression with gradient descent by adding alpha * theta to the MSE gradient vector (where theta is the parameter vector).

So I've used the following code:

import numpy as np  # X_b and y are defined earlier in the chapter

eta = 0.1           # learning rate
n_iterations = 1000
m = 76              # number of training instances
alpha = 1

theta_ridge = np.random.randn(2, 1)  # random starting theta
for iteration in range(n_iterations):
    # MSE gradient plus alpha * theta_ridge[1]
    # (theta_ridge[1] is meant to exclude the intercept theta_ridge[0])
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta_ridge) - y) + alpha * theta_ridge[1]
    theta_ridge = theta_ridge - eta * gradients

theta_ridge
>> array([[-0.09590297],
       [ 0.19180594]])

Expected behavior: I would expect the final parameter vector to be the same as the one I get from the Ridge class in sklearn.linear_model, but it is not:

from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver='lsqr')
ridge_reg.fit(X_b, y)

ridge_reg.coef_
>> array([[0.        , 0.28397242]])

I've also tried different solvers (including 'cholesky', as used in the book), but I always get different results.

ageron commented 2 years ago

Hi @lelezanardo ,

Thanks for your feedback.

Remember that theta_ridge and gradients are both 2D arrays of shape [2, 1]; in other words, they're both column vectors. So when you write alpha * theta_ridge[1], you are actually adding a 1D array of shape [1] (containing a single value) to a column vector. NumPy broadcasting then adds that single value to both elements of the gradients vector, which is not what you want: the intercept gets penalized by the slope's value. Instead, you should add alpha * theta_ridge * [[0.], [1.]], which is equivalent to adding the vector [[0.], [alpha * theta_ridge[1, 0]]].
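For example, here's a quick shape check (just a sketch with made-up numbers; only the shapes matter):

import numpy as np

theta_ridge = np.array([[1.0], [2.0]])  # column vector of shape (2, 1)
alpha = 1

# What the original code adds: theta_ridge[1] has shape (1,),
# so it broadcasts onto BOTH rows of the gradients vector.
alpha * theta_ridge[1]               # array([2.])

# What we want: zero out the intercept row, keep the slope row.
alpha * theta_ridge * [[0.], [1.]]   # array([[0.], [2.]])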

Moreover, Scikit-Learn's Ridge class actually minimizes the Sum of Squared Errors (SSE), not the Mean Squared Error (MSE), and it adds alpha * ||w||^2 to the loss rather than 1/2 * alpha * ||w||^2. Therefore, to get the same result as the Ridge class with an MSE-based gradient, you need to scale alpha by a factor of 2 / m.
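To make the scaling explicit (a quick derivation; here w denotes theta with the intercept zeroed out, i.e. what the [[0.], [1.]] mask produces):

$$
\nabla_\theta \Big[ \lVert X_b \theta - y \rVert^2 + \alpha \lVert w \rVert^2 \Big] = 2\, X_b^T (X_b \theta - y) + 2\, \alpha\, w
$$

Dividing everything by m turns the first term into the MSE gradient and the penalty term into (2 alpha / m) w, which is where the 2 * alpha / m factor below comes from.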

In short, here's the correct code:

eta = 0.1
theta_ridge = np.random.randn(2, 1)  # reset the starting point
for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta_ridge) - y)  # MSE gradient (linear regression)
    gradients += 2 * alpha / m * theta_ridge * [[0.], [1.]]  # add the l2 penalty (intercept excluded)
    theta_ridge = theta_ridge - eta * gradients

Alternatively, you could minimize the SSE like Scikit-Learn does, but then you would have to divide the learning rate by m:

eta = 0.1 / m
theta_ridge = np.random.randn(2, 1)  # reset the starting point
for iteration in range(n_iterations):
    gradients = 2 * X_b.T.dot(X_b.dot(theta_ridge) - y)  # SSE gradient (linear regression)
    gradients += 2 * alpha * theta_ridge * [[0.], [1.]]  # add the l2 penalty (intercept excluded)
    theta_ridge = theta_ridge - eta * gradients
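For what it's worth, here's a quick end-to-end check (just a sketch: the data is made up since I don't have your X_b and y, and the seed, m and the noise level are arbitrary; the point is only that gradient descent with the 2 * alpha / m scaling lands on the same parameters as the Ridge class):

import numpy as np
from sklearn.linear_model import Ridge

np.random.seed(42)
m = 76
X = 2 * np.random.rand(m, 1)             # made-up training data
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]          # add the bias column

alpha, eta, n_iterations = 1, 0.1, 10000
theta_ridge = np.random.randn(2, 1)
for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta_ridge) - y)  # MSE gradient
    gradients += 2 * alpha / m * theta_ridge * [[0.], [1.]]  # scaled l2 penalty
    theta_ridge = theta_ridge - eta * gradients

ridge_reg = Ridge(alpha=1, solver="cholesky")  # default fit_intercept=True
ridge_reg.fit(X_b, y)

print(theta_ridge.ravel())                    # [intercept, slope] from gradient descent
print(ridge_reg.intercept_, ridge_reg.coef_)  # intercept_ and coef_[0, 1] should match the values above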

Here's a gist notebook with the first solution.

I'll update the book to make that clearer. Thanks again!