ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
Apache License 2.0

Chapter 4: Ridge regression using Gradient Descent [QUESTION] #510

Open lelezanardo opened 2 years ago

lelezanardo commented 2 years ago

Hi! I was trying to implement Ridge Regression using Gradient Descent by adding alpha * theta to the MSE gradient vector (where theta is the parameter vector).

So I've used the following code:

import numpy as np

eta = 0.1            # learning rate
n_iterations = 1000
m = 76               # number of training instances
alpha = 1

# X_b (training data with a bias column) and y are assumed to be defined already
theta_ridge = np.random.randn(2, 1)  # random starting theta
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta_ridge) - y) + alpha * theta_ridge[1]  # theta_ridge[1] in order to exclude the intercept (theta_ridge[0])
    theta_ridge = theta_ridge - eta * gradients

theta_ridge
>> array([[-0.09590297],
       [ 0.19180594]])

Expected behavior: I would expect the final parameter vector to be the same as the one I get from the Ridge class in sklearn.linear_model, but it is not:

from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver='lsqr')
ridge_reg.fit(X_b, y)

ridge_reg.coef_
>> array([[0.        , 0.28397242]])

I've also tried different solvers (including Cholesky, as used in the book), but I always get different results.

ageron commented 2 years ago

Hi @lelezanardo ,

Thanks for your feedback.

Remember that theta_ridge and gradients are both 2D arrays of shape [2, 1]; in other words, they're both column vectors. So when you add alpha * theta_ridge[1], you are actually adding a 1D array of shape [1], containing a single value, to a column vector. NumPy broadcasting then adds that single value to both elements of the gradients vector, which is not what you want. Instead, you should add alpha * theta_ridge * [[0.], [1.]], which is equivalent to adding the vector [[0.], [alpha * theta_ridge[1, 0]]].
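
To see the broadcasting issue in isolation, here is a tiny sketch (the numbers are made up, just to illustrate the shapes):

import numpy as np

theta_ridge = np.array([[3.], [5.]])  # column vector of shape (2, 1): [intercept, slope]
alpha = 1

# theta_ridge[1] is a 1D array of shape (1,), so it broadcasts over both rows:
print(alpha * theta_ridge[1])                 # [5.]
print(theta_ridge + alpha * theta_ridge[1])   # [[ 8.], [10.]] -> the intercept gets penalized too

# Multiplying by the mask [[0.], [1.]] keeps the penalty on the slope only:
print(alpha * theta_ridge * [[0.], [1.]])     # [[0.], [5.]]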

Moreover, Scikit-Learn's Ridge class actually minimizes the Sum of Squared Errors (SSE), not the Mean Squared Error (MSE), and it adds alpha ||w||² to the loss rather than ½ alpha ||w||². Therefore, to get the same result as the Ridge class, you need to scale alpha by a factor of 2 / m.
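
To convince yourself of that scaling argument, here is a tiny numerical sketch (the data is made up, only the shapes matter): the gradient of Scikit-Learn's objective, SSE + alpha ||w||², is exactly m times the gradient of MSE + (alpha / m) ||w||², so both lead to the same minimizer.

import numpy as np

np.random.seed(0)
m, alpha = 5, 1.0
X_b = np.c_[np.ones((m, 1)), np.random.rand(m, 1)]  # bias column + one feature
y = np.random.rand(m, 1)
theta = np.random.randn(2, 1)
mask = np.array([[0.], [1.]])  # exclude the intercept from the penalty

# Gradient of Scikit-Learn's objective: ||X_b @ theta - y||^2 + alpha * ||w||^2
grad_sse = 2 * X_b.T.dot(X_b.dot(theta) - y) + 2 * alpha * theta * mask

# Gradient of MSE(theta) + (alpha / m) * ||w||^2 (the same objective divided by m)
grad_mse = 2 / m * X_b.T.dot(X_b.dot(theta) - y) + 2 * alpha / m * theta * mask

print(np.allclose(grad_sse / m, grad_mse))  # True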

In short, here's the correct code:

eta = 0.1
for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta_ridge) - y)  # MSE gradient (linear regression part)
    gradients += 2 * alpha / m * theta_ridge * [[0.], [1.]]  # add the l2 penalty, excluding the intercept
    theta_ridge = theta_ridge - eta * gradients

Alternatively, you could minimize the SSE, as Scikit-Learn does, but then you would have to divide the learning rate by m:

eta = 0.1 / m
for iteration in range(n_iterations):
    gradients = 2 * X_b.T.dot(X_b.dot(theta_ridge) - y)  # SSE gradient (linear regression part)
    gradients += 2 * alpha * theta_ridge * [[0.], [1.]]  # add the l2 penalty, excluding the intercept
    theta_ridge = theta_ridge - eta * gradients
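
For completeness, here is a quick end-to-end check of the first version against Scikit-Learn, on synthetic data (made up here just for illustration, similar to the Chapter 4 setup):

import numpy as np
from sklearn.linear_model import Ridge

np.random.seed(42)
m = 76
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]  # add the bias column
alpha, eta, n_iterations = 1, 0.1, 1000

theta_ridge = np.random.randn(2, 1)
for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta_ridge) - y)
    gradients += 2 * alpha / m * theta_ridge * [[0.], [1.]]
    theta_ridge = theta_ridge - eta * gradients

ridge_reg = Ridge(alpha=alpha, solver='lsqr')
ridge_reg.fit(X_b, y)

print(theta_ridge.ravel())                    # [intercept, slope] from gradient descent
print(ridge_reg.intercept_, ridge_reg.coef_)  # should be (almost) the same values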

Here's a gist notebook with the first solution.

I'll update the book to make that clearer. Thanks again!