eriklindernoren / ML-From-Scratch

Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.
MIT License

The gradient in gradient descent is not correct. #38

Closed dschaehi closed 4 years ago

dschaehi commented 6 years ago

Since the gradient is derived from the MSE, it should be divided by the number of training examples. However, this factor is omitted in the code. This makes the learning rate dependent on the number of training examples: the more training examples you have, the larger the gradient becomes. The learning rate should not depend on the number of training examples.

https://github.com/eriklindernoren/ML-From-Scratch/blob/f078fc384e3188922431e6747eefaa1561f361c4/mlfromscratch/supervised_learning/regression.py#L76
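
A minimal sketch of the effect on synthetic data (names and values here are illustrative, not from the repo): the summed gradient grows roughly linearly with the number of examples, while the averaged one stays flat.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_norms(n_samples):
    # Synthetic 1-D regression data: y = 3x + noise (illustrative only)
    X = rng.normal(size=(n_samples, 1))
    y = 3 * X[:, 0] + rng.normal(scale=0.1, size=n_samples)
    w = np.zeros(1)                       # weights before any update
    y_pred = X.dot(w)
    grad_sum = -(y - y_pred).dot(X)       # plain sum over examples, as in the repo
    grad_mean = grad_sum / n_samples      # MSE gradient, divided by n
    return np.linalg.norm(grad_sum), np.linalg.norm(grad_mean)

for n in (100, 1_000, 10_000):
    g_sum, g_mean = gradient_norms(n)
    print(f"n={n:>6}: summed |grad| = {g_sum:10.2f}, averaged |grad| = {g_mean:.2f}")
```

This is exactly why the effective step size ends up coupled to the dataset size.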

akhilvasvani commented 5 years ago

Ok, so I'm a tad confused. According to http://web.mit.edu/zoya/www/linearRegression.pdf:

the MSE for the Linear Regression with regularization is:

$$ J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \left[ y^{(i)} - \hat{\theta}^{\top} \hat{x}^{(i)} \right]^{2} + \frac{\lambda}{2} \lVert \theta \rVert^{2}. $$

The $\frac{1}{n}$ is not distributed over the whole expression, so using np.mean(0.5 * (y - y_pred)**2 + self.regularization(self.w)) does not really make sense to me from what I read.
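
For reference, differentiating that cost with respect to $\theta$ (hats dropped for readability) gives

$$ \nabla_{\theta} J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} - \theta^{\top} x^{(i)} \right] x^{(i)} + \lambda \theta, $$

so the $\frac{1}{n}$ scales the data term of the gradient but not the regularization gradient, which is what the fix below implements.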

I guess he's using np.mean as a shortcut for summing over $n$ and then multiplying by $\frac{1}{n}$.

So here's an easy fix:

mse = np.mean(0.5 * (y - y_pred)**2) + self.regularization(self.w)

grad_w = -(y - y_pred).dot(X) / len(y) + self.regularization.grad(self.w)
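
For context, here is a sketch of how the fixed lines might sit in the repo's gradient-descent loop (the surrounding loop is paraphrased from regression.py near the linked line; attribute names such as self.n_iterations and self.learning_rate are assumptions, not verbatim source):

```python
# Sketch only: the loop body around the fix, not the verbatim repo code
for _ in range(self.n_iterations):
    y_pred = X.dot(self.w)
    # Keep the regularization outside the mean so 1/n scales only the data term
    mse = np.mean(0.5 * (y - y_pred)**2) + self.regularization(self.w)
    self.training_errors.append(mse)
    # Divide the data gradient by the number of samples instead of just summing
    grad_w = -(y - y_pred).dot(X) / len(y) + self.regularization.grad(self.w)
    self.w -= self.learning_rate * grad_w
```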

Food for thought: what's the difference in computation time between using np.mean and summing over all training examples and then multiplying by $\frac{1}{n}$?
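
A quick sketch to measure it (the array size is arbitrary; since np.mean is essentially a sum followed by a single division, the two timings should be nearly identical):

```python
import numpy as np
import timeit

x = np.random.rand(1_000_000)

t_mean = timeit.timeit(lambda: np.mean(x), number=100)
t_sum = timeit.timeit(lambda: np.sum(x) / x.size, number=100)

print(f"np.mean    : {t_mean:.4f} s for 100 runs")
print(f"np.sum / n : {t_sum:.4f} s for 100 runs")
```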