eriklindernoren / ML-From-Scratch

Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.
MIT License

The gradient in gradient descent is not correct. #38

Closed dschaehi closed 4 years ago

dschaehi commented 6 years ago

Since the gradient is derived from the MSE, it should be divided by the number of training examples. However, this factor is omitted in the code. This makes the learning rate dependent on the number of training examples: the more training examples you have, the larger the gradient becomes. The learning rate should not depend on the number of training examples.

https://github.com/eriklindernoren/ML-From-Scratch/blob/f078fc384e3188922431e6747eefaa1561f361c4/mlfromscratch/supervised_learning/regression.py#L76
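
A minimal sketch of the effect on synthetic data (names and values here are illustrative, not from the repo): the summed gradient grows roughly linearly with the number of examples, while the averaged one stays flat.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_norms(n_samples):
    # Synthetic 1-D regression data: y = 3x + noise (illustrative only)
    X = rng.normal(size=(n_samples, 1))
    y = 3 * X[:, 0] + rng.normal(scale=0.1, size=n_samples)
    w = np.zeros(1)                       # weights before any update
    y_pred = X.dot(w)
    grad_sum = -(y - y_pred).dot(X)       # plain sum over examples, as in the repo
    grad_mean = grad_sum / n_samples      # MSE gradient, divided by n
    return np.linalg.norm(grad_sum), np.linalg.norm(grad_mean)

for n in (100, 1_000, 10_000):
    g_sum, g_mean = gradient_norms(n)
    print(f"n={n:>6}: summed |grad| = {g_sum:10.2f}, averaged |grad| = {g_mean:.2f}")
```

This is exactly why the effective step size ends up coupled to the dataset size.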

akhilvasvani commented 5 years ago

Ok, so I'm a tad confused. According to http://web.mit.edu/zoya/www/linearRegression.pdf:

the MSE for the Linear Regression with regularization is:

$$ J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \left[ y^{(i)} - \hat{\theta}^{\top} \hat{x}^{(i)} \right]^{2} + \frac{\lambda}{2} \lVert \theta \rVert^{2}. $$

The $\frac{1}{n}$ is not distributed over the whole expression, so using np.mean(0.5 * (y - y_pred)**2 + self.regularization(self.w)) does not really make sense to me from what I read.
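
For reference, differentiating that cost with respect to $\theta$ (hats dropped for readability) gives

$$ \nabla_{\theta} J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} - \theta^{\top} x^{(i)} \right] x^{(i)} + \lambda \theta, $$

so the $\frac{1}{n}$ scales the data term of the gradient but not the regularization gradient, which is what the fix below implements.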

I guess he's using np.mean as a shortcut for summing over $n$ and then multiplying by $\frac{1}{n}$.

So here's an easy fix:

mse = np.mean(0.5 * (y - y_pred)**2) + self.regularization(self.w)

grad_w = -(y - y_pred).dot(X) / len(y) + self.regularization.grad(self.w)
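
For context, here is a sketch of how the fixed lines might sit in the repo's gradient-descent loop (the surrounding loop is paraphrased from regression.py near the linked line; attribute names such as self.n_iterations and self.learning_rate are assumptions, not verbatim source):

```python
# Sketch only: the loop body around the fix, not the verbatim repo code
for _ in range(self.n_iterations):
    y_pred = X.dot(self.w)
    # Keep the regularization outside the mean so 1/n scales only the data term
    mse = np.mean(0.5 * (y - y_pred)**2) + self.regularization(self.w)
    self.training_errors.append(mse)
    # Divide the data gradient by the number of samples instead of just summing
    grad_w = -(y - y_pred).dot(X) / len(y) + self.regularization.grad(self.w)
    self.w -= self.learning_rate * grad_w
```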

Food for thought: what's the difference in computation time between using np.mean and summing over all training examples and then multiplying by $\frac{1}{n}$?
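
A quick sketch to measure it (the array size is arbitrary; since np.mean is essentially a sum followed by a single division, the two timings should be nearly identical):

```python
import numpy as np
import timeit

x = np.random.rand(1_000_000)

t_mean = timeit.timeit(lambda: np.mean(x), number=100)
t_sum = timeit.timeit(lambda: np.sum(x) / x.size, number=100)

print(f"np.mean    : {t_mean:.4f} s for 100 runs")
print(f"np.sum / n : {t_sum:.4f} s for 100 runs")
```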