Ok, so I'm a tad bit confused. According to http://web.mit.edu/zoya/www/linearRegression.pdf:
the MSE for linear regression with regularization is:
$$ J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \big[ y^{(i)} - \hat{\theta}^{\top} \hat{x}^{(i)} \big]^{2} + \frac{\lambda}{2} \|\theta\|^{2}. $$

The $\frac{1}{n}$ is not distributed over the whole expression, so using

np.mean(0.5 * (y - y_pred)**2 + self.regularization(self.w))

does not really make sense to me from what I read.
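For reference, differentiating the $J(\theta)$ above (keeping the hat notation for the augmented input/parameter vectors) gives a gradient whose data term also carries the $\frac{1}{n}$, while the regularizer does not:

$$ \nabla_{\theta} J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \big[ y^{(i)} - \hat{\theta}^{\top} \hat{x}^{(i)} \big] \, \hat{x}^{(i)} + \lambda \theta. $$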
I guess he's using `np.mean` as a shortcut for summing over $n$ and then multiplying by $\frac{1}{n}$.
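A minimal sketch of that equivalence (the `errors` array is just a stand-in for the per-sample squared-error terms):

```python
import numpy as np

# Stand-in per-sample squared-error terms, 0.5 * (y - y_pred)**2
errors = 0.5 * (np.array([3.0, 1.0, 2.0]) - np.array([2.5, 0.0, 2.0])) ** 2

# np.mean is exactly sum-then-divide-by-n
assert np.isclose(np.mean(errors), np.sum(errors) / len(errors))
```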
So here's an easy fix:

mse = np.mean(0.5 * (y - y_pred)**2) + self.regularization(self.w)
grad_w = -(y - y_pred).dot(X) / len(y) + self.regularization.grad(self.w)

(Note that applying `np.mean` to the dot product would average over the weight dimensions rather than the training examples, so the data term of the gradient is divided by `len(y)` explicitly.)
Food for thought: what's the difference in computation time between using `np.mean` and summing over all training examples and then multiplying by $\frac{1}{n}$?
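A quick way to check is a micro-benchmark with `timeit` (the array size here is arbitrary; since `np.mean` is essentially a sum followed by a division, I'd expect the two to be nearly identical):

```python
import timeit
import numpy as np

errors = np.random.rand(1_000_000)  # stand-in per-sample loss terms
n = len(errors)

t_mean = timeit.timeit(lambda: np.mean(errors), number=1000)
t_sum = timeit.timeit(lambda: np.sum(errors) / n, number=1000)
print(f"np.mean:  {t_mean:.4f}s")
print(f"np.sum/n: {t_sum:.4f}s")
```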
Since the gradient was derived from the MSE, it should be divided by the number of training examples. However, this factor is omitted in the code. That makes the effective learning rate dependent on the number of training examples: the more training examples you have, the larger the gradient becomes. The learning rate should not depend on the training set size (see the sketch after the link below).
https://github.com/eriklindernoren/ML-From-Scratch/blob/f078fc384e3188922431e6747eefaa1561f361c4/mlfromscratch/supervised_learning/regression.py#L76
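To illustrate the scaling issue, here's a toy sketch (the data and `w_true` are made up for illustration) comparing the summed gradient, as in the linked code, with the averaged one as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])    # hypothetical ground-truth weights

for n in (100, 1_000, 10_000):
    X = rng.normal(size=(n, 3))
    y = X.dot(w_true)                   # noiseless linear targets
    w = np.zeros(3)                     # weights at the start of training
    grad_sum = -(y - X.dot(w)).dot(X)   # data term of the gradient, summed over samples
    grad_mean = grad_sum / n            # the same term averaged over samples
    print(n, np.linalg.norm(grad_sum), np.linalg.norm(grad_mean))
```

The norm of the summed gradient grows roughly linearly with $n$, while the averaged one stays on the same scale, so a fixed learning rate keeps behaving the same as the dataset grows.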