Thanks!
Some other things that popped up:
(In the video I also put the origin at the intersection point. Since this is just an affine hyperplane, I can do that by subtracting the supporting vector w_0, but in hindsight that might be less clear :). The simpler explanation above might be better.)
Hmm, that figure was meant to illustrate the more general case of a convex function with linear constraints, but you're probably right that it's easier to follow if I demonstrate this specific case. I'll try to replace that figure with a plot.
Yes, using the same standardized data, the coefficients of the overfitted model would be larger than those of a well-fitted model. (That would actually still hold if the data were not standardized.) I don't know an absolute definition of 'large' in this case; that depends entirely on the data...
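As a rough illustration of that point, here is a minimal sketch assuming scikit-learn; the toy data, the polynomial degree, and the ridge penalty are all made up for the example, not taken from the course:

```python
# Illustrative only: on the same standardized data, an overfitted
# high-degree polynomial model tends to have much larger coefficients
# than the same model with an L2 penalty keeping the weights small.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 1))
y = np.sin(3 * X).ravel() + rng.normal(scale=0.1, size=20)

# High-degree polynomial with no regularization: prone to overfitting.
overfit = make_pipeline(PolynomialFeatures(degree=15, include_bias=False),
                        StandardScaler(), LinearRegression())
overfit.fit(X, y)

# Same features, but with an L2 penalty on the coefficients.
ridge = make_pipeline(PolynomialFeatures(degree=15, include_bias=False),
                      StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X, y)

print("||w|| unregularized:", np.linalg.norm(overfit[-1].coef_))
print("||w|| ridge:        ", np.linalg.norm(ridge[-1].coef_))
```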
Some questions and suggestions came to mind when I read about the gradient descent method:
In section Gradient Descent, I find the formulation of the exponential decay of the learning rate a little bit odd. I would suggest expressing \eta_s in terms of \eta_0 instead.
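For instance, something along these lines (my notation; \lambda and \gamma stand for whatever decay constant the text already uses):

```latex
% exponential decay expressed directly in terms of \eta_0:
\eta_s = \eta_0 \, e^{-\lambda s}
% or equivalently, with a per-step decay factor 0 < \gamma < 1:
\eta_s = \eta_0 \, \gamma^{s}
```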
In section Stochastic Gradient Descent (SGD), I believe the summation index and the normalization should be made consistent: if the index starts at i = 0 (so the sum runs over n + 1 terms), it would make more sense to divide by n + 1 when averaging the individual losses. The same goes for the other two sums in that part.
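For reference, the two self-consistent conventions would be (my notation; \ell_i denotes the loss on the i-th example):

```latex
% n terms, indexed from 1:
L(w) = \frac{1}{n} \sum_{i=1}^{n} \ell_i(w)
% n + 1 terms, indexed from 0:
L(w) = \frac{1}{n+1} \sum_{i=0}^{n} \ell_i(w)
```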
Furthermore, the "incremental gradient" method looks a lot like the SAG method described here rather than the incremental aggregated gradient (IAG) method from this paper, which I found confusing (see the sketch after the next point). I also found the SAGA algorithm. Maybe adding some of these references would be helpful to other students.
Another suggestion would be to change "random i" to "if i = i_s" and to add "with i_s randomly chosen per iteration".
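To make both points concrete, here is a minimal sketch of the SAG-style update as I understand it from the Schmidt et al. paper; this is my own illustration, not code from the course, and all names (grad_i, eta, etc.) are hypothetical:

```python
# Minimal SAG-style sketch: one stored gradient per data point, and only
# the entry with i = i_s is refreshed in each iteration.
import numpy as np

def sag(grad_i, w0, n, eta=0.1, iterations=1000, seed=0):
    """grad_i(w, i) returns the gradient of the i-th individual loss at w."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    g = np.zeros((n,) + w0.shape)  # stored gradient for each data point
    g_sum = np.zeros_like(w0)      # running sum of the stored gradients
    for s in range(iterations):
        i_s = rng.integers(n)      # i_s randomly chosen per iteration
        g_new = grad_i(w, i_s)
        # Only g[i_s] is updated; all other entries keep their
        # previously stored (possibly stale) gradients.
        g_sum += g_new - g[i_s]
        g[i_s] = g_new
        w -= eta * g_sum / n       # step along the average stored gradient
    return w
```

As far as I understand, the essential difference is that IAG cycles through i_s deterministically while SAG samples it at random, and SAGA modifies the step direction to g_new - g[i_s] + mean(g) so that it is an unbiased estimate of the full gradient.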