davidrosenberg / mlcourse

Machine learning course materials.
https://davidrosenberg.github.io/ml2018

Neural Networks Potentially Underemphasized #7

Closed: brett1479 closed this issue 6 years ago

brett1479 commented 7 years ago

I don't know the correct answer to this question, or whether neural networks are best left to other courses. Given that deep networks are a particularly hot field in machine learning right now, I am wondering whether a student finishing this course should have had more exposure to the concepts (I agree they shouldn't need mastery).

davidrosenberg commented 7 years ago

Yeah, I think about this a lot. A lot of the most interesting neural network work involves structured output spaces (e.g. sequences), which we only barely get to in this class, so it's probably beyond our scope. Convolutional neural networks aren't that difficult to grasp, but they still don't quite seem core or fundamental enough to spend a full hour on in this class, and anything less could be stuffed into the neural networks overview.

So that leaves us with standard feedforward networks / multilayer perceptrons. I think if we teach back-propagation as our strategy for computing gradients in code, then building a neural network becomes a straightforward extension of the code for regression. If we could write a corresponding homework problem, or an entire assignment, that really gets people to understand what's going on, that would be great. We could also discuss the approximation properties of neural networks.
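
For concreteness, here's a rough sketch (my own toy code, not anything currently in the repo) of what that extension might look like: a one-hidden-layer network for regression, with the gradients coded by hand, layer by layer. Names, shapes, and hyperparameters are placeholders.

```python
# Toy sketch (not course code): one-hidden-layer network for regression,
# trained with manually coded back-propagation. Structurally it's the
# gradient-descent code for linear regression plus one extra layer and
# one extra application of the chain rule.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = sin(x) + noise
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X) + 0.1 * rng.standard_normal((200, 1))

# Parameters for a 1 -> h -> 1 network with tanh hidden units
h = 20
W1 = 0.5 * rng.standard_normal((1, h))
b1 = np.zeros(h)
W2 = 0.5 * rng.standard_normal((h, 1))
b2 = np.zeros(1)

lr = 0.1
n = X.shape[0]
for step in range(5000):
    # Forward pass
    A = np.tanh(X @ W1 + b1)      # hidden activations, shape (n, h)
    pred = A @ W2 + b2            # predictions, shape (n, 1)
    resid = pred - y
    loss = np.mean(resid ** 2)

    # Backward pass (back-propagation) for mean squared error
    d_pred = 2 * resid / n        # dL/dpred
    dW2 = A.T @ d_pred
    db2 = d_pred.sum(axis=0)
    dA = d_pred @ W2.T
    dZ = dA * (1 - A ** 2)        # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dZ
    db1 = dZ.sum(axis=0)

    # Gradient step
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(f"final training MSE: {loss:.4f}")
```

The forward/backward structure is just the chain rule applied once per layer, so the jump from the regression code is mostly bookkeeping.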

Can our review treatment of computing gradients be absorbed into a back-propagation presentation? See, for example, Percy Liang's slides on backprop, and Karpathy's notes and YouTube videos.
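
As one possible framing for that merge, here's a tiny hand-worked example in the spirit of those presentations (the function and variable names are mine): the usual "compute the gradient of a composite function" review exercise, written as a forward pass that records intermediates and a backward pass that multiplies local derivatives.

```python
# Hedged sketch: the gradient-review exercise recast as back-propagation
# on a tiny computation graph, for L = (sigmoid(w*x + b) - y)^2.
import math

# Forward pass, recording intermediates
x, y = 1.5, 0.0
w, b = 0.8, -0.3
z = w * x + b                    # linear unit
p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
r = p - y                        # residual
L = r ** 2                       # squared loss

# Backward pass: multiply local derivatives from output back to inputs
dL_dr = 2 * r
dL_dp = dL_dr * 1.0              # r = p - y
dL_dz = dL_dp * p * (1 - p)      # sigmoid'(z) = p * (1 - p)
dL_dw = dL_dz * x                # z = w*x + b
dL_db = dL_dz * 1.0

print(dL_dw, dL_db)
```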

brett1479 commented 7 years ago

Correct me if I'm wrong, but I think the gradient computation review is the first lab. If we incorporate back-propagation and neural networks, it seems like the NN material would be on an island that we don't revisit till the end of the course. Maybe you had something else in mind.

davidrosenberg commented 7 years ago

If we spend a full week on neural nets, we might have time to get into the very important issues of parameter initialization, vanishing/exploding gradients, batch normalization (and related ideas?)... just to give a flavor of the math involved. We could also give the intuition for why stochastic methods seem to generalize better than batch methods: some empirical investigations suggest SGD finds local minima in wide troughs, while batch methods find minima in narrow troughs, along with hand-wavy explanations for why wide is better (stability: the difference between the training data and the test distribution corresponds to a shift in the objective function, which can make a minimum sitting in a narrow trough perform badly on the population).
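
To give a flavor of the initialization point, here's a quick toy experiment (mine, not course material): push a random batch through a deep stack of random linear layers and watch the signal scale collapse or blow up depending on the weight scale. The same multiplicative effect hits the gradients in the backward pass.

```python
# Toy illustration (not course code) of vanishing/exploding signals:
# the forward signal through a deep stack of random linear layers shrinks
# or blows up geometrically unless the weight scale is chosen carefully
# (here, roughly 1/sqrt(d) keeps the scale stable).
import numpy as np

rng = np.random.default_rng(0)
n, d, depth = 256, 100, 50
X = rng.standard_normal((n, d))

for sigma in (0.05, 1.0 / np.sqrt(d), 0.2):
    A = X
    for _ in range(depth):
        W = sigma * rng.standard_normal((d, d))
        A = A @ W
    print(f"init std {sigma:.3f}: signal std after {depth} layers = {A.std():.2e}")
```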