PyDataPune / Talks

Official repo for proposals

Optimizers: Beyond Simple Gradient Descent #32

Open vkaustubh opened 5 years ago

vkaustubh commented 5 years ago

An introductory book on machine learning typically motivates a cost function and then explains, on a simple convex surface, how gradient descent finds its minimum. But we all know that when using high-level APIs such as Keras or Scikit-Learn, we use much more than simple gradient descent. We casually write optimizer='adam' or optimizer='rmsprop' and the magic happens.
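For concreteness, this is the kind of one-liner the talk is about. A minimal sketch (assuming TensorFlow 2.x / tf.keras; the model and shapes are made up for illustration):

```python
import tensorflow as tf

# A toy model; the architecture is irrelevant to the point being made.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1),
])

# Swapping 'adam' for 'rmsprop' or 'sgd' changes the entire
# parameter-update rule, even though nothing else in the code changes.
model.compile(optimizer='adam', loss='mse')
```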

But what are 'adam' and 'rmsprop'? Why not be content with plain gradient descent or stochastic gradient descent? I propose a talk that starts by defining cost functions and the need to minimize them, covers the basics of gradient descent, and then points out the problems of this simple algorithm, problems that become even more severe when training deep networks. From there we can gradually introduce the ideas that overcome them, eventually arriving at the state-of-the-art default optimizers working behind the scenes today.
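To make the contrast concrete, here is a rough NumPy sketch (using a toy, badly scaled quadratic f(x, y) = 0.5·(100x² + y²) chosen purely for illustration, not part of the proposal) comparing the vanilla gradient descent update with the Adam update rule of Kingma & Ba:

```python
import numpy as np

def grad(w):
    # Gradient of f(x, y) = 0.5 * (100 * x**2 + y**2)
    return np.array([100.0 * w[0], w[1]])

# Vanilla gradient descent: a single learning rate for every direction.
# It must stay small to avoid diverging along the steep x-axis, so
# progress along the shallow y-axis is slow.
w = np.array([1.0, 1.0])
lr = 0.01
for _ in range(200):
    w = w - lr * grad(w)
print("gradient descent:", w)   # y decays slowly: ~0.13 after 200 steps

# Adam: per-parameter step sizes built from running estimates of the
# first and second moments of the gradient, with bias correction.
w = np.array([1.0, 1.0])
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 201):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
print("adam:", w)               # both coordinates end up near 0
```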

Proposed Mode of Delivery: Whiteboard
Proposed Duration: 45-60 minutes
Background Required: An awareness of machine learning basics, basic calculus (can be covered if needed).