NicolaBernini / PapersAnalysis

Analysis, summaries, cheatsheets about relevant papers

Reading - Deep Learning without Poor Local Minima #28

Open NicolaBernini opened 4 years ago

NicolaBernini commented 4 years ago

Overview

Reading "Deep Learning without Poor Local Minima": a paper which achieved very interesting theoretical results on DNNs back in 2016

The abstract is very interesting


NOTE

For the best rendering, please install and activate the "Tex all the things" Chrome plugin, which provides browser-side math rendering

If it is active, you should see the following inline math $a=b$ and the display equation

$$ a x^{2} + b x + c = 0 \quad x \in \mathbb{R} $$

rendered correctly

NicolaBernini commented 4 years ago

History of Deep Learning Training

TRAINING A 3-NODE NEURAL NETWORK IS NP-COMPLETE

so training deep NNs was classified as an intractable problem, at least with the technology available at that time

However, in practice DNNs are trained successfully with gradient-based methods: this means there is no actual need to find the global minimum to make the DNN work, as a local minimum is typically good enough for most practical purposes

This is, however, only empirical evidence: in practice, when we train a DNN and it works well, we of course cannot assume we have found the actual global minimum (finding it is NP-complete), so we conclude we have found a local minimum, and that is good enough.

Classification of critical points

So to summarise, this means that in a DNN loss landscape we have the following categories of critical points:

- global minima
- local minima (which may or may not coincide with global minima)
- saddle points (including the "bad" ones discussed below)
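As a reference, here is a minimal sketch of the standard definitions for a loss $L(\theta)$ with parameters $\theta$ (the notation here is mine, not the paper's):

$$ \nabla L(\theta^{\ast}) = 0 \quad \text{(critical point)} $$

$$ L(\theta^{\ast}) \le L(\theta) \ \forall \theta \ \text{in a neighborhood of} \ \theta^{\ast} \quad \text{(local minimum)} $$

$$ L(\theta^{\ast}) \le L(\theta) \ \forall \theta \quad \text{(global minimum)} $$

A critical point which is neither a local minimum nor a local maximum is a saddle point; at a "generic" saddle the Hessian $\nabla^{2} L(\theta^{\ast})$ has at least one negative eigenvalue, which provides an escape direction.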

The loss landscape properties - Paper contribution

This paper is focused on proving properties of the loss landscape, and more specifically of its minima, starting from a set of assumptions on the model

In this work, under the assumptions stated in the paper, one of the core results is that there are no suboptimal local minima; in fact, quoting the paper:

every local minimum is a global minimum
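To make the statement concrete, here is a sketch of the deep linear setting analyzed in the paper (simplified notation; the exact assumptions and statement are in the paper):

$$ L(W_1, \dots, W_H) = \frac{1}{2} \lVert W_H W_{H-1} \cdots W_1 X - Y \rVert_F^{2} $$

where $X$ is the matrix of training inputs, $Y$ the matrix of targets and $W_1, \dots, W_H$ the layer weight matrices. For this loss the paper shows, under mild rank conditions on the data, that it is non-convex and non-concave, that every local minimum is a global minimum, and that every other critical point is a saddle point.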

NOTE

We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.

So here is a theoretical explanation of why training a DNN with a squared loss function is not so hard after all: every local minimum is a global minimum, so in this sense global minima are abundant in the landscape.

The NP-completeness result concerns finding one specific global minimum among all the possible ones, which is not what matters in practice.

The paper sketches the proof in Section 4.3.1, following a case-by-case approach and showing that, every time a point satisfies the local-minimum definition, then in that model context it also satisfies the global-minimum definition
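As a purely empirical sanity check of this claim, one can train a toy 2-layer linear network with plain gradient descent and compare the final squared loss with the closed-form least-squares optimum (the data, dimensions and hyperparameters below are made up for illustration):

```python
# Toy empirical check: gradient descent on a 2-layer *linear* network with squared loss
# typically reaches the same loss as the unconstrained least-squares optimum,
# consistent with "every local minimum is a global minimum" for deep linear nets.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_hid, d_out = 200, 3, 3, 2

X = rng.normal(size=(n, d_in))
A_true = rng.normal(size=(d_in, d_out))
Y = X @ A_true + 0.05 * rng.normal(size=(n, d_out))

# Global optimum of the squared loss over the product W1 @ W2
# (here the product can represent any d_in x d_out matrix, so it equals plain least squares)
P_star, *_ = np.linalg.lstsq(X, Y, rcond=None)
loss = lambda P: 0.5 * np.mean((X @ P - Y) ** 2)

# Gradient descent on the factored (deep linear) parameterization
W1 = np.eye(d_in, d_hid) + 0.01 * rng.normal(size=(d_in, d_hid))
W2 = 0.1 * rng.normal(size=(d_hid, d_out))
lr = 0.1
for _ in range(5000):
    R = (X @ W1 @ W2 - Y) / n          # residual, scaled by 1/n
    gW1 = X.T @ R @ W2.T               # gradient w.r.t. W1
    gW2 = W1.T @ X.T @ R               # gradient w.r.t. W2
    W1 -= lr * gW1
    W2 -= lr * gW2

print("GD loss:           ", loss(W1 @ W2))
print("least-squares loss:", loss(P_star))  # the two values should be very close
```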

Understanding the loss function

Understanding the loss function landscape is important in order to design fast and efficient optimization methods to navigate it and find good local minima

Good Minima

In the context of Deep Learning, it is important to specify that a minimum is good when it allows the DNN to generalize beyond the training set (it is all about generalization)

Generalization cannot be checked during training, as by definition it involves data which is not in the training set, so it can only be checked after training, on the test set.

Nevertheless, this approach apparently works well, so a quite surprising piece of empirical evidence is that it is not only the global minimum (and local minima close enough to it) that is good: there are plenty of local minima in the landscape which allow the NN to work well enough, i.e. to generalize well.
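As a toy illustration of this point (made-up data, unrelated to the paper's setting): the quantity the optimizer minimizes is the training loss, while the quality of the found minimum is judged on held-out data.

```python
# Minimal sketch: a minimum is judged "good" on held-out data, not on training loss.
# Toy linear regression with numpy; the data and split sizes are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = x @ w_true + noise
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=200)

# Train / test split
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Reach a minimum of the squared training loss (closed-form least squares here)
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def mse(A, b):
    return float(np.mean((A @ w_hat - b) ** 2))

print("train MSE:", mse(X_train, y_train))  # what the optimizer sees during training
print("test  MSE:", mse(X_test, y_test))    # the ex-post check that the minimum generalizes
```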

Loss Function - Source of Hardness

How to approach the theoretical study of DNNs, in order of growing complexity (the three model families are sketched right after this list):

  1. Shallow Linear NN
  2. Shallow Non-Linear NN
  3. Deep Linear and Non-Linear NN
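Concretely, the three model families can be sketched as follows (my notation, with $\sigma$ denoting a nonlinearity such as a ReLU or a sigmoid):

$$ \hat{y} = W_2 W_1 x \quad \text{(shallow linear)} $$

$$ \hat{y} = W_2 \, \sigma(W_1 x) \quad \text{(shallow non-linear)} $$

$$ \hat{y} = W_H \, \sigma( W_{H-1} \cdots \sigma(W_1 x) ) \quad \text{(deep, linear when } \sigma = \mathrm{id} \text{)} $$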

Challenges in the Navigation of the Loss Landscape

The complexity of this optimization process comes from the fact that the loss function landscape is highly non-convex. On one hand there are plenty of good local minima, i.e. weight configurations which make the DNN generalize well: there are quite a lot of "low hanging fruits", which is one key factor explaining the success of the first generation of DNNs, since they turned out to be quite easy to train with the current technology (GPUs, memory, ...). On the other hand there are also saddle points making the navigation harder, especially "bad saddle points", i.e. saddle points where the Hessian has no negative eigenvalue, so that locally they look like flat basins and curvature information does not provide an escape direction.
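As a concrete textbook example of such a bad saddle point (not taken from the paper):

$$ f(x, y) = x^{2} + y^{3}, \qquad \nabla f(0, 0) = (0, 0), \qquad \nabla^{2} f(0, 0) = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix} $$

The Hessian at the origin has eigenvalues $\{2, 0\}$, i.e. no negative eigenvalue, yet the origin is not a local minimum since $f(0, -\epsilon) = -\epsilon^{3} < 0$: a second-order method sees no descent direction even though one exists.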
