NicolaBernini / PapersAnalysis

Analysis, summaries, cheatsheets about relevant papers

Reading - Deep Learning without Poor Local Minima #28

Open NicolaBernini opened 4 years ago

NicolaBernini commented 4 years ago

Overview

Reading "Deep Learning without Poor Local Minima": a paper which achieved very interesting theoretical results on DNNs back in 2016

The abstract is very interesting


NOTE

For the best rendering, please install and activate the "Tex all the things" Chrome plugin, which provides browser-side math rendering

If it is active, you should see the following inline math $a=b$ and the display equation

$$ a x^{2} + b x + c = 0 \quad x \in \mathbb{R} $$

rendered correctly

NicolaBernini commented 4 years ago

History of Deep Learning Training

TRAINING A 3-NODE NEURAL NETWORK IS NP-COMPLETE

so training deep NNs was classified as an intractable problem, at least with the technology available at that time

However, in practice DNNs are trained successfully with gradient-based methods: this means there is no actual need to find the global minimum to make the DNN work, as a local minimum is typically good enough for most practical purposes

This is, however, only empirical evidence: in practice, when we train a DNN and it works well, we of course cannot assume we have found the actual global minimum (finding it is NP-complete), so we conclude we have found a local minimum, and that is good enough.

Classification of critical points

So to summarise, this means that in a DNN loss landscape we have the following categories of critical points:

- global minima
- local minima (which may or may not coincide with global minima)
- saddle points (including the "bad" ones discussed below)
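As a reference, here is a minimal sketch of the standard definitions for a loss $L(\theta)$ with parameters $\theta$ (the notation here is mine, not the paper's):

$$ \nabla L(\theta^{\ast}) = 0 \quad \text{(critical point)} $$

$$ L(\theta^{\ast}) \le L(\theta) \ \forall \theta \ \text{in a neighborhood of} \ \theta^{\ast} \quad \text{(local minimum)} $$

$$ L(\theta^{\ast}) \le L(\theta) \ \forall \theta \quad \text{(global minimum)} $$

A critical point which is neither a local minimum nor a local maximum is a saddle point; at a "generic" saddle the Hessian $\nabla^{2} L(\theta^{\ast})$ has at least one negative eigenvalue, which provides an escape direction.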

The loss landscape properties - Paper contribution

This paper is focused on proving properties of the loss landscape, and more specifically of its minima, starting from a set of assumptions on the model

In this work, under the assumptions stated in the paper, one of the core results is that there are no suboptimal local minima; in fact, quoting the paper:

every local minimum is a global minimum
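To make the statement concrete, here is a sketch of the deep linear setting analyzed in the paper (simplified notation; the exact assumptions and statement are in the paper):

$$ L(W_1, \dots, W_H) = \frac{1}{2} \lVert W_H W_{H-1} \cdots W_1 X - Y \rVert_F^{2} $$

where $X$ is the matrix of training inputs, $Y$ the matrix of targets and $W_1, \dots, W_H$ the layer weight matrices. For this loss the paper shows, under mild rank conditions on the data, that it is non-convex and non-concave, that every local minimum is a global minimum, and that every other critical point is a saddle point.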

NOTE

We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.

So here is a theoretical explanation of why training a DNN with a squared loss function is not so hard after all: every local minimum is a global minimum, so in this sense global minima are abundant in the landscape.

The NP-completeness result concerns finding one specific global minimum among all the possible ones, which is not what matters in practice.

The paper sketches the proof in Section 4.3.1, following a case-by-case approach and showing that, every time a point satisfies the local-minimum definition, then in that model context it also satisfies the global-minimum definition
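As a purely empirical sanity check of this claim, one can train a toy 2-layer linear network with plain gradient descent and compare the final squared loss with the closed-form least-squares optimum (the data, dimensions and hyperparameters below are made up for illustration):

```python
# Toy empirical check: gradient descent on a 2-layer *linear* network with squared loss
# typically reaches the same loss as the unconstrained least-squares optimum,
# consistent with "every local minimum is a global minimum" for deep linear nets.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_hid, d_out = 200, 3, 3, 2

X = rng.normal(size=(n, d_in))
A_true = rng.normal(size=(d_in, d_out))
Y = X @ A_true + 0.05 * rng.normal(size=(n, d_out))

# Global optimum of the squared loss over the product W1 @ W2
# (here the product can represent any d_in x d_out matrix, so it equals plain least squares)
P_star, *_ = np.linalg.lstsq(X, Y, rcond=None)
loss = lambda P: 0.5 * np.mean((X @ P - Y) ** 2)

# Gradient descent on the factored (deep linear) parameterization
W1 = np.eye(d_in, d_hid) + 0.01 * rng.normal(size=(d_in, d_hid))
W2 = 0.1 * rng.normal(size=(d_hid, d_out))
lr = 0.1
for _ in range(5000):
    R = (X @ W1 @ W2 - Y) / n          # residual, scaled by 1/n
    gW1 = X.T @ R @ W2.T               # gradient w.r.t. W1
    gW2 = W1.T @ X.T @ R               # gradient w.r.t. W2
    W1 -= lr * gW1
    W2 -= lr * gW2

print("GD loss:           ", loss(W1 @ W2))
print("least-squares loss:", loss(P_star))  # the two values should be very close
```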

Understanding the loss function

Understanding the loss function landscape is important in order to design fast and efficient optimization methods to navigate it and find good local minima

Good Minima

In the context of Deep Learning, it is important to specify that a minimum is good when it allows the DNN to generalize beyond the training set (it is all about generalization)

Generalization cannot be checked during training, as by definition it involves data which is not in the training set, so it can only be checked after training, on the test set.

Nevertheless, this approach apparently works well, so a quite surprising piece of empirical evidence is that it is not only the global minimum (and local minima close enough to it) that is good: there are plenty of local minima in the landscape which allow the NN to work well enough, i.e. to generalize well.
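As a toy illustration of this point (made-up data, unrelated to the paper's setting): the quantity the optimizer minimizes is the training loss, while the quality of the found minimum is judged on held-out data.

```python
# Minimal sketch: a minimum is judged "good" on held-out data, not on training loss.
# Toy linear regression with numpy; the data and split sizes are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = x @ w_true + noise
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=200)

# Train / test split
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Reach a minimum of the squared training loss (closed-form least squares here)
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def mse(A, b):
    return float(np.mean((A @ w_hat - b) ** 2))

print("train MSE:", mse(X_train, y_train))  # what the optimizer sees during training
print("test  MSE:", mse(X_test, y_test))    # the ex-post check that the minimum generalizes
```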

Loss Function - Source of Hardness

How to approach the theoretical study of DNNs, in order of growing complexity (the three model families are sketched right after this list):

  1. Shallow Linear NN
  2. Shallow Non-Linear NN
  3. Deep Linear and Non-Linear NN
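Concretely, the three model families can be sketched as follows (my notation, with $\sigma$ denoting a nonlinearity such as a ReLU or a sigmoid):

$$ \hat{y} = W_2 W_1 x \quad \text{(shallow linear)} $$

$$ \hat{y} = W_2 \, \sigma(W_1 x) \quad \text{(shallow non-linear)} $$

$$ \hat{y} = W_H \, \sigma( W_{H-1} \cdots \sigma(W_1 x) ) \quad \text{(deep, linear when } \sigma = \mathrm{id} \text{)} $$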

Challenges in the Navigation of the Loss Landscape

The complexity of this optimization process comes from the fact that the loss function landscape is highly non-convex. On one hand there are plenty of good local minima, i.e. weight configurations which make the DNN generalize well: there are quite a lot of "low hanging fruits", which is one key factor explaining the success of the first generation of DNNs, since they turned out to be quite easy to train with the current technology (GPUs, memory, ...). On the other hand there are also saddle points making the navigation harder, especially "bad saddle points", i.e. saddle points where the Hessian has no negative eigenvalue, so that locally they look like flat basins and curvature information does not provide an escape direction.
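As a concrete textbook example of such a bad saddle point (not taken from the paper):

$$ f(x, y) = x^{2} + y^{3}, \qquad \nabla f(0, 0) = (0, 0), \qquad \nabla^{2} f(0, 0) = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix} $$

The Hessian at the origin has eigenvalues $\{2, 0\}$, i.e. no negative eigenvalue, yet the origin is not a local minimum since $f(0, -\epsilon) = -\epsilon^{3} < 0$: a second-order method sees no descent direction even though one exists.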
