CMU-10707 Deep Learning.

NorbertZheng commented 2 years ago

Related Reference

Goodfellow I, Bengio Y, Courville A. [Deep learning]().
Bishop C M, Nasrabadi N M. [Pattern recognition and machine learning]().
Murphy K P. [Machine learning: a probabilistic perspective]().
Hastie T, Tibshirani R, Friedman J H, et al. [The elements of statistical learning: data mining, inference, and prediction]().
MacKay D J C, Mac Kay D J C. [Information theory, inference and learning algorithms]().
Bruna J. Stat212b: Topics Course on Deep Learning.
Larochelle H. Online Course on Neural Networks.
Anonymous. Deep Learning Summer School @ Montreal, Canada.

NorbertZheng commented 2 years ago

Introduction to Machine Learning, Regression

Salakhutdinov R. Introduction to Machine Learning, Regression.

NorbertZheng commented 2 years ago

Important Breakthroughs

Deep Belief Networks, 2006 (Unsupervised)
- Hinton G E, Osindero S, Teh Y W. A fast learning algorithm for deep belief nets.
- Theoretical breakthrough: Adding additional layers improves variational lower-bound.
- Efficient learning and inference with multiple layers:
  - Efficient greedy layer-by-layer learning algorithm.
  - Inferring the states of the hidden variables in the top most layer is easy,
Deep Convolutional Nets for Vision (Supervised).
- Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks.
Deep Nets for Speech (Supervised).
- Hinton G, Deng L, Yu D, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.

NorbertZheng commented 2 years ago

Representation Learning

Examples of Representation Learning

How to perform feature transformation on the data so that the data can be linearly separable in the feature space (feature transformation of data vectors, such as polynomial features, kernel methods, etc.) for the following learning algorithms.

Feature Engineering

Computer Vision Features.
Audio Features.
Representation Learning: Can we automatically learn these representations?

NorbertZheng commented 2 years ago

Types of Learning

Consider observing a series of input vectors:

$$ \mathbf{x}{1},\mathbf{x}{2},\mathbf{x}{3},\mathbf{x}{4},\cdot\cdot\cdot $$

Supervised Learning: We are also given target outputs (labels, responses), $y{1},y{2},\cdot\cdot\cdot$, and the goal is to predict correct output given a new input.
- Classification: Target outputs $y_{i}$ are discrete class labels. The goal is to correctly classify new inputs.
- Regression: Target outputs $y_{i}$ are continuous. The goal is to predict the output given new inputs.
Unsupervised Learning: The goal is to build a statistical model of $\mathbf{x}$, which finds useful representation of data.
- Clustering.
- Dimensionality reduction.
- Modeling the data density.
- Finding hidden causes (useful explanation) of the data.
Reinforcement Learning: The model (agent) produces a set of actions, $a{1},a{2},\cdot\cdot\cdot$ that affect the state of the world, and receives rewards $r{1},r{2},\cdot\cdot\cdot$. The goal is to learn actions that maximize the reward.
Semi-supervised Learning: We are given only a limited amount of labels, but lots of unlabeled data.

NorbertZheng commented 2 years ago

Neural Networks I

Salakhutdinov R. Neural Networks I.

NorbertZheng commented 2 years ago

Feed-forward Neural Networks

NorbertZheng commented 2 years ago

Activation Function

Range is determined by $g(\cdot)$. Bias only changes the position of the riff.

Linear activation function: $g(a)=a$.
- No nonlinear transformation.
- No input squashing.
Sigmoid activation function: $g(a)=sigm(a)=\frac{1}{1+\exp(-a)}$.
- Squashes the neuron's output between 0 and 1.
- Always positive.
- Bounded.
- Strictly increasing.
Hyperbolic tangent ("tanh") activation function: $g(a)=tanh(a)=\frac{\exp(a)-\exp(-a)}{\exp(a)+\exp(-a)}=\frac{\exp(2a)-1}{\exp(2a)+1}$.
- Squashes the neuron's activation between -1 and 1.
- Can be positive or negative.
- Bounded.
- Strictly increasing (wrong plot).
Rectified linear (ReLU) activation function: $g(a)=reclin(a)=max(0,a)$.
- Bounded below by 0 (always non-negative).
- Tends to produce units with sparse activities.
- Not upper bounded.
- Strictly increasing.
Softmax Activation Function: $g(a)=softmax(a)=\left[\frac{\exp(a{1})}{\sum{c}\exp(a{c})},\cdot\cdot\cdot,\frac{\exp(a{C})}{\sum{c}\exp(a{c})}\right]^{T}$.
- Strictly positive.
- Sums to one.
- Support multi-way classification, e.g. discriminative learning.

NorbertZheng commented 2 years ago

Universal Approximation

Universal Approximation Theorem (Horni, 1991):
- A single hidden layer neural network with a linear output unit can approximate any continuous arbitrarily well, given enough units.
This applies for sigmoid, tanh and many other activation functions.
However, this does not mean that there is learning algorithm that can find the necessary parameter values.

NorbertZheng commented 2 years ago

Gradient Descend

Perform updates after seeing each examples:

Initialize: $\theta=\{\mathbf{W}^{(1)},\mathbf{b}^{(1)},\cdot\cdot\cdot,\mathbf{W}^{(L+1)},\mathbf{b}^{(L+1)}\}$.
For $t=1:T$
- For each training example $(\mathbf{x}^{(t)},y^{(t)})$, e.g. training epoch (iteration of all examples)
  - $\Delta=-\nabla{\theta}l(f(\mathbf{x}^{(t)};\theta),y^{(t)})-\lambda\nabla{\theta}\Omega(\theta)$
  - $\theta \leftarrow \theta + \alpha\Delta$

To train a neural net, we need:

Loss function & Regularizer: $l(\mathbf{f}(\mathbf{x}^{(t)};\theta),y^{(t)})$, $\Omega(\theta)$.
- Let us start by considering a classification problem with a softmax output layer.
- We need to estimate: $f_{c}(\mathbf{x})=p(y=c|\mathbf{x})$.
  - We can maximize the log-probability of the correct class given an input: $logp(y^{(t)}=c|\mathbf{x}^{t})$.
  - Alternatively, we can minimize the negative log-likelihood (e.g. cross-entropy function of multi-class classification problem): $l(f(\mathbf{x}),y)=-\sum{c}1{(y=c)}logf{c}(\mathbf{x})=-logf{y}(\mathbf{x})$.

NorbertZheng commented 2 years ago

A procedure to compute gradients: $\nabla_{\theta}l(f(\mathbf{x}^{(t)};\theta),y^{(t)})$. Consider a network with $L$ hidden layers.

Hidden layer ( $k \in [1,L]$ ) pre-activation:

$$ \mathbf{a}^{(k)}(\mathbf{x})=\mathbf{b}^{(k)}+\mathbf{W}^{(k)}\mathbf{h}^{(k-1)}(\mathbf{x}) $$

Hidden layer ( $k \in [1,L]$ ) activation:

$$ \mathbf{h}^{(k)}(\mathbf{x})=\mathbf{g}(\mathbf{a}^{(k)}(\mathbf{x})) $$

Output layer ( $k=L+1$ ) activation:

$$ \mathbf{h}^{(L+1)}(\mathbf{x})=\mathbf{o}(\mathbf{a}^{(L+1)}(\mathbf{x}))=f(\mathbf{x}) $$

Loss gradient at output.

Partial derivative:

$$ \frac{\partial}{\partial f{c}(\mathbf{x})}-logf{y}(\mathbf{x})=-\frac{1{y=c}}{f{y}(\mathbf{x})} $$

Gradient:

$$ \begin{aligned} &\nabla{f(\mathbf{x})}-logf{y}(\mathbf{x})\ =&-\frac{1}{f{y}(\mathbf{x})}\left[\begin{matrix} 1{(y=0)}\ \cdot\ \cdot\ \cdot\ 1{(y=C-1)}\end{matrix}\right]\ =&-\frac{\mathbf{e}(y)}{f{y}(\mathbf{x})} \end{aligned} $$

Loss gradient at output pre-activation.

Partial derivative:

$$ \frac{\partial}{\partial a^{(L+1)}(\mathbf{x})}-logf{y}(\mathbf{x})=-(1{(y=c)}-f_{c}(\mathbf{x})) $$

Gradient:

$$ \begin{aligned} &\nabla{\mathbf{a}^{(L+1)}(\mathbf{x})}-logf{y}(\mathbf{x})\ =&-(\mathbf{e}(y)-f(\mathbf{x})) \end{aligned} $$

NorbertZheng commented 2 years ago

Note that we have the following equation:

$$ \frac{\partial \frac{g(x)}{h(x)}}{\partial x}=\frac{\partial g(x)}{\partial x}\frac{1}{h(x)}-\frac{g(x)}{h(x)^{2}}\frac{\partial h(x)}{\partial x} $$

Then we can derive the following equation:

$$ \begin{aligned} &\frac{\partial}{\partial \mathbf{a}^{(L+1)}{c}(\mathbf{x})}-logf_{y}(\mathbf{x})\ =&\frac{-1}{f_{y}(\mathbf{x})}\frac{\partial}{\partial \mathbf{a}^{(L+1)}_{c}(\mathbf{x})}softmax(\mathbf{a}^{(L+1)}\{y}(\mathbf{x}))\ \end{aligned} $$

NorbertZheng / read-papers