NorbertZheng / read-papers

My paper reading notes.

CMU-10707 Deep Learning. #40

Open NorbertZheng opened 2 years ago

NorbertZheng commented 2 years ago

Related Reference

NorbertZheng commented 2 years ago

Introduction to Machine Learning, Regression

NorbertZheng commented 2 years ago

Important Breakthroughs

NorbertZheng commented 2 years ago

Representation Learning

Examples of Representation Learning

How do we transform the data so that it becomes linearly separable in a feature space, for use by the downstream learning algorithm? Typical feature transformations of the data vectors include polynomial features, kernel methods, etc., as in the sketch below.
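A minimal NumPy sketch (my own illustration, not from the slides; names like `phi` are just illustrative) of how a polynomial feature map can make data that is not linearly separable in input space linearly separable in feature space:

```python
import numpy as np

# Toy 1-D data that is not linearly separable in input space:
# class 1 lies inside [-1, 1], class 0 lies outside it.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# Polynomial feature map phi(x) = (x, x^2).
# In this 2-D feature space the classes can be split by a single
# threshold on the x^2 coordinate, i.e. by a linear boundary.
phi = np.stack([x, x**2], axis=1)

# A linear classifier w^T phi(x) + b with w = (0, -1), b = 1
# is positive exactly for the inner class.
w, b = np.array([0.0, -1.0]), 1.0
pred = (phi @ w + b > 0).astype(int)
print(pred)               # [0 0 1 1 1 0 0]
print((pred == y).all())  # True
```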

Feature Engineering

NorbertZheng commented 2 years ago

Types of Learning

Consider observing a series of input vectors:

$$ \mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3},\mathbf{x}_{4},\dots $$

NorbertZheng commented 2 years ago

Neural Networks I

NorbertZheng commented 2 years ago

Feed-forward Neural Networks


NorbertZheng commented 2 years ago

Activation Function

The range of the unit's output is determined by $g(\cdot)$; the bias only changes the position of the ridge.
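A small NumPy sketch (my own illustration, not from the lecture) of how the activation $g(\cdot)$ fixes the output range while the bias only shifts where the unit transitions:

```python
import numpy as np

# Common activation functions g(.) and the ranges they impose.
def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))   # range (0, 1)
def tanh(a):    return np.tanh(a)                  # range (-1, 1)
def relu(a):    return np.maximum(0.0, a)          # range [0, inf)

x = np.linspace(-5, 5, 11)
w = 2.0

# Changing the bias b shifts where the sigmoid "turns on"
# (the transition moves to x = -b/w); it does not change the range.
for b in (-2.0, 0.0, 2.0):
    h = sigmoid(w * x + b)
    print(f"b={b:+.0f}  crosses 0.5 near x={-b / w:+.1f}  "
          f"min={h.min():.3f}  max={h.max():.3f}")
```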

NorbertZheng commented 2 years ago

Universal Approximation

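The lecture figures are omitted here; as a stand-in, a small random-features sketch (my own, not from the slides) of the statement that a single hidden layer with enough units can approximate a smooth function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target function to approximate on [-3, 3].
x = np.linspace(-3, 3, 200)[:, None]
y = np.sin(2 * x).ravel()

# Single hidden layer of tanh units with random weights/biases;
# only the output weights are fit (least squares). As the number of
# hidden units grows, the approximation error shrinks.
for n_hidden in (2, 10, 100):
    W = rng.normal(scale=2.0, size=(1, n_hidden))
    b = rng.normal(scale=2.0, size=n_hidden)
    H = np.tanh(x @ W + b)                        # hidden activations
    w_out, *_ = np.linalg.lstsq(H, y, rcond=None)
    err = np.abs(H @ w_out - y).max()
    print(f"{n_hidden:3d} hidden units -> max error {err:.3f}")
```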

NorbertZheng commented 2 years ago

Gradient Descent

Perform updates after seeing each example:
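A standard way to write this per-example (stochastic gradient descent) update, with the learning rate $\alpha$ as an assumed symbol:

$$ \theta \leftarrow \theta-\alpha\nabla_{\theta}l(f(\mathbf{x}^{(t)};\theta),y^{(t)}) $$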

To train a neural net, we need:

NorbertZheng commented 2 years ago

A procedure to compute gradients: $\nabla_{\theta}l(f(\mathbf{x}^{(t)};\theta),y^{(t)})$. Consider a network with $L$ hidden layers.

$$ \mathbf{a}^{(k)}(\mathbf{x})=\mathbf{b}^{(k)}+\mathbf{W}^{(k)}\mathbf{h}^{(k-1)}(\mathbf{x}) $$

$$ \mathbf{h}^{(k)}(\mathbf{x})=\mathbf{g}(\mathbf{a}^{(k)}(\mathbf{x})) $$

$$ \mathbf{h}^{(L+1)}(\mathbf{x})=\mathbf{o}(\mathbf{a}^{(L+1)}(\mathbf{x}))=f(\mathbf{x}) $$
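A minimal NumPy sketch of this forward recursion (my own illustration; here `g` is tanh and the output non-linearity `o` is softmax, matching the classification setting below):

```python
import numpy as np

def softmax(a):
    a = a - a.max()                      # numerical stability
    e = np.exp(a)
    return e / e.sum()

def forward(x, weights, biases, g=np.tanh):
    """a^(k) = b^(k) + W^(k) h^(k-1),  h^(k) = g(a^(k)),
    f(x) = softmax(a^(L+1))."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers 1..L
        h = g(b + W @ h)
    return softmax(biases[-1] + weights[-1] @ h)  # output layer L+1

# Tiny example: 3 inputs -> 4 hidden units -> 2 classes, random parameters.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases  = [np.zeros(4), np.zeros(2)]
print(forward(rng.normal(size=3), weights, biases))  # probabilities summing to 1
```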

Loss gradient at output.

$$ \frac{\partial}{\partial f_{c}(\mathbf{x})}-\log f_{y}(\mathbf{x})=-\frac{1_{(y=c)}}{f_{y}(\mathbf{x})} $$

$$ \begin{aligned} \nabla_{f(\mathbf{x})}-\log f_{y}(\mathbf{x}) &=-\frac{1}{f_{y}(\mathbf{x})}\left[\begin{matrix} 1_{(y=0)}\\ \vdots\\ 1_{(y=C-1)} \end{matrix}\right]\\ &=-\frac{\mathbf{e}(y)}{f_{y}(\mathbf{x})} \end{aligned} $$

Loss gradient at output pre-activation.

$$ \frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}-\log f_{y}(\mathbf{x})=-(1_{(y=c)}-f_{c}(\mathbf{x})) $$

$$ \nabla_{\mathbf{a}^{(L+1)}(\mathbf{x})}-\log f_{y}(\mathbf{x})=-(\mathbf{e}(y)-f(\mathbf{x})) $$
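A quick NumPy check (my own sketch, not from the slides) that this analytic gradient $-(\mathbf{e}(y)-f(\mathbf{x}))$ matches finite differences for a softmax output with the negative log-likelihood loss:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def loss(a, y):                      # -log f_y(x) with f = softmax(a)
    return -np.log(softmax(a)[y])

rng = np.random.default_rng(0)
a, y = rng.normal(size=5), 2

# Analytic gradient at the output pre-activation: -(e(y) - f(x)).
e_y = np.eye(5)[y]
analytic = -(e_y - softmax(a))

# Central finite-difference check of each component.
eps = 1e-6
numeric = np.array([
    (loss(a + eps * np.eye(5)[c], y) - loss(a - eps * np.eye(5)[c], y)) / (2 * eps)
    for c in range(5)
])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```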

NorbertZheng commented 2 years ago

Note that we have the following equation:

$$ \frac{\partial \frac{g(x)}{h(x)}}{\partial x}=\frac{\partial g(x)}{\partial x}\frac{1}{h(x)}-\frac{g(x)}{h(x)^{2}}\frac{\partial h(x)}{\partial x} $$

Then we can derive the following equation:

$$ \frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}-\log f_{y}(\mathbf{x})=\frac{-1}{f_{y}(\mathbf{x})}\frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}\mathrm{softmax}(\mathbf{a}^{(L+1)}(\mathbf{x}))_{y} $$
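Carrying the derivation one step further (this continuation is my own, applying the quotient rule above to $\mathrm{softmax}(\mathbf{a})_{y}=e^{a_{y}}/\sum_{c'}e^{a_{c'}}$), the softmax derivative is:

$$ \frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}\mathrm{softmax}(\mathbf{a}^{(L+1)}(\mathbf{x}))_{y}=f_{y}(\mathbf{x})\left(1_{(y=c)}-f_{c}(\mathbf{x})\right) $$

Substituting back recovers the result stated earlier:

$$ \frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}-\log f_{y}(\mathbf{x})=\frac{-1}{f_{y}(\mathbf{x})}f_{y}(\mathbf{x})\left(1_{(y=c)}-f_{c}(\mathbf{x})\right)=-(1_{(y=c)}-f_{c}(\mathbf{x})) $$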