NorbertZheng / read-papers

My paper reading notes.

CMU-10707 Deep Learning. #40

Open NorbertZheng opened 1 year ago

NorbertZheng commented 1 year ago

Related Reference

NorbertZheng commented 1 year ago

Introduction to Machine Learning, Regression

NorbertZheng commented 1 year ago

Important Breakthroughs

NorbertZheng commented 1 year ago

Representation Learning

Examples of Representation Learning

How do we transform the data vectors (e.g. with polynomial features or kernel methods) so that the data becomes linearly separable in the feature space for the downstream learning algorithm?
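
As a toy illustration (my own sketch, not from the lecture; the data and the feature map are made up), an XOR-style dataset is not linearly separable in the raw input space, but a simple polynomial feature map makes it separable with a linear rule:

```python
import numpy as np

# XOR-style data: not linearly separable in the raw 2-D input space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def polynomial_features(X):
    """Toy feature map phi(x) = [x1, x2, x1*x2] (an illustrative choice)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1, x2, x1 * x2], axis=1)

Phi = polynomial_features(X)
# In the transformed space a single linear rule separates the classes:
w, b = np.array([1.0, 1.0, -2.0]), -0.5
print((Phi @ w + b > 0).astype(int))  # -> [0 1 1 0], matches y
```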

Feature Engineering

NorbertZheng commented 1 year ago

Types of Learning

Consider observing a series of input vectors:

$$ \mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3},\mathbf{x}_{4},\dots $$

NorbertZheng commented 1 year ago

Neural Networks I

NorbertZheng commented 1 year ago

Feed-forward Neural Networks


NorbertZheng commented 1 year ago

Activation Function

The output range is determined by $g(\cdot)$. The bias only shifts the position of the ridge.
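
A minimal sketch (my own; the weight, bias values, and function names are illustrative) of how the output range follows from $g(\cdot)$ while the bias only shifts where the transition sits along the input axis:

```python
import numpy as np

def neuron(x, w=2.0, b=0.0, g=np.tanh):
    """Single neuron h(x) = g(w*x + b); w and b are illustrative values."""
    return g(w * x + b)

x = np.linspace(-3.0, 3.0, 7)
print(neuron(x, b=0.0))          # tanh keeps the output in (-1, 1)
print(neuron(x, b=2.0))          # same range, transition shifted by the bias
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
print(neuron(x, g=sigmoid))      # swapping g changes the range to (0, 1)
```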

NorbertZheng commented 1 year ago

Universal Approximation


NorbertZheng commented 1 year ago

Gradient Descent

Perform an update after seeing each example (stochastic gradient descent).
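
A rough sketch of this per-example update loop (my own; `sgd`, `loss_grad`, `theta`, and the hyper-parameters are placeholder names, assuming NumPy-style parameters):

```python
import random

def sgd(theta, data, loss_grad, lr=0.01, epochs=10):
    """Per-example (stochastic) gradient descent.

    Assumes theta is a NumPy array (or similar) and that
    loss_grad(theta, x_t, y_t) returns nabla_theta l(f(x_t; theta), y_t).
    """
    for _ in range(epochs):
        random.shuffle(data)                                 # visit examples in random order
        for x_t, y_t in data:
            theta = theta - lr * loss_grad(theta, x_t, y_t)  # update after each example
    return theta
```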

To train a neural net, we need a loss function and a procedure to compute its gradients with respect to the parameters:

NorbertZheng commented 1 year ago

A procedure to compute gradients: $\nabla_{\theta}l(f(\mathbf{x}^{(t)};\theta),y^{(t)})$. Consider a network with $L$ hidden layers.

$$ \mathbf{a}^{(k)}(\mathbf{x})=\mathbf{b}^{(k)}+\mathbf{W}^{(k)}\mathbf{h}^{(k-1)}(\mathbf{x}) $$

$$ \mathbf{h}^{(k)}(\mathbf{x})=\mathbf{g}(\mathbf{a}^{(k)}(\mathbf{x})) $$

$$ \mathbf{h}^{(L+1)}(\mathbf{x})=\mathbf{o}(\mathbf{a}^{(L+1)}(\mathbf{x}))=f(\mathbf{x}) $$
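
These equations translate almost line by line into a forward pass. A sketch under the assumption that the weights and biases are given as lists `W[k]`, `b[k]`, with $g=\tanh$ and a softmax output $o$ (all names here are my own):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, W, b, g=np.tanh, o=softmax):
    """Forward pass: a^(k) = b^(k) + W^(k) h^(k-1), h^(k) = g(a^(k)), f(x) = o(a^(L+1))."""
    h = x                                # h^(0)(x) = x
    for k in range(len(W) - 1):          # hidden layers 1..L
        a = b[k] + W[k] @ h
        h = g(a)
    return o(b[-1] + W[-1] @ h)          # output layer L+1

# Example with L = 1 hidden layer, 2 inputs, 3 hidden units, 2 classes:
rng = np.random.default_rng(0)
W = [rng.normal(size=(3, 2)), rng.normal(size=(2, 3))]
b = [np.zeros(3), np.zeros(2)]
print(forward(np.array([0.5, -1.0]), W, b))  # probabilities summing to 1
```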

Loss gradient at the output.

$$ \frac{\partial}{\partial f_{c}(\mathbf{x})}-\log f_{y}(\mathbf{x})=-\frac{1_{(y=c)}}{f_{y}(\mathbf{x})} $$

$$ \begin{aligned} \nabla_{f(\mathbf{x})}-\log f_{y}(\mathbf{x}) &=-\frac{1}{f_{y}(\mathbf{x})}\left[\begin{matrix} 1_{(y=0)}\\ \vdots\\ 1_{(y=C-1)} \end{matrix}\right]\\ &=-\frac{\mathbf{e}(y)}{f_{y}(\mathbf{x})} \end{aligned} $$

Loss gradient at the output pre-activation.

$$ \frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}-\log f_{y}(\mathbf{x})=-(1_{(y=c)}-f_{c}(\mathbf{x})) $$

$$ \nabla_{\mathbf{a}^{(L+1)}(\mathbf{x})}-\log f_{y}(\mathbf{x})=-(\mathbf{e}(y)-f(\mathbf{x})) $$
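
A quick numerical sanity check (my own sketch, with arbitrary numbers) that the pre-activation gradient is indeed $-(\mathbf{e}(y)-f(\mathbf{x}))$, comparing it against finite differences of $-\log f_{y}(\mathbf{x})$:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def nll(a, y):
    """Negative log-likelihood -log f_y(x) as a function of the output pre-activation a."""
    return -np.log(softmax(a)[y])

a = np.array([0.5, -1.2, 2.0])      # arbitrary pre-activation a^(L+1)(x)
y, C = 1, 3                         # arbitrary target class among C classes
e_y = np.eye(C)[y]                  # one-hot vector e(y)

analytic = -(e_y - softmax(a))      # gradient from the formula above
eps = 1e-6
numeric = np.array([(nll(a + eps * np.eye(C)[c], y) - nll(a, y)) / eps for c in range(C)])
print(np.allclose(analytic, numeric, atol=1e-4))  # -> True
```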

NorbertZheng commented 1 year ago

Note that we have the following equation:

$$ \frac{\partial \frac{g(x)}{h(x)}}{\partial x}=\frac{\partial g(x)}{\partial x}\frac{1}{h(x)}-\frac{g(x)}{h(x)^{2}}\frac{\partial h(x)}{\partial x} $$

Then we can derive the following equation:

$$ \frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}-\log f_{y}(\mathbf{x})=\frac{-1}{f_{y}(\mathbf{x})}\frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}\mathrm{softmax}(\mathbf{a}^{(L+1)}(\mathbf{x}))_{y} $$
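
The derivation stops here in the notes. Completing it with the standard softmax partial derivative (my own filling-in, consistent with the result stated above):

$$ \frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}\mathrm{softmax}(\mathbf{a}^{(L+1)}(\mathbf{x}))_{y}=f_{y}(\mathbf{x})(1_{(y=c)}-f_{c}(\mathbf{x})) $$

Substituting back:

$$ \frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}-\log f_{y}(\mathbf{x})=\frac{-1}{f_{y}(\mathbf{x})}\cdot f_{y}(\mathbf{x})(1_{(y=c)}-f_{c}(\mathbf{x}))=-(1_{(y=c)}-f_{c}(\mathbf{x})) $$

which recovers the loss gradient at the output pre-activation given earlier.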