Deep Belief Networks, 2006 (Unsupervised)
The question is how to transform the data (e.g., with polynomial features, kernel methods, or other feature transformations of the data vectors) so that it becomes linearly separable in the feature space for the learning algorithms that follow.
Consider observing a series of input vectors:
$$ \mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3},\mathbf{x}_{4},\dots $$
Unsupervised Learning: The goal is to build a statistical model of $\mathbf{x}$ that finds a useful representation of the data.
The range is determined by $g(\cdot)$. The bias only changes the position of the ridge.
Linear activation function: $g(a)=a$.
Sigmoid activation function: $g(a)=\text{sigm}(a)=\frac{1}{1+\exp(-a)}$.
Hyperbolic tangent ("tanh") activation function: $g(a)=\tanh(a)=\frac{\exp(a)-\exp(-a)}{\exp(a)+\exp(-a)}=\frac{\exp(2a)-1}{\exp(2a)+1}$.
Rectified linear (ReLU) activation function: $g(a)=\text{reclin}(a)=\max(0,a)$.
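As a minimal illustration (not part of the original notes), these activation functions can be written in NumPy as follows; the function names are my own:

```python
import numpy as np

def linear(a):
    # Linear activation: g(a) = a, range (-inf, inf).
    return a

def sigm(a):
    # Sigmoid activation: g(a) = 1 / (1 + exp(-a)), range (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # Hyperbolic tangent activation, range (-1, 1).
    return np.tanh(a)

def reclin(a):
    # Rectified linear (ReLU) activation: g(a) = max(0, a), range [0, inf).
    return np.maximum(0.0, a)
```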
Perform updates after seeing each example (i.e., stochastic, per-example updates).
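A minimal sketch of one such per-example update, assuming a hypothetical `loss_grad` helper that returns $\nabla_{\theta}l(f(\mathbf{x}^{(t)};\theta),y^{(t)})$ for every parameter (the learning rate and the structure of `theta` are illustrative assumptions, not from the original notes):

```python
def sgd_step(theta, x_t, y_t, loss_grad, lr=0.01):
    """One per-example (stochastic) update on training example (x_t, y_t).

    `theta` is a list of parameter arrays; `loss_grad(theta, x_t, y_t)` is
    assumed to return the gradient of the per-example loss for each of them.
    """
    grads = loss_grad(theta, x_t, y_t)
    # Move each parameter a small step against its gradient.
    return [w - lr * g for w, g in zip(theta, grads)]
```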
To train a neural net, we need:
A procedure to compute gradients: $\nabla_{\theta}l(f(\mathbf{x}^{(t)};\theta),y^{(t)})$. Consider a network with $L$ hidden layers.
$$ \mathbf{a}^{(k)}(\mathbf{x})=\mathbf{b}^{(k)}+\mathbf{W}^{(k)}\mathbf{h}^{(k-1)}(\mathbf{x}) $$
$$ \mathbf{h}^{(k)}(\mathbf{x})=\mathbf{g}(\mathbf{a}^{(k)}(\mathbf{x})) $$
$$ \mathbf{h}^{(L+1)}(\mathbf{x})=\mathbf{o}(\mathbf{a}^{(L+1)}(\mathbf{x}))=f(\mathbf{x}) $$
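As a rough sketch (my own, not from the notes), this forward pass can be written in NumPy, assuming `weights[k]`/`biases[k]` hold $\mathbf{W}^{(k+1)}$/$\mathbf{b}^{(k+1)}$, a hidden activation `g`, and a softmax output activation $\mathbf{o}$:

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax output activation o(a).
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

def forward(x, weights, biases, g=np.tanh):
    """Compute f(x) for a network with L hidden layers.

    `weights` and `biases` contain L+1 entries; the last pair belongs to
    the output layer (L+1).
    """
    h = x  # h^(0)(x) = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = b + W @ h   # a^(k)(x) = b^(k) + W^(k) h^(k-1)(x)
        h = g(a)        # h^(k)(x) = g(a^(k)(x))
    a_out = biases[-1] + weights[-1] @ h
    return softmax(a_out)  # h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
```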
Loss gradient at output.
$$ \frac{\partial}{\partial f_{c}(\mathbf{x})}-\log f_{y}(\mathbf{x})=-\frac{1_{(y=c)}}{f_{y}(\mathbf{x})} $$
$$ \begin{aligned} \nabla_{f(\mathbf{x})}-\log f_{y}(\mathbf{x}) &=-\frac{1}{f_{y}(\mathbf{x})}\left[\begin{matrix} 1_{(y=0)}\\ \vdots\\ 1_{(y=C-1)}\end{matrix}\right]\\ &=-\frac{\mathbf{e}(y)}{f_{y}(\mathbf{x})} \end{aligned} $$
Loss gradient at output pre-activation.
$$ \frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}-\log f_{y}(\mathbf{x})=-\left(1_{(y=c)}-f_{c}(\mathbf{x})\right) $$
$$ \begin{aligned} \nabla_{\mathbf{a}^{(L+1)}(\mathbf{x})}-\log f_{y}(\mathbf{x}) &=-\left(\mathbf{e}(y)-f(\mathbf{x})\right) \end{aligned} $$
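To make this concrete, a small sketch (again an illustration, reusing the `softmax`/`forward` sketch above) that computes this gradient, where $\mathbf{e}(y)$ is the one-hot vector for class $y$:

```python
import numpy as np

def output_preactivation_grad(f_x, y):
    """Gradient of -log f_y(x) w.r.t. the output pre-activation a^(L+1)(x).

    f_x : softmax output f(x), shape (C,)
    y   : integer class label in {0, ..., C-1}
    """
    e_y = np.zeros_like(f_x)
    e_y[y] = 1.0            # one-hot indicator vector e(y)
    return -(e_y - f_x)     # -(e(y) - f(x))
```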
To see why this holds, note the quotient rule for derivatives:
$$ \frac{\partial \frac{g(x)}{h(x)}}{\partial x}=\frac{\partial g(x)}{\partial x}\frac{1}{h(x)}-\frac{g(x)}{h(x)^{2}}\frac{\partial h(x)}{\partial x} $$
Then we can derive the following equation:
$$ \begin{aligned} \frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}-\log f_{y}(\mathbf{x}) &=\frac{-1}{f_{y}(\mathbf{x})}\frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}\text{softmax}(\mathbf{a}^{(L+1)}(\mathbf{x}))_{y} \end{aligned} $$
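The derivation stops here in the original; filling in the remaining step (my reconstruction): writing $\text{softmax}(\mathbf{a}^{(L+1)}(\mathbf{x}))_{y}=\frac{\exp(a^{(L+1)}_{y}(\mathbf{x}))}{\sum_{c'}\exp(a^{(L+1)}_{c'}(\mathbf{x}))}$ and applying the quotient rule above gives

$$ \frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}\text{softmax}(\mathbf{a}^{(L+1)}(\mathbf{x}))_{y}=f_{y}(\mathbf{x})\left(1_{(y=c)}-f_{c}(\mathbf{x})\right) $$

so that

$$ \frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}-\log f_{y}(\mathbf{x})=\frac{-1}{f_{y}(\mathbf{x})}f_{y}(\mathbf{x})\left(1_{(y=c)}-f_{c}(\mathbf{x})\right)=-\left(1_{(y=c)}-f_{c}(\mathbf{x})\right), $$

which recovers the loss gradient at the output pre-activation stated above.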
Related Reference