bluejad opened 8 years ago
maximum likelihood estimation
taking the logarithm transforms the product over examples into a sum
the empirical distribution p̂_data defined by the training data
KL divergence
This means when we train the model to minimize the KL divergence, we need only minimize the cross-entropy term −E_{x∼p̂_data}[log p_model(x)]
the conditional maximum likelihood estimator
the conditional distribution
the conditional log-likelihood
One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution p̂_data, defined by the training set, and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence.
KL divergence
Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model.
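As a quick numerical check (the toy categorical data and names below are mine, not from the book), the average negative log-likelihood of a dataset matches the cross-entropy between the empirical distribution and the model distribution:

```python
# Sketch: average NLL of a dataset == cross-entropy H(p_hat_data, p_model).
# The categorical data and model probabilities are illustrative placeholders.
import numpy as np

data = np.array([0, 0, 1, 2, 2, 2])          # observed categories
p_model = np.array([0.3, 0.2, 0.5])          # some model distribution

# Average negative log-likelihood of the data under the model
nll = -np.mean(np.log(p_model[data]))

# Empirical distribution defined by the training set
counts = np.bincount(data, minlength=3)
p_hat = counts / counts.sum()

# Cross-entropy between empirical distribution and model distribution
cross_entropy = -np.sum(p_hat * np.log(p_model))

print(np.isclose(nll, cross_entropy))        # True
```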
Mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.
the true data-generating distribution p_data
ŷ(x; w): the prediction of the mean of the Gaussian
Comparing the log-likelihood with the mean squared error
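A small sketch of that comparison (toy data and variable names are my own): with a fixed-variance Gaussian p(y | x) = N(y; ŷ, σ²), the negative log-likelihood is just a rescaled, shifted MSE, so both are minimized by the same parameters:

```python
# Sketch: Gaussian NLL with fixed variance differs from MSE only by a scale
# and an additive constant.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)          # targets
yhat = rng.normal(size=100)       # model predictions of the Gaussian mean
sigma2 = 1.0

mse = np.mean((y - yhat) ** 2)
nll = np.mean(0.5 * np.log(2 * np.pi * sigma2) + (y - yhat) ** 2 / (2 * sigma2))

# With sigma2 == 1:  nll == 0.5*log(2*pi) + 0.5*mse
print(np.isclose(nll, 0.5 * np.log(2 * np.pi) + 0.5 * mse))   # True
```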
Properties of Maximum Likelihood
p(θ)
prior probability distribution
combining the data likelihood with the prior via Bayes' rule
the predicted distribution over the next data sample
The prediction is parametrized by the vector w ∈ Rn
Given a set of m training samples (X^(train), y^(train)), we can express the prediction of y over the entire training set as ŷ^(train) = X^(train) w
Expressed as a Gaussian conditional distribution on y^(train), we have p(y^(train) | X^(train), w) = N(y^(train); X^(train) w, I)
For real-valued parameters it is common to use a Gaussian as a prior distribution
Maximum A Posteriori (MAP) Estimation
The MAP estimate chooses the point of maximal posterior probability (or maximal probability density in the more common case of continuous θ)
We recognize, above on the right-hand side, log p(x | θ), i.e. the standard log-likelihood term, and log p(θ), corresponding to the prior distribution
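A minimal sketch of MAP vs. maximum likelihood (the toy observations and the σ², τ² values are illustrative): estimating the mean of a Gaussian with a N(0, τ²) prior, the MAP estimate maximizes log p(x | θ) + log p(θ) and shrinks the ML estimate (the sample mean) toward zero:

```python
# Sketch: MAP estimate of a Gaussian mean with known variance and a N(0, tau2) prior.
import numpy as np

x = np.array([2.1, 1.8, 2.4, 2.0, 1.9])   # toy observations
sigma2, tau2 = 1.0, 0.5                    # known noise variance, prior variance

theta_ml = x.mean()                                  # maximizes log p(x | theta)
theta_map = x.sum() / (len(x) + sigma2 / tau2)       # maximizes log p(x | theta) + log p(theta)
print(theta_ml, theta_map)                           # MAP estimate is shrunk toward 0
```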
Probabilistic Supervised Learning
Most supervised learning algorithms in this book are based on estimating a probability distribution p(y | x)
We can do this simply by using maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x;θ)
We have already seen that linear regression corresponds to the family p(y | x) = N(y; θᵀx, I)
One way to solve this problem is to use the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpret that value as a probability
negative log-likelihood (NLL)
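A rough sketch of that idea (synthetic data, my own variable names): squash the linear score with the sigmoid, treat the output as P(y = 1 | x), and fit by gradient descent on the negative log-likelihood:

```python
# Sketch of logistic regression trained by minimizing the NLL of Bernoulli labels.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, b, X, y):
    p = sigmoid(X @ w + b)                      # P(y = 1 | x)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)       # toy labels

# A few steps of gradient descent on the NLL
w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(200):
    p = sigmoid(X @ w + b)
    w -= lr * X.T @ (p - y) / len(y)            # gradient of NLL w.r.t. w
    b -= lr * np.mean(p - y)                    # gradient of NLL w.r.t. b
print(nll(w, b, X, y))
```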
Support Vector Machines
This model is similar to logistic regression in that it is driven by a linear function
The SVM predicts that the positive class is present when wᵀx + b is positive. Likewise, it predicts that the negative class is present when wᵀx + b is negative
One key innovation associated with support vector machines is the kernel trick
Rewriting the learning algorithm this way allows us to replace x by the output of a given feature function φ(x) and the dot product with a function k(x, x^(i)) = φ(x)·φ(x^(i)) called a kernel
After replacing dot products with kernel evaluations, we can make predictions using the function f(x) = b + Σᵢ αᵢ k(x, x^(i))
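A sketch of that kernelized predictor (the Gaussian kernel and the α, b values below are illustrative placeholders; in a real SVM they come from the training procedure):

```python
# Sketch: prediction f(x) = b + sum_i alpha_i * k(x, x_i) with a Gaussian (RBF) kernel.
import numpy as np

def k(x, xi, gamma=1.0):
    return np.exp(-gamma * np.sum((x - xi) ** 2))   # Gaussian kernel

def predict(x, X_train, alpha, b):
    return b + sum(a * k(x, xi) for a, xi in zip(alpha, X_train))

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])        # toy "support vectors"
alpha, b = np.array([0.7, -0.3]), 0.1               # placeholder coefficients
print(predict(np.array([0.5, 0.5]), X_train, alpha, b))
```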
Other Simple Supervised Learning Algorithms
Unsupervised Learning Algorithms
The unbiased sample covariance matrix associated with X is given by Var[x] = (1/(m−1)) XᵀX
PCA finds a representation (through linear transformation) z = xᵀW where Var[z] is diagonal
we saw that the principal components of a design matrix X are given by the eigenvectors of XᵀX. From this view, the principal components are also the right-singular vectors W of the singular value decomposition X = UΣWᵀ
WWᵀ = I
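A small sketch of that (toy data, my own setup): take the eigenvectors of the centered sample covariance as the columns of W and check that z = xᵀW has a diagonal covariance:

```python
# Sketch: PCA via eigendecomposition; the transformed data has diagonal covariance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])
X = X - X.mean(axis=0)                       # center the data

cov = X.T @ X / (X.shape[0] - 1)             # unbiased sample covariance
eigvals, W = np.linalg.eigh(cov)             # columns of W are eigenvectors

Z = X @ W                                    # z = x^T W for every row x
print(np.round(np.cov(Z, rowvar=False), 3))  # approximately diagonal
```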
Stochastic Gradient Descent
the negative conditional log-likelihood of the training data can be written as J(θ) = E_{x,y∼p̂_data} L(x, y, θ) = (1/m) Σ_{i=1}^{m} L(x^(i), y^(i), θ), where L is the per-example loss L(x, y, θ) = −log p(y | x; θ)
The computational cost of this operation is O(m)
The estimate of the gradient is formed as g = (1/m′) ∇_θ Σ_{i=1}^{m′} L(x^(i), y^(i), θ)
using examples from the minibatch B. The stochastic gradient descent algorithm then follows the estimated gradient downhill: θ ← θ − εg
where ε is the learning rate
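A minimal sketch of minibatch SGD on a linear-regression loss (the learning rate ε, minibatch size, and synthetic data are my own choices):

```python
# Sketch: minibatch stochastic gradient descent for linear regression (MSE loss).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
epsilon, m_prime = 0.05, 32                  # learning rate and minibatch size

for step in range(500):
    idx = rng.choice(len(y), size=m_prime, replace=False)   # sample minibatch B
    Xb, yb = X[idx], y[idx]
    g = Xb.T @ (Xb @ w - yb) / m_prime       # gradient estimate from the minibatch
    w = w - epsilon * g                      # follow the estimated gradient downhill
print(w)
```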
Building a Machine Learning Algorithm
the linear regression algorithm combines a dataset consisting of X and y, the cost function J(w, b) = −E_{x,y∼p̂_data} log p_model(y | x),
the model specification p_model(y | x) = N(y; xᵀw + b, 1),
and, in most cases, the optimization algorithm defined by solving for where the gradient of the cost is zero using the normal equations
we can add weight decay to the linear regression cost function to obtain J(w, b) = λ‖w‖₂² − E_{x,y∼p̂_data} log p_model(y | x)
This still allows closed-form optimization
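A sketch of that closed form (λ below is an illustrative regularization strength): w = (XᵀX + λI)⁻¹ Xᵀy, which reduces to the ordinary normal equations when λ = 0:

```python
# Sketch: closed-form linear regression with weight decay (ridge regression).
import numpy as np

def ridge_fit(X, y, lam):
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.05 * rng.normal(size=100)
print(ridge_fit(X, y, lam=0.1))
```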
we can obtain the first PCA vector by specifying that our loss function is J(w) = E_{x∼p̂_data} ‖x − r(x; w)‖₂²
while our model is defined to have w with norm one and reconstruction function r(x) = wᵀx w
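A sketch checking that claim on toy data (my own setup): with ‖w‖ = 1 and r(x) = (wᵀx) w, the top eigenvector of XᵀX gives the smallest mean squared reconstruction error:

```python
# Sketch: the first principal component minimizes the reconstruction error of r(x) = (w^T x) w.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
X = X - X.mean(axis=0)

def recon_error(w, X):
    w = w / np.linalg.norm(w)                      # enforce ||w|| = 1
    R = (X @ w)[:, None] * w                       # r(x) = (w^T x) w, row-wise
    return np.mean(np.sum((X - R) ** 2, axis=1))

eigvals, eigvecs = np.linalg.eigh(X.T @ X)
w_pca = eigvecs[:, -1]                             # eigenvector with largest eigenvalue
print(recon_error(w_pca, X), recon_error(np.array([0.0, 1.0]), X))  # PCA error is smaller
```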
Challenges Motivating Deep Learning
As the number of relevant dimensions of the data increases (from left to right), the number of configurations of interest may grow exponentially
Illustration of how the nearest neighbor algorithm breaks up the input space into regions
Part II Deep Networks: Modern Practices
multilayer perceptrons (MLPs)
A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation
no feedback connections
Deep feedforward networks
feedforward neural networks
multilayer perceptrons (MLPs)
feedback connections
recurrent neural networks
The goal of a feedforward network is to approximate some function f∗
The dimensionality of these hidden layers determines the width of the model
Example: Learning XOR
The XOR function (“exclusive or”) is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f∗(x) that we want to learn. Our model provides a function y = f(x; θ) and our learning algorithm will adapt the parameters θ to make f as similar as possible to f∗
Evaluated on our whole training set, the MSE loss function is J(θ) = (1/4) Σ_{x∈X} (f∗(x) − f(x; θ))²
Now we must choose the form of our model, f(x; θ). Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be f(x; w, b) = xᵀw + b
minimize J(θ)
four points X = {[0,0]ᵀ, [0,1]ᵀ, [1,0]ᵀ, [1,1]ᵀ}
Solving gives w = 0 and b = 1/2: the linear model simply outputs 0.5 everywhere and cannot represent XOR
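A quick least-squares check of that answer (NumPy, my own setup):

```python
# Sketch: fit f(x) = x^T w + b to the four XOR points by least squares.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])                 # XOR targets

A = np.hstack([X, np.ones((4, 1))])                # append a column for the bias b
wb, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = wb[:2], wb[2]
print(w, b)                                        # approximately [0, 0] and 0.5
print(A @ wb)                                      # the model outputs 0.5 for every input
```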
lucky me