bluejad opened 8 years ago
maximum likelihood estimation
taking the logarithm transforms the product over examples into a sum
the empirical distribution p̂_data defined by the training data
KL divergence
This means when we train the model to minimize the KL divergence, we need only minimize the cross-entropy term −E_{x∼p̂_data}[log p_model(x)]
the conditional maximum likelihood estimator
the conditional distribution
the conditional log-likelihood
One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution p̂_data, defined by the training set, and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence.
KL divergence
Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model.
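As a quick numerical check (the toy categorical data and names below are mine, not from the book), the average negative log-likelihood of a dataset matches the cross-entropy between the empirical distribution and the model distribution:

```python
# Sketch: average NLL of a dataset == cross-entropy H(p_hat_data, p_model).
# The categorical data and model probabilities are illustrative placeholders.
import numpy as np

data = np.array([0, 0, 1, 2, 2, 2])          # observed categories
p_model = np.array([0.3, 0.2, 0.5])          # some model distribution

# Average negative log-likelihood of the data under the model
nll = -np.mean(np.log(p_model[data]))

# Empirical distribution defined by the training set
counts = np.bincount(data, minlength=3)
p_hat = counts / counts.sum()

# Cross-entropy between empirical distribution and model distribution
cross_entropy = -np.sum(p_hat * np.log(p_model))

print(np.isclose(nll, cross_entropy))        # True
```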
Mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.
the true data-generating distribution p_data
ŷ(x; w): the prediction of the mean of the Gaussian
Comparing the log-likelihood with the mean squared error
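A small sketch of that comparison (toy data and variable names are my own): with a fixed-variance Gaussian p(y | x) = N(y; ŷ, σ²), the negative log-likelihood is just a rescaled, shifted MSE, so both are minimized by the same parameters:

```python
# Sketch: Gaussian NLL with fixed variance differs from MSE only by a scale
# and an additive constant.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)          # targets
yhat = rng.normal(size=100)       # model predictions of the Gaussian mean
sigma2 = 1.0

mse = np.mean((y - yhat) ** 2)
nll = np.mean(0.5 * np.log(2 * np.pi * sigma2) + (y - yhat) ** 2 / (2 * sigma2))

# With sigma2 == 1:  nll == 0.5*log(2*pi) + 0.5*mse
print(np.isclose(nll, 0.5 * np.log(2 * np.pi) + 0.5 * mse))   # True
```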
Properties of Maximum Likelihood
p(θ)
prior probability distribution
combining the data likelihood with the prior via Bayes' rule
the predicted distribution over the next data sample
The prediction is parametrized by the vector w ∈ Rn
Given a set of m training samples (X^(train), y^(train)), we can express the prediction of y over the entire training set as ŷ^(train) = X^(train) w
Expressed as a Gaussian conditional distribution on y^(train), we have p(y^(train) | X^(train), w) = N(y^(train); X^(train) w, I)
For real-valued parameters it is common to use a Gaussian as a prior distribution
Maximum A Posteriori (MAP) Estimation
The MAP estimate chooses the point of maximal posterior probability (or maximal probability density in the more common case of continuous θ)
We recognize, above on the right-hand side, log p(x | θ), i.e. the standard log-likelihood term, and log p(θ), corresponding to the prior distribution
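A minimal sketch of MAP vs. maximum likelihood (the toy observations and the σ², τ² values are illustrative): estimating the mean of a Gaussian with a N(0, τ²) prior, the MAP estimate maximizes log p(x | θ) + log p(θ) and shrinks the ML estimate (the sample mean) toward zero:

```python
# Sketch: MAP estimate of a Gaussian mean with known variance and a N(0, tau2) prior.
import numpy as np

x = np.array([2.1, 1.8, 2.4, 2.0, 1.9])   # toy observations
sigma2, tau2 = 1.0, 0.5                    # known noise variance, prior variance

theta_ml = x.mean()                                  # maximizes log p(x | theta)
theta_map = x.sum() / (len(x) + sigma2 / tau2)       # maximizes log p(x | theta) + log p(theta)
print(theta_ml, theta_map)                           # MAP estimate is shrunk toward 0
```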
Probabilistic Supervised Learning
Most supervised learning algorithms in this book are based on estimating a probability distribution p(y | x)
We can do this simply by using maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x;θ)
We have already seen that linear regression corresponds to the family p(y | x) = N(y; θᵀx, I)
One way to solve this problem is to use the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpret that value as a probability
negative log-likelihood (NLL)
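A rough sketch of that idea (synthetic data, my own variable names): squash the linear score with the sigmoid, treat the output as P(y = 1 | x), and fit by gradient descent on the negative log-likelihood:

```python
# Sketch of logistic regression trained by minimizing the NLL of Bernoulli labels.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, b, X, y):
    p = sigmoid(X @ w + b)                      # P(y = 1 | x)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)       # toy labels

# A few steps of gradient descent on the NLL
w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(200):
    p = sigmoid(X @ w + b)
    w -= lr * X.T @ (p - y) / len(y)            # gradient of NLL w.r.t. w
    b -= lr * np.mean(p - y)                    # gradient of NLL w.r.t. b
print(nll(w, b, X, y))
```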
Support Vector Machines
This model is similar to logistic regression in that it is driven by a linear function
The SVM predicts that the positive class is present when wᵀx + b is positive. Likewise, it predicts that the negative class is present when wᵀx + b is negative
One key innovation associated with support vector machines is the kernel trick
Rewriting the learning algorithm this way allows us to replace x by the output of a given feature function φ(x) and the dot product with a function k(x, x^(i)) = φ(x)·φ(x^(i)) called a kernel
After replacing dot products with kernel evaluations, we can make predictions using the function f(x) = b + Σᵢ αᵢ k(x, x^(i))
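A sketch of that kernelized predictor (the Gaussian kernel and the α, b values below are illustrative placeholders; in a real SVM they come from the training procedure):

```python
# Sketch: prediction f(x) = b + sum_i alpha_i * k(x, x_i) with a Gaussian (RBF) kernel.
import numpy as np

def k(x, xi, gamma=1.0):
    return np.exp(-gamma * np.sum((x - xi) ** 2))   # Gaussian kernel

def predict(x, X_train, alpha, b):
    return b + sum(a * k(x, xi) for a, xi in zip(alpha, X_train))

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])        # toy "support vectors"
alpha, b = np.array([0.7, -0.3]), 0.1               # placeholder coefficients
print(predict(np.array([0.5, 0.5]), X_train, alpha, b))
```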
Other Simple Supervised Learning Algorithms
Unsupervised Learning Algorithms
The unbiased sample covariance matrix associated with X is given by Var[x] = (1/(m−1)) XᵀX
PCA finds a representation (through linear transformation) z = xᵀW where Var[z] is diagonal
we saw that the principal components of a design matrix X are given by the eigenvectors of XᵀX. From this view, the principal components are also the right-singular vectors W of the singular value decomposition X = UΣWᵀ
WWᵀ = I
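A small sketch of that (toy data, my own setup): take the eigenvectors of the centered sample covariance as the columns of W and check that z = xᵀW has a diagonal covariance:

```python
# Sketch: PCA via eigendecomposition; the transformed data has diagonal covariance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])
X = X - X.mean(axis=0)                       # center the data

cov = X.T @ X / (X.shape[0] - 1)             # unbiased sample covariance
eigvals, W = np.linalg.eigh(cov)             # columns of W are eigenvectors

Z = X @ W                                    # z = x^T W for every row x
print(np.round(np.cov(Z, rowvar=False), 3))  # approximately diagonal
```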
Stochastic Gradient Descent
the negative conditional log-likelihood of the training data can be written as J(θ) = E_{x,y∼p̂_data} L(x, y, θ) = (1/m) Σ_{i=1}^{m} L(x^(i), y^(i), θ), where L is the per-example loss L(x, y, θ) = −log p(y | x; θ)
The computational cost of this operation is O(m)
The estimate of the gradient is formed as g = (1/m′) ∇_θ Σ_{i=1}^{m′} L(x^(i), y^(i), θ)
using examples from the minibatch B. The stochastic gradient descent algorithm then follows the estimated gradient downhill: θ ← θ − εg
where ε is the learning rate
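A minimal sketch of minibatch SGD on a linear-regression loss (the learning rate ε, minibatch size, and synthetic data are my own choices):

```python
# Sketch: minibatch stochastic gradient descent for linear regression (MSE loss).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
epsilon, m_prime = 0.05, 32                  # learning rate and minibatch size

for step in range(500):
    idx = rng.choice(len(y), size=m_prime, replace=False)   # sample minibatch B
    Xb, yb = X[idx], y[idx]
    g = Xb.T @ (Xb @ w - yb) / m_prime       # gradient estimate from the minibatch
    w = w - epsilon * g                      # follow the estimated gradient downhill
print(w)
```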
Building a Machine Learning Algorithm
the linear regression algorithm combines a dataset consisting of X and y, the cost function J(w, b) = −E_{x,y∼p̂_data} log p_model(y | x),
the model specification p_model(y | x) = N(y; xᵀw + b, 1),
and, in most cases, the optimization algorithm defined by solving for where the gradient of the cost is zero using the normal equations
we can add weight decay to the linear regression cost function to obtain J(w, b) = λ‖w‖₂² − E_{x,y∼p̂_data} log p_model(y | x)
This still allows closed-form optimization
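A sketch of that closed form (λ below is an illustrative regularization strength): w = (XᵀX + λI)⁻¹ Xᵀy, which reduces to the ordinary normal equations when λ = 0:

```python
# Sketch: closed-form linear regression with weight decay (ridge regression).
import numpy as np

def ridge_fit(X, y, lam):
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.05 * rng.normal(size=100)
print(ridge_fit(X, y, lam=0.1))
```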
we can obtain the first PCA vector by specifying that our loss function is J(w) = E_{x∼p̂_data} ‖x − r(x; w)‖₂²
while our model is defined to have w with norm one and reconstruction function r(x) = wᵀx w
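A sketch checking that claim on toy data (my own setup): with ‖w‖ = 1 and r(x) = (wᵀx) w, the top eigenvector of XᵀX gives the smallest mean squared reconstruction error:

```python
# Sketch: the first principal component minimizes the reconstruction error of r(x) = (w^T x) w.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
X = X - X.mean(axis=0)

def recon_error(w, X):
    w = w / np.linalg.norm(w)                      # enforce ||w|| = 1
    R = (X @ w)[:, None] * w                       # r(x) = (w^T x) w, row-wise
    return np.mean(np.sum((X - R) ** 2, axis=1))

eigvals, eigvecs = np.linalg.eigh(X.T @ X)
w_pca = eigvecs[:, -1]                             # eigenvector with largest eigenvalue
print(recon_error(w_pca, X), recon_error(np.array([0.0, 1.0]), X))  # PCA error is smaller
```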
Challenges Motivating Deep Learning
As the number of relevant dimensions of the data increases (from left to right), the number of configurations of interest may grow exponentially
Illustration of how the nearest neighbor algorithm breaks up the input space into regions
Part II Deep Networks: Modern Practices
multilayer perceptrons (MLPs)
A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation
no feedback connections
Deep feedforward networks
feedforward neural networks
multilayer perceptrons (MLPs)
feedback connections
recurrent neural networks
The goal of a feedforward network is to approximate some function f∗
The dimensionality of these hidden layers determines the width of the model
Example: Learning XOR
The XOR function (“exclusive or”) is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f∗(x) that we want to learn. Our model provides a function y = f(x; θ) and our learning algorithm will adapt the parameters θ to make f as similar as possible to f∗
Evaluated on our whole training set, the MSE loss function is J(θ) = (1/4) Σ_{x∈X} (f∗(x) − f(x; θ))²
Now we must choose the form of our model, f(x; θ). Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be f(x; w, b) = xᵀw + b
minimize J(θ)
four points X = {[0,0]ᵀ, [0,1]ᵀ, [1,0]ᵀ, [1,1]ᵀ}
Solving gives w = 0 and b = 1/2: the linear model simply outputs 0.5 everywhere and cannot represent XOR
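A quick least-squares check of that answer (NumPy, my own setup):

```python
# Sketch: fit f(x) = x^T w + b to the four XOR points by least squares.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])                 # XOR targets

A = np.hstack([X, np.ones((4, 1))])                # append a column for the bias b
wb, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = wb[:2], wb[2]
print(w, b)                                        # approximately [0, 0] and 0.5
print(A @ wb)                                      # the model outputs 0.5 for every input
```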
lucky me