References:
- Misra Rishabh. Inference using EM algorithm.
- Nasrabadi NM. Pattern recognition and machine learning.
- 史博. What is the significance of the EM algorithm? (EM算法存在的意义是什么?)
The Expectation-Maximization (EM) algorithm is a powerful algorithm in statistical analysis: it can deal with missing data and unobserved (latent) features, use-cases that come up frequently in many real-world applications.
First, we should understand how parameters are estimated in simple use-cases. Thus, we first consider the case where all variables are observed, using logistic regression as an example.
Suppose we have a dataset where each data point consists of a d-dimensional feature vector X and an associated target variable Y ∈ {0,1}. A graphical representation of this model is shown below:
As we know, the prediction probability of the target variable in logistic regression is given by a sigmoid function:

P(Y=1 | X=x) = σ(w·x) = 1 / (1 + exp(−w·x)),

where w is the weight vector to be estimated. Once we estimate the parameters w, we can produce an output for an unobserved data point depending upon whether Y=0 or Y=1 gets more probability. We use the Maximum Likelihood approach to estimate w, i.e. we find the parameters w that maximize the likelihood (or probability) of the observed data, P(data).
For mathematical convenience, we'll consider maximizing the log of the likelihood. With the distribution of the data parameterized by w, the log-likelihood can be written as:

L(w) = ln P(data) = ln ∏_{i=1}^{N} P(Y=y_i | X=x_i)

(dropping the terms that do not depend on w).
Here the likelihood of the observed data is written as the product of the likelihoods of the individual data points, under the assumption that the data samples are independently and identically distributed (i.i.d.). Then we get:

L(w) = Σ_{i=1}^{N} ln P(Y=y_i | X=x_i) = Σ_{i=1}^{N} ln σ((2y_i − 1) w·x_i),

where we used the fact that P(Y=y_i | X=x_i) evaluates to σ(w·x_i) when y_i = 1, and to σ(−w·x_i) when y_i = 0.
At this point, we should note that L(w) conveniently breaks down into a per-instance form (something that is not achievable in many realistic scenarios). Since the resulting equations are non-linear in w, there is no closed-form solution, so we have to use an iterative optimization method such as Gradient Ascent to find w. An update of the gradient ascent method looks like:

w ← w + η ∇_w L(w), where ∇_w L(w) = Σ_{i=1}^{N} (y_i − σ(w·x_i)) x_i and η is the learning rate.

We repeat this update until convergence, and the w obtained at the end is called the maximum likelihood estimate (MLE).
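To make the procedure concrete, here is a minimal sketch of logistic-regression MLE via gradient ascent; the synthetic dataset, seed, learning rate, and iteration count are arbitrary choices for illustration, not part of the original discussion.

```python
import numpy as np

# Minimal sketch: maximum likelihood estimation for logistic regression
# via gradient ascent on a small synthetic dataset (illustrative only).
rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.5, -2.0, 0.5])              # "ground truth" used to simulate labels
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w):
    # L(w) = sum_i ln sigma((2 y_i - 1) w.x_i)
    return np.sum(np.log(sigmoid((2 * y - 1) * (X @ w))))

w = np.zeros(d)
eta = 0.5                                        # learning rate (arbitrary)
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ w))            # gradient of L(w)
    w = w + eta * grad / N                       # averaged gradient ascent step
print("MLE:", w, "log-likelihood:", log_likelihood(w))
```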
In most realistic scenarios, however, it is common to have latent (unobserved) variables or missing data, and having latent features makes a difference: estimating the model parameters gets a little tricky when latent features (or missing data) are involved.
Let V be the set of observed variables, Z the set of latent variables, and θ the set of model parameters. If we take the maximum likelihood approach for parameter estimation, our objective function will be:

L(θ) = ln p(V|θ) = ln Σ_Z p(V,Z|θ).

We can see that the parameters are coupled because of the summation inside the log. This makes optimization using Gradient Ascent (or any iterative optimization technique in general) intractable, which means that many realistic scenarios need a more powerful technique to infer the parameters.
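To see the coupling concretely, differentiating this objective gives the standard identity below (added here for completeness); every component of θ appears inside the posterior term, so the gradient no longer splits into the simple per-instance pieces we had for L(w) above.

```latex
\nabla_\theta \ln p(V \mid \theta)
  = \frac{\sum_{Z} \nabla_\theta\, p(V, Z \mid \theta)}{\sum_{Z} p(V, Z \mid \theta)}
  = \sum_{Z} p(Z \mid V, \theta)\, \nabla_\theta \ln p(V, Z \mid \theta)
```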
The EM algorithm uses the fact that optimizing the complete-data log-likelihood ln p(V,Z|θ) is much easier when we know the value of Z (the summation inside the log disappears). However, since the only way to "know" Z is through the posterior p(Z|V,θ), we instead consider the expected value of the complete-data log-likelihood under the posterior distribution of the latent variables. The step of finding this expectation is called the E-step. In the subsequent M-step, we maximize this expectation with respect to θ.
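Written out, the quantity constructed in the E-step and maximized in the M-step is the expected complete-data log-likelihood, usually denoted Q(θ, θ^{old}) (this follows the standard formulation in Pattern Recognition and Machine Learning, listed in the references above):

```latex
\mathcal{Q}(\theta, \theta^{old})
  = \mathbb{E}_{Z \sim p(Z \mid V, \theta^{old})}\!\left[\ln p(V, Z \mid \theta)\right]
  = \sum_{Z} p(Z \mid V, \theta^{old}) \, \ln p(V, Z \mid \theta)
```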
Formally, the EM algorithm can be written as:

1. Choose an initial setting for the parameters θ^{old}.
2. E-step: evaluate the posterior P(Z|V,θ^{old}).
3. M-step: evaluate θ^{new} given by θ^{new} = argmax_θ Q(θ,θ^{old}).
4. Check for convergence of either the log-likelihood or the parameter values; if the convergence criterion is not satisfied, set θ^{old} = θ^{new} and return to the E-step.
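As a concrete illustration, below is a minimal sketch of these four steps on the two-coin example from the Do et al. reference listed at the end: two coins with unknown biases θ_A and θ_B, five sets of ten tosses, and the identity of the coin used in each set as the latent variable Z. The head counts and initial values follow that paper's worked example; the equal prior over the two coins is an assumption of this sketch.

```python
import numpy as np

# Two-coin EM sketch (head counts and initialization as in Do et al., 2008).
heads = np.array([5, 9, 8, 4, 7])    # heads observed in each of 5 sets of 10 tosses
n = 10                               # tosses per set
theta_A, theta_B = 0.6, 0.5          # initial guesses, theta^{old}

for _ in range(20):
    # E-step: posterior probability that each set was produced by coin A,
    # assuming an equal prior over the two coins (the binomial coefficient
    # cancels in the ratio, so unnormalized likelihoods suffice).
    like_A = theta_A ** heads * (1 - theta_A) ** (n - heads)
    like_B = theta_B ** heads * (1 - theta_B) ** (n - heads)
    resp_A = like_A / (like_A + like_B)   # p(Z = A | V, theta^{old})
    resp_B = 1.0 - resp_A

    # M-step: maximize Q(theta, theta^{old}), which here reduces to
    # responsibility-weighted head frequencies.
    theta_A = np.sum(resp_A * heads) / np.sum(resp_A * n)
    theta_B = np.sum(resp_B * heads) / np.sum(resp_B * n)

print(theta_A, theta_B)   # approaches roughly 0.80 and 0.52
```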
Before diving deep, we will first derive a property that will come in handy while explaining the E and M steps. Let us consider a distribution q(Z) over the latent variables. Independently of the choice of q(Z), we can decompose the log-likelihood of the observed data in the following fashion:

ln p(V|θ) = L(q,θ) + KL(q‖p),

where

- L(q,θ) = Σ_Z q(Z) ln( p(V,Z|θ) / q(Z) ) contains the joint distribution of V and Z, i.e. p(V,Z|θ);
- KL(q‖p) = −Σ_Z q(Z) ln( p(Z|V,θ) / q(Z) ) contains the conditional distribution of Z given V, i.e. p(Z|V,θ).

One of the properties of the KL divergence is that it is always non-negative. Using this property, we can deduce that

L(q,θ) ≤ ln p(V|θ).
This means that L(q,θ) (note that this is not the same as L(θ)) acts as a lower bound on the log-likelihood of the observed data. This observation will shortly help in demonstrating that the EM algorithm does indeed maximize the log-likelihood L(θ).
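Since the decomposition holds for any q(Z), it is easy to check numerically; the toy numbers below are made up purely to verify the identity for a three-state latent variable.

```python
import numpy as np

# Numerically verify ln p(V|theta) = L(q, theta) + KL(q || p(Z|V,theta))
# for a single observation V = v and a discrete latent Z with 3 states.
p_vz = np.array([0.1, 0.3, 0.2])          # joint p(V=v, Z=z | theta), arbitrary
p_v = p_vz.sum()                          # marginal p(V=v | theta)
posterior = p_vz / p_v                    # p(Z | V=v, theta)

q = np.array([0.5, 0.25, 0.25])           # an arbitrary distribution q(Z)
lower_bound = np.sum(q * np.log(p_vz / q))        # L(q, theta)
kl = np.sum(q * np.log(q / posterior))            # KL(q || p(Z|V,theta)) >= 0

assert np.isclose(np.log(p_v), lower_bound + kl)
print(np.log(p_v), lower_bound, kl)
```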
Suppose the initial value of the parameter vector θ is θ^{old} (step 1 of the algorithm above). Keeping the above decomposition in mind, the E-step tries to maximize the lower bound L(q,θ^{old}) of L(θ) with respect to q while holding θ^{old} fixed.
- The log-likelihood ln p(V|θ), i.e. L(θ), does not depend on q(Z), so we are free to change q(Z) at will.
- The maximum of L(q,θ^{old}) will occur when the KL divergence vanishes, in other words when q(Z) is equal to the posterior distribution p(Z|V,θ^{old}).

Thus, the E-step involves evaluating p(Z|V,θ^{old}) (step 2 above).
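Substituting q(Z) = p(Z|V,θ^{old}) into the decomposition makes the bound tight, which is worth writing out (a short derivation added here for completeness):

```latex
\mathcal{L}(q, \theta^{old})
  = \sum_{Z} p(Z \mid V, \theta^{old})
      \ln \frac{p(V, Z \mid \theta^{old})}{p(Z \mid V, \theta^{old})}
  = \sum_{Z} p(Z \mid V, \theta^{old}) \ln p(V \mid \theta^{old})
  = \ln p(V \mid \theta^{old})
```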
In this step, the distribution q(Z) is held fixed. If we substitute q(Z) = p(Z|V,θ^{old}) from the E-step into the expression for L(q,θ), we see that the lower bound takes the following form:

L(q,θ) = Σ_Z p(Z|V,θ^{old}) ln p(V,Z|θ) − Σ_Z p(Z|V,θ^{old}) ln p(Z|V,θ^{old}) = Q(θ,θ^{old}) + const,

where the constant, −Σ_Z q(Z) ln q(Z), is the entropy of the q distribution and is therefore independent of θ.
So, in the M-step, we maximize the lower bound L(q,θ) with respect to θ to obtain some new value θ^{new} (step 3 above). This will cause the lower bound L(q,θ) to increase (unless it is already at a maximum), which will necessarily cause the corresponding log-likelihood function to increase.
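The chain of inequalities behind that claim (a standard argument, spelled out here) is:

```latex
\ln p(V \mid \theta^{new})
  = \mathcal{L}(q, \theta^{new}) + \mathrm{KL}\!\left(q \,\|\, p(Z \mid V, \theta^{new})\right)
  \;\ge\; \mathcal{L}(q, \theta^{new})
  \;\ge\; \mathcal{L}(q, \theta^{old})
  = \ln p(V \mid \theta^{old})
```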
Since the distribution q is held fixed during the M-step, it will in general not equal the new posterior distribution p(Z|V,θ^{new}), and hence there will be a non-zero KL divergence. So, we repeat the E and M steps again until convergence (step 4 above).
Here, I'll try to summarize the discussion with the help of the following figure, which should help in connecting the dots. The red curve depicts the incomplete-data log-likelihood, ln p(V|θ), which we want to maximize.
- We start from some initial parameter value θ^{old}.
- In the first E-step, we evaluate the posterior distribution of the latent variables, p(Z|V,θ^{old}), which gives rise to a lower bound L(q,θ^{old}) whose value equals the log-likelihood at θ^{old}, as shown by the blue curve.
- In the M-step, the bound is maximized, which gives the value θ^{new} and a larger value of the log-likelihood L(θ) than at θ^{old}.
- The subsequent E-step then constructs a new bound that touches the log-likelihood at θ^{new}, as shown by the green curve.

At each step, we see that the obtained parameters increase the log-likelihood, and the process continues until convergence.
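The monotone behaviour depicted by the figure can also be checked numerically. The sketch below reuses the two-coin setup from earlier (same hypothetical data and equal-prior assumption) and verifies that the observed-data log-likelihood never decreases across EM iterations.

```python
import numpy as np

# Track ln p(V | theta) across EM iterations for the two-coin setup; EM
# guarantees this sequence is non-decreasing.
heads = np.array([5, 9, 8, 4, 7])
n = 10
theta_A, theta_B = 0.6, 0.5

def log_likelihood(theta_A, theta_B):
    # ln p(V | theta) with an equal prior over the two coins; the binomial
    # coefficient is omitted since it only shifts the value by a constant.
    like_A = theta_A ** heads * (1 - theta_A) ** (n - heads)
    like_B = theta_B ** heads * (1 - theta_B) ** (n - heads)
    return np.sum(np.log(0.5 * like_A + 0.5 * like_B))

history = [log_likelihood(theta_A, theta_B)]
for _ in range(20):
    # E-step: responsibilities; M-step: weighted head frequencies.
    like_A = theta_A ** heads * (1 - theta_A) ** (n - heads)
    like_B = theta_B ** heads * (1 - theta_B) ** (n - heads)
    resp_A = like_A / (like_A + like_B)
    resp_B = 1.0 - resp_A
    theta_A = np.sum(resp_A * heads) / np.sum(resp_A * n)
    theta_B = np.sum(resp_B * heads) / np.sum(resp_B * n)
    history.append(log_likelihood(theta_A, theta_B))

assert all(b >= a - 1e-12 for a, b in zip(history, history[1:]))
print(f"log-likelihood went from {history[0]:.4f} to {history[-1]:.4f}")
```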
See here for more details about the mathematical derivation.
Do CB, et al. What is the expectation maximization algorithm?