References:
- Misra Rishabh. Inference using EM algorithm.
- Nasrabadi NM. Pattern recognition and machine learning.
- 史博. What is the significance of the EM algorithm? (EM算法存在的意义是什么?)
The Expectation-Maximization (EM) algorithm is a powerful algorithm in statistical analysis: it can deal with missing data and unobserved (latent) features, use-cases that come up frequently in many real-world applications.
First, we should understand how parameters are estimated in simple use-cases. Thus, we first consider the case where all variables are observed, using logistic regression as an example.
Suppose we have a dataset where each data point consists of a d-dimensional feature vector X and an associated target variable Y ∈ {0,1}. A graphical representation of this model is shown below:
As we know, the prediction probability of the target variable in logistic regression is given by a sigmoid function:

P(Y=1 | X=x) = σ(w·x) = 1 / (1 + exp(−w·x)),

where w is the weight vector to be estimated. Once we estimate the parameters w, we can produce an output for an unobserved data point depending upon whether Y=0 or Y=1 gets more probability. We use the Maximum Likelihood approach to estimate w, i.e. we find the parameters w that maximize the likelihood (or probability) of the observed data, P(data).
For mathematical convenience, we'll consider maximizing the log of the likelihood. With the distribution of the data parameterized by w, the log-likelihood can be written as:

L(w) = ln P(data) = ln ∏_{i=1}^{N} P(Y=y_i | X=x_i)

(dropping the terms that do not depend on w).
Here the likelihood of the observed data is written as the product of the likelihoods of the individual data points, under the assumption that the data samples are independently and identically distributed (i.i.d.). Then we get:

L(w) = Σ_{i=1}^{N} ln P(Y=y_i | X=x_i) = Σ_{i=1}^{N} ln σ((2y_i − 1) w·x_i),

where we used the fact that P(Y=y_i | X=x_i) evaluates to σ(w·x_i) when y_i = 1, and to σ(−w·x_i) when y_i = 0.
At this point, we should note that L(w) conveniently breaks down into a per-instance form (something that is not achievable in many realistic scenarios). Since the resulting equations are non-linear in w, there is no closed-form solution, so we have to use an iterative optimization method such as Gradient Ascent to find w. An update of the gradient ascent method looks like:

w ← w + η ∇_w L(w), where ∇_w L(w) = Σ_{i=1}^{N} (y_i − σ(w·x_i)) x_i and η is the learning rate.

We repeat this update until convergence, and the w obtained at the end is called the maximum likelihood estimate (MLE).
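To make the procedure concrete, here is a minimal sketch of logistic-regression MLE via gradient ascent; the synthetic dataset, seed, learning rate, and iteration count are arbitrary choices for illustration, not part of the original discussion.

```python
import numpy as np

# Minimal sketch: maximum likelihood estimation for logistic regression
# via gradient ascent on a small synthetic dataset (illustrative only).
rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.5, -2.0, 0.5])              # "ground truth" used to simulate labels
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w):
    # L(w) = sum_i ln sigma((2 y_i - 1) w.x_i)
    return np.sum(np.log(sigmoid((2 * y - 1) * (X @ w))))

w = np.zeros(d)
eta = 0.5                                        # learning rate (arbitrary)
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ w))            # gradient of L(w)
    w = w + eta * grad / N                       # averaged gradient ascent step
print("MLE:", w, "log-likelihood:", log_likelihood(w))
```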
In most realistic scenarios, however, it is common to have latent (unobserved) variables or missing data, and having latent features makes a difference: estimating the model parameters gets a little tricky when latent features (or missing data) are involved.
Let V be the set of observed variables, Z the set of latent variables, and θ the set of model parameters. If we take the maximum likelihood approach for parameter estimation, our objective function will be:

L(θ) = ln p(V|θ) = ln Σ_Z p(V,Z|θ).

We can see that the parameters are coupled because of the summation inside the log. This makes optimization using Gradient Ascent (or any iterative optimization technique in general) intractable, which means that many realistic scenarios need a more powerful technique to infer the parameters.
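To see the coupling concretely, differentiating this objective gives the standard identity below (added here for completeness); every component of θ appears inside the posterior term, so the gradient no longer splits into the simple per-instance pieces we had for L(w) above.

```latex
\nabla_\theta \ln p(V \mid \theta)
  = \frac{\sum_{Z} \nabla_\theta\, p(V, Z \mid \theta)}{\sum_{Z} p(V, Z \mid \theta)}
  = \sum_{Z} p(Z \mid V, \theta)\, \nabla_\theta \ln p(V, Z \mid \theta)
```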
The EM algorithm uses the fact that optimizing the complete-data log-likelihood ln p(V,Z|θ) is much easier when we know the value of Z (the summation inside the log disappears). However, since the only way to "know" Z is through the posterior p(Z|V,θ), we instead consider the expected value of the complete-data log-likelihood under the posterior distribution of the latent variables. The step of finding this expectation is called the E-step. In the subsequent M-step, we maximize this expectation with respect to θ.
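Written out, the quantity constructed in the E-step and maximized in the M-step is the expected complete-data log-likelihood, usually denoted Q(θ, θ^{old}) (this follows the standard formulation in Pattern Recognition and Machine Learning, listed in the references above):

```latex
\mathcal{Q}(\theta, \theta^{old})
  = \mathbb{E}_{Z \sim p(Z \mid V, \theta^{old})}\!\left[\ln p(V, Z \mid \theta)\right]
  = \sum_{Z} p(Z \mid V, \theta^{old}) \, \ln p(V, Z \mid \theta)
```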
Formally, the EM algorithm can be written as:

1. Choose an initial setting for the parameters θ^{old}.
2. E-step: evaluate the posterior P(Z|V,θ^{old}).
3. M-step: evaluate θ^{new} given by θ^{new} = argmax_θ Q(θ,θ^{old}).
4. Check for convergence of either the log-likelihood or the parameter values; if the convergence criterion is not satisfied, set θ^{old} = θ^{new} and return to the E-step.
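As a concrete illustration, below is a minimal sketch of these four steps on the two-coin example from the Do et al. reference listed at the end: two coins with unknown biases θ_A and θ_B, five sets of ten tosses, and the identity of the coin used in each set as the latent variable Z. The head counts and initial values follow that paper's worked example; the equal prior over the two coins is an assumption of this sketch.

```python
import numpy as np

# Two-coin EM sketch (head counts and initialization as in Do et al., 2008).
heads = np.array([5, 9, 8, 4, 7])    # heads observed in each of 5 sets of 10 tosses
n = 10                               # tosses per set
theta_A, theta_B = 0.6, 0.5          # initial guesses, theta^{old}

for _ in range(20):
    # E-step: posterior probability that each set was produced by coin A,
    # assuming an equal prior over the two coins (the binomial coefficient
    # cancels in the ratio, so unnormalized likelihoods suffice).
    like_A = theta_A ** heads * (1 - theta_A) ** (n - heads)
    like_B = theta_B ** heads * (1 - theta_B) ** (n - heads)
    resp_A = like_A / (like_A + like_B)   # p(Z = A | V, theta^{old})
    resp_B = 1.0 - resp_A

    # M-step: maximize Q(theta, theta^{old}), which here reduces to
    # responsibility-weighted head frequencies.
    theta_A = np.sum(resp_A * heads) / np.sum(resp_A * n)
    theta_B = np.sum(resp_B * heads) / np.sum(resp_B * n)

print(theta_A, theta_B)   # approaches roughly 0.80 and 0.52
```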
Before diving deep, we will first derive a property that will come in handy while explaining the E and M steps. Let us consider a distribution q(Z) over the latent variables. Independently of the choice of q(Z), we can decompose the log-likelihood of the observed data in the following fashion:

ln p(V|θ) = L(q,θ) + KL(q‖p),

where

- L(q,θ) = Σ_Z q(Z) ln( p(V,Z|θ) / q(Z) ) contains the joint distribution of V and Z, i.e. p(V,Z|θ);
- KL(q‖p) = −Σ_Z q(Z) ln( p(Z|V,θ) / q(Z) ) contains the conditional distribution of Z given V, i.e. p(Z|V,θ).

One of the properties of the KL divergence is that it is always non-negative. Using this property, we can deduce that

L(q,θ) ≤ ln p(V|θ).
This means that L(q,θ) (note that this is not the same as L(θ)) acts as a lower bound on the log-likelihood of the observed data. This observation will shortly help in demonstrating that the EM algorithm does indeed maximize the log-likelihood L(θ).
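Since the decomposition holds for any q(Z), it is easy to check numerically; the toy numbers below are made up purely to verify the identity for a three-state latent variable.

```python
import numpy as np

# Numerically verify ln p(V|theta) = L(q, theta) + KL(q || p(Z|V,theta))
# for a single observation V = v and a discrete latent Z with 3 states.
p_vz = np.array([0.1, 0.3, 0.2])          # joint p(V=v, Z=z | theta), arbitrary
p_v = p_vz.sum()                          # marginal p(V=v | theta)
posterior = p_vz / p_v                    # p(Z | V=v, theta)

q = np.array([0.5, 0.25, 0.25])           # an arbitrary distribution q(Z)
lower_bound = np.sum(q * np.log(p_vz / q))        # L(q, theta)
kl = np.sum(q * np.log(q / posterior))            # KL(q || p(Z|V,theta)) >= 0

assert np.isclose(np.log(p_v), lower_bound + kl)
print(np.log(p_v), lower_bound, kl)
```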
Suppose the initial value of the parameter vector θ is θ^{old} (step 1 of the algorithm above). Keeping the above decomposition in mind, the E-step tries to maximize the lower bound L(q,θ^{old}) of L(θ) with respect to q while holding θ^{old} fixed.
- The log-likelihood ln p(V|θ), i.e. L(θ), does not depend on q(Z), so we are free to change q(Z) at will.
- The maximum of L(q,θ^{old}) will occur when the KL divergence vanishes, in other words when q(Z) is equal to the posterior distribution p(Z|V,θ^{old}).

Thus, the E-step involves evaluating p(Z|V,θ^{old}) (step 2 above).
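Substituting q(Z) = p(Z|V,θ^{old}) into the decomposition makes the bound tight, which is worth writing out (a short derivation added here for completeness):

```latex
\mathcal{L}(q, \theta^{old})
  = \sum_{Z} p(Z \mid V, \theta^{old})
      \ln \frac{p(V, Z \mid \theta^{old})}{p(Z \mid V, \theta^{old})}
  = \sum_{Z} p(Z \mid V, \theta^{old}) \ln p(V \mid \theta^{old})
  = \ln p(V \mid \theta^{old})
```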
In this step, the distribution q(Z) is held fixed. If we substitute q(Z) = p(Z|V,θ^{old}) from the E-step into the expression for L(q,θ), we see that the lower bound takes the following form:

L(q,θ) = Σ_Z p(Z|V,θ^{old}) ln p(V,Z|θ) − Σ_Z p(Z|V,θ^{old}) ln p(Z|V,θ^{old}) = Q(θ,θ^{old}) + const,

where the constant, −Σ_Z q(Z) ln q(Z), is the entropy of the q distribution and is therefore independent of θ.
So, in the M-step, we maximize the lower bound L(q,θ) with respect to θ to obtain some new value θ^{new} (step 3 above). This will cause the lower bound L(q,θ) to increase (unless it is already at a maximum), which will necessarily cause the corresponding log-likelihood function to increase.
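The chain of inequalities behind that claim (a standard argument, spelled out here) is:

```latex
\ln p(V \mid \theta^{new})
  = \mathcal{L}(q, \theta^{new}) + \mathrm{KL}\!\left(q \,\|\, p(Z \mid V, \theta^{new})\right)
  \;\ge\; \mathcal{L}(q, \theta^{new})
  \;\ge\; \mathcal{L}(q, \theta^{old})
  = \ln p(V \mid \theta^{old})
```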
Since the distribution q is held fixed during the M-step, it will in general not equal the new posterior distribution p(Z|V,θ^{new}), and hence there will be a non-zero KL divergence. So, we repeat the E and M steps again until convergence (step 4 above).
Here, I'll try to summarize the discussion with the help of the following figure, which should help in connecting the dots. The red curve depicts the incomplete-data log-likelihood, ln p(V|θ), which we want to maximize.
- We start from some initial parameter value θ^{old}.
- In the first E-step, we evaluate the posterior distribution of the latent variables, p(Z|V,θ^{old}), which gives rise to a lower bound L(q,θ^{old}) whose value equals the log-likelihood at θ^{old}, as shown by the blue curve.
- In the M-step, the bound is maximized, which gives the value θ^{new} and a larger value of the log-likelihood L(θ) than at θ^{old}.
- The subsequent E-step then constructs a new bound that touches the log-likelihood at θ^{new}, as shown by the green curve.

At each step, we see that the obtained parameters increase the log-likelihood, and the process continues until convergence.
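The monotone behaviour depicted by the figure can also be checked numerically. The sketch below reuses the two-coin setup from earlier (same hypothetical data and equal-prior assumption) and verifies that the observed-data log-likelihood never decreases across EM iterations.

```python
import numpy as np

# Track ln p(V | theta) across EM iterations for the two-coin setup; EM
# guarantees this sequence is non-decreasing.
heads = np.array([5, 9, 8, 4, 7])
n = 10
theta_A, theta_B = 0.6, 0.5

def log_likelihood(theta_A, theta_B):
    # ln p(V | theta) with an equal prior over the two coins; the binomial
    # coefficient is omitted since it only shifts the value by a constant.
    like_A = theta_A ** heads * (1 - theta_A) ** (n - heads)
    like_B = theta_B ** heads * (1 - theta_B) ** (n - heads)
    return np.sum(np.log(0.5 * like_A + 0.5 * like_B))

history = [log_likelihood(theta_A, theta_B)]
for _ in range(20):
    # E-step: responsibilities; M-step: weighted head frequencies.
    like_A = theta_A ** heads * (1 - theta_A) ** (n - heads)
    like_B = theta_B ** heads * (1 - theta_B) ** (n - heads)
    resp_A = like_A / (like_A + like_B)
    resp_B = 1.0 - resp_A
    theta_A = np.sum(resp_A * heads) / np.sum(resp_A * n)
    theta_B = np.sum(resp_B * heads) / np.sum(resp_B * n)
    history.append(log_likelihood(theta_A, theta_B))

assert all(b >= a - 1e-12 for a, b in zip(history, history[1:]))
print(f"log-likelihood went from {history[0]:.4f} to {history[-1]:.4f}")
```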
See here for more details about the mathematical derivation.
Do CB, et al. What is the expectation maximization algorithm?