NorbertZheng / read-papers

My paper reading notes.

Nat Biot '08 | What is the expectation maximization algorithm? #24

Closed NorbertZheng closed 2 years ago

NorbertZheng commented 2 years ago

Do CB, et al. What is the expectation maximization algorithm?

NorbertZheng commented 2 years ago

Reference

- Misra Rishabh. Inference using EM algorithm.
- Nasrabadi NM. Pattern recognition and machine learning.
- 史博. What is the significance of the EM algorithm? (EM算法存在的意义是什么?)

NorbertZheng commented 2 years ago

Introduction

The Expectation-Maximization (EM) algorithm is a powerful tool in statistical analysis. It is powerful in the sense that it can deal with missing data and unobserved features, use-cases that come up frequently in many real-world applications.

NorbertZheng commented 2 years ago

The case of observed features

First, we should understand how parameters are estimated in simple use-cases. Thus, we first consider the case where all features are fully observed.

Suppose we have a dataset in which each data point consists of a d-dimensional feature vector X and an associated target variable Y∈{0,1}. A graphical representation of this model is depicted below: image

As we know, the prediction probability of the target variable in logistic regression is given by a sigmoid function like the following: image
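The referenced image is not reproduced here; the standard logistic-regression form it presumably shows is

$$P(Y=1 \mid X=x) = \sigma(w \cdot x) = \frac{1}{1 + e^{-w \cdot x}}, \qquad P(Y=0 \mid X=x) = 1 - \sigma(w \cdot x) = \sigma(-w \cdot x).$$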

where w is the weight vector to be estimated. Once we have estimated the parameters w, we can predict the label of an unseen data point depending on whether Y=0 or Y=1 gets more probability. We use the Maximum Likelihood approach to estimate w, i.e. we find the parameters w that maximize the likelihood (or probability) of the observed data, P(data).
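In symbols (my notation, since the corresponding image is missing), the maximum-likelihood objective is

$$\hat{w}_{\mathrm{MLE}} = \arg\max_{w} P(\text{data} \mid w) = \arg\max_{w} \prod_{i=1}^{N} P(Y = y_i \mid X = x_i; w).$$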

For mathematical convenience, we'll maximize the log of the likelihood. With w parameterizing the joint distribution p(x,y), the log-likelihood can be written as: image
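A reconstruction of the missing equation, assuming the notation above:

$$L(w) = \ln P(\text{data} \mid w) = \ln \prod_{i=1}^{N} P(Y = y_i \mid X = x_i; w).$$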

Here the likelihood of the observed data is written as the product of the likelihoods of the individual data points, under the assumption that the data samples are independent and identically distributed (i.i.d.). Then we get: image
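Again reconstructing the missing image under the same assumptions:

$$L(w) = \sum_{i=1}^{N} \ln P(Y = y_i \mid X = x_i; w) = \sum_{i=1}^{N} \ln \sigma\big((2 y_i - 1)\, w \cdot x_i\big),$$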

where we used the fact that P(Y=yi|X=xi) evaluates to σ(w·xi) when yi=1, and to σ(-w·xi) otherwise.

At this point, we should note that all the features are assumed to be observed, which is not achievable in many realistic scenarios.

Since L(w) is a non-linear function of w, there is no closed-form solution. Thus, we have to use an iterative optimization method like Gradient Ascent to find w. An update of the gradient ascent method would look like: image We repeat this update until convergence, and the w obtained at the end is called the maximum likelihood estimate (MLE).
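The update rule in the missing image is most likely w ← w + η ∇_w L(w), with ∇_w L(w) = Σ_i (y_i − σ(w·x_i)) x_i. As a concrete illustration (not from the note), here is a minimal NumPy sketch of this procedure; the function name, learning rate, and iteration count are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_mle(X, y, learning_rate=0.1, n_steps=1000):
    """Gradient ascent on the logistic-regression log-likelihood L(w).

    X : (N, d) array of observed feature vectors.
    y : (N,) array of binary targets in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        # dL/dw = sum_i (y_i - sigma(w . x_i)) * x_i
        grad = X.T @ (y - sigmoid(X @ w))
        w = w + learning_rate * grad
    return w  # maximum likelihood estimate (MLE) of w
```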

NorbertZheng commented 2 years ago

The case of latent features

In most realistic scenarios, it is common to have latent (unobserved) features or missing data.

Having latent features makes a difference in the estimation of parameters: estimating the model parameters gets a little tricky when latent features (or missing data) are involved.

The issue

Let V be the set of observed variables, Z be the set of latent variables and θ be the set of model parameters. If we consider the maximum likelihood approach for the parameter estimation, our objective function will be: image
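The missing objective is presumably the incomplete-data log-likelihood,

$$\ln p(V \mid \theta) = \ln \sum_{Z} p(V, Z \mid \theta).$$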

We can see that the parameters are coupled because of the summation inside the log. This makes optimization using Gradient Ascent (or any iterative optimization technique in general) intractable. This means that many realistic scenarios need a more powerful technique to infer the parameters.

NorbertZheng commented 2 years ago

EM Algorithm to the Rescue

The EM algorithm uses the fact that optimizing the complete-data log-likelihood ln P(V,Z|θ) is much easier when we know the value of Z (since this removes the summation from inside the log).

However, since the only knowledge we have about Z comes from the posterior P(Z|V,θ), we instead consider the expected value of the complete-data log-likelihood under the posterior distribution of the latent variables. The step of finding this expectation is called the E-step. In the subsequent M-step, we maximize this expectation with respect to θ.

Formally, the EM algorithm can be written as:
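The image with the formal statement is missing; following the presentation in Pattern Recognition and Machine Learning, the four steps referred to below are roughly:

1. Initialize the parameters θ^{old}.
2. E-step: evaluate the posterior p(Z|V,θ^{old}).
3. M-step: evaluate θ^{new} = argmax_θ Q(θ,θ^{old}), where Q(θ,θ^{old}) = Σ_Z p(Z|V,θ^{old}) ln p(V,Z|θ).
4. Check for convergence of either the log-likelihood or the parameter values; if not converged, set θ^{old} ← θ^{new} and return to step 2.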

NorbertZheng commented 2 years ago

Diving Deeper into EM

Before diving deeper, we will first derive a property that will come in handy when explaining the E and M steps.

Let us consider a distribution q(Z) over the latent variables. Independent of the choice of q(Z), we can decompose the likelihood of the observed data in the following fashion: image
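Reconstructing the decomposition from the missing image (assuming the usual PRML notation):

$$\ln p(V \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p),$$

where

$$\mathcal{L}(q, \theta) = \sum_{Z} q(Z) \ln \frac{p(V, Z \mid \theta)}{q(Z)}, \qquad \mathrm{KL}(q \,\|\, p) = -\sum_{Z} q(Z) \ln \frac{p(Z \mid V, \theta)}{q(Z)}.$$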

One of the properties of KL divergence is that it's always non-negative. Using this property, we can deduce that image
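That is,

$$\mathrm{KL}(q \,\|\, p) \geq 0 \;\Longrightarrow\; \mathcal{L}(q, \theta) \leq \ln p(V \mid \theta).$$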

This means that L(q,θ) (note that this is not L(θ)) acts as a lower bound on the log-likelihood of the observed data. This observation will shortly help in demonstrating that the EM algorithm does indeed maximize the log-likelihood L(θ).

E-step

Suppose the current value of the parameter vector θ is θ^{old} (step 1). Keeping the above decomposition in mind, the E-step tries to maximize the lower bound L(q,θ^{old}) of L(θ) with respect to q while holding θ^{old} fixed.

The lower bound is maximized when the KL divergence vanishes, i.e. when q(Z) equals the posterior. Thus, the E-step involves evaluating p(Z|V,θ^{old}) (step 2). image
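Written out (my reconstruction of the missing image):

$$q(Z) = p(Z \mid V, \theta^{old}) \;\Longrightarrow\; \mathrm{KL}(q \,\|\, p) = 0 \;\Longrightarrow\; \mathcal{L}(q, \theta^{old}) = \ln p(V \mid \theta^{old}).$$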

M-step

In this step, the distribution q(Z) is held fixed. If we substitute q(Z)=p(Z|V,θ^{old}) from the E-step into the expression for L(q,θ), we see that the lower bound takes the following form: image
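Assuming the usual PRML form, the reconstructed expression is

$$\mathcal{L}(q, \theta) = \sum_{Z} p(Z \mid V, \theta^{old}) \ln p(V, Z \mid \theta) - \sum_{Z} p(Z \mid V, \theta^{old}) \ln p(Z \mid V, \theta^{old}) = Q(\theta, \theta^{old}) + \text{const},$$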

where the constant is the negative entropy of the q distribution and is therefore independent of θ.

So, in the M-step, we maximize the lower bound L(q,θ) with respect to θ to obtain some new value θ^{new}. This will cause the lower bound L(q,θ) to increase (unless it is already at a maximum), which will necessarily cause the corresponding log-likelihood function to increase (step 3). image
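The missing image presumably shows the resulting chain of inequalities:

$$\ln p(V \mid \theta^{new}) \geq \mathcal{L}(q, \theta^{new}) \geq \mathcal{L}(q, \theta^{old}) = \ln p(V \mid \theta^{old}).$$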

Since the distribution q is held fixed during the M-step, it will not equal the new posterior distribution p(Z|V,θ^{new}), and hence there will generally be a non-zero KL divergence. So, we repeat the E and M steps until convergence (step 4).

Putting it all together

Here, I'll summarize the discussion with the help of the following figure, which should help in connecting the dots. image

The red curve depicts the incomplete-data log-likelihood, ln p(V|θ), which we want to maximize.

At each step, we see that the obtained parameters increase the log-likelihood and the process continues until convergence.
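To make the loop concrete, here is a minimal, self-contained sketch of EM for a two-component 1-D Gaussian mixture (an illustration of the general recipe above, not an example from the note); all function and variable names are my own.

```python
import numpy as np

def em_gmm_1d(x, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture with latent component labels Z.

    x : (N,) array of observed values V.
    Returns the mixing weight, means, and variances (the parameters theta).
    """
    # Step 1: initialize theta_old.
    pi = 0.5
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step (step 2): posterior p(Z | V, theta_old), a.k.a. responsibilities.
        weights = np.array([1.0 - pi, pi])
        lik = weights * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = lik / lik.sum(axis=1, keepdims=True)  # shape (N, 2)

        # M-step (step 3): maximize the expected complete-data log-likelihood.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk[1] / len(x)
        # Step 4: in practice, stop when the log-likelihood change falls below a tolerance.
    return pi, mu, var

# Usage on synthetic data drawn from two Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(x))
```

Each pass increases (or leaves unchanged) the incomplete-data log-likelihood, exactly as the figure above illustrates.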

NorbertZheng commented 2 years ago

Conclusion

See here for more details about the mathematical derivation. image