NorbertZheng / read-papers

My paper reading notes.

Ph.D. '17 | Variational Inference and Deep Learning: A New Synthesis. #21

Open NorbertZheng opened 2 years ago

NorbertZheng commented 2 years ago

Kingma, Diederik P. Variational Inference and Deep Learning: A New Synthesis. Ph.D. thesis, University of Amsterdam, 2017.

NorbertZheng commented 2 years ago

Related Reference

NorbertZheng commented 2 years ago

Introduction

Relying on huge amounts of data, well-designed network architectures, and smart training techniques, deep generative models have shown an incredible ability to produce highly realistic pieces of content of various kinds, such as images, text, and sounds. Among these deep generative models, two major families stand out and deserve special attention: Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs).

In a nutshell, a VAE is an autoencoder whose encoding distribution is regularized during training in order to ensure that its latent space has good properties allowing us to generate new data. Moreover, the term "variational" comes from the close relation between this regularization and the variational inference method in statistics. The purpose of this post is not only to discuss the fundamental notions Variational Autoencoders rely on, but also to build, step by step and starting from the very beginning, the reasoning that leads to these notions.

NorbertZheng commented 2 years ago

Dimensionality reduction, PCA and autoencoders

What is dimensionality reduction?

In machine learning, dimensionality reduction is the process of reducing the number of features that describe some data. This reduction is done either by selecting only some of the existing features (feature selection) or by extracting a reduced number of new features from the old ones (feature extraction).

This reduction can be useful in many situations that require low dimensional data (data visualisation, data storage, heavy computation…). Although there exist many different methods of dimensionality reduction, we can set a global framework that is matched by most (if not all!) of these methods.

First, let’s call e the encoder, which maps a data point x from the initial space to the encoded (latent) space, and d the decoder, which maps an encoded representation back to the initial space; we denote by E and D the families of candidate encoders and decoders.

The main purpose of a dimensionality reduction method is to find the best encoder/decoder pair among a given family, i.e. to look for the pair, from the given sets of possible encoders and decoders, that minimizes the reconstruction error:

$$(e^*, d^*) = \underset{(e,d)\,\in\, E\times D}{\arg\min}\ \epsilon\big(x, d(e(x))\big)$$

where $\epsilon\big(x, d(e(x))\big)$ defines the reconstruction error measure between the input data $x$ and the encoded-decoded data $d(e(x))$.

Principal components analysis (PCA)

The idea of PCA is to build $n_e$ new independent features that are linear combinations of the $n_d$ old features, i.e. to look for the best linear subspace of the initial space (described by an orthogonal basis of new features), such that the projections of the data onto this subspace are as close as possible to the initial data (in terms of Euclidean distance).

Translated into our global framework, we are looking for an encoder in the family E of $n_e \times n_d$ matrices (linear transformations) whose rows are orthonormal (feature independence), and for the associated decoder among the family D of $n_d \times n_e$ matrices. It can be shown that the optimal subspace is the one spanned by the $n_e$ eigenvectors of the data covariance matrix with the largest eigenvalues (the principal components): the encoder is the projection onto this subspace and the decoder is its transpose.
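As a concrete illustration (a minimal numpy sketch on an assumed toy dataset, not something taken from the thesis), the PCA encoder/decoder pair can be built from the top eigenvectors of the covariance matrix:

```python
import numpy as np

# Toy data: N points in dimension n_d (assumed example).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))          # shape (N, n_d)
n_e = 3                                   # target latent dimension

# Center the data and take the n_e eigenvectors with largest eigenvalues.
X_c = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_c, rowvar=False))  # ascending order
W = eigvecs[:, np.argsort(eigvals)[::-1][:n_e]]               # (n_d, n_e), orthonormal columns

encode = lambda x: x @ W        # e(x): projection onto the principal subspace
decode = lambda z: z @ W.T      # d(z): mapping back to the initial space

error = np.mean(np.sum((X_c - decode(encode(X_c))) ** 2, axis=1))
print("mean reconstruction error:", error)
```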

Autoencoders

The general idea of autoencoders is pretty simple and consists in setting an encoder and a decoder as neural networks and learning the best encoding-decoding scheme through an iterative optimization process: at each iteration, we compare the encoded-decoded output with the initial data and backpropagate the reconstruction error through the architecture to update the weights.

Thus, intuitively, the overall autoencoder architecture (encoder + decoder) creates a bottleneck for data that ensures only the main structured part of the information can go through and be reconstructed.

Let’s first suppose that both our encoder and decoder architectures have only one layer without non-linearity (linear autoencoder). Such an encoder and decoder are then simple linear transformations that can be expressed as matrices. In such a situation, we can see a clear link with PCA in the sense that, just like PCA does, we are looking for the best linear subspace to project data onto with as little information loss as possible. The encoding and decoding matrices obtained with PCA naturally define one of the solutions we would be satisfied to reach by gradient descent, but we should point out that it is not the only one. Indeed, several bases can be chosen to describe the same optimal subspace and, so, several encoder/decoder pairs can give the optimal reconstruction error.

Now, let’s assume that both the encoder and the decoder are deep and non-linear. In such a case, the more complex the architecture is, the more the autoencoder can achieve a strong dimensionality reduction while keeping the reconstruction loss low. Intuitively, if our encoder and our decoder have enough degrees of freedom, we can reduce any initial dimensionality to 1. Indeed, an encoder with “infinite power” could theoretically take our N initial data points and encode them as 1, 2, 3, … up to N (or, more generally, as N integers on the real axis), and the associated decoder could make the reverse transformation, with no loss during the process.
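For concreteness, a minimal deep (non-linear) autoencoder of this kind could look like the following PyTorch sketch; the layer sizes and the bottleneck dimension are arbitrary assumptions, not values from the thesis:

```python
import torch
import torch.nn as nn

latent_dim = 2                      # size of the bottleneck (assumed)

# Non-linear encoder and decoder; the bottleneck forces compression.
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784))

x = torch.randn(32, 784)                            # dummy batch standing in for real data
x_hat = decoder(encoder(x))                         # encode then decode
loss = ((x - x_hat) ** 2).sum(dim=1).mean()         # reconstruction error to minimize
loss.backward()                                     # gradients for an optimizer step
```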

Here, we should however keep two things in mind. First, an important dimensionality reduction with no reconstruction loss often comes at a price: the lack of interpretable and exploitable structure in the latent space (lack of regularity). Second, most of the time the purpose of dimensionality reduction is not only to reduce the number of dimensions of the data but to do so while keeping the major part of the data structure information in the reduced representations.

For these two reasons, the dimension of the latent space and the “depth” of autoencoders (which define the degree and quality of compression) have to be carefully controlled and adjusted depending on the final purpose of the dimensionality reduction.

NorbertZheng commented 2 years ago

Variational Autoencoders

Limitations of autoencoders for content generation

If the latent space is regular enough (well “organized” by the encoder during the training process), we could take a point randomly from that latent space and decode it to get new content. The decoder would then act more or less like the generator of a Generative Adversarial Network.

However, as we discussed in the previous section, the regularity of the latent space of autoencoders is a difficult point that depends on the distribution of the data in the initial space, the dimension of the latent space, and the architecture of the encoder. So, it is pretty difficult (if not impossible) to ensure, a priori, that the encoder will organize the latent space in a smart way compatible with the generative process we just described.

To illustrate this point, let’s consider the example we gave previously, in which we described an encoder and a decoder powerful enough to put any N initial training data points onto the real axis (each data point being encoded as a real value) and decode them without any reconstruction loss. In such a case, the high degree of freedom of the autoencoder that makes it possible to encode and decode with no information loss (despite the low dimensionality of the latent space) leads to severe overfitting, implying that some points of the latent space will give meaningless content once decoded. While this one-dimensional example has been voluntarily chosen to be quite extreme, the problem of the regularity of the autoencoder’s latent space is much more general than that and deserves special attention.

When thinking about it for a minute, this lack of structure among the encoded data in the latent space is pretty normal. Indeed, nothing in the task the autoencoder is trained for enforces such an organization: the autoencoder is solely trained to encode and decode with as little loss as possible, no matter how the latent space is organized. Thus, if we are not careful about the definition of the architecture, it is natural that, during training, the network takes advantage of any overfitting possibility to achieve its task as well as it can… unless we explicitly regularize it!

NorbertZheng commented 2 years ago

Definition of variational autoencoders

So, in order to be able to use the decoder of our autoencoder for generative purposes, we have to be sure that the latent space is regular enough. One possible solution to obtain such regularity is to introduce explicit regularization during the training process. A variational autoencoder can thus be defined as an autoencoder whose training is regularized to avoid overfitting and to ensure that the latent space has good properties enabling the generative process.

Just like a standard autoencoder, a variational autoencoder is an architecture composed of both an encoder and a decoder that is trained to minimize the reconstruction error between the encoded-decoded data and the initial data. However, in order to introduce some regularization of the latent space, we proceed to a slight modification of the encoding-decoding process: instead of encoding an input as a single point, we encode it as a distribution over the latent space. The model is then trained as follows: first, the input is encoded as a distribution over the latent space; second, a point from the latent space is sampled from that distribution; third, the sampled point is decoded and the reconstruction error is computed; finally, the reconstruction error is backpropagated through the network. The loss function that is minimized is thus composed of a reconstruction term (which tends to make the encoding-decoding scheme as performant as possible) and a regularization term (which tends to regularize the organization of the latent space by making the distributions returned by the encoder close to a standard normal distribution).

That regularization term is expressed as the Kullback-Leibler divergence between the returned distribution and a standard Gaussian. We can notice that the Kullback-Leibler divergence between two Gaussian distributions has a closed form that can be directly expressed in terms of the means and the covariance matrices of the two distributions. Denoting by $\mu_x$ and $\sigma_x$ the mean and covariance returned by the encoder for an input $x$, the loss reads

$$\text{loss} = \|x - \hat{x}\|^2 + \mathrm{KL}\big(N(\mu_x, \sigma_x)\,\|\,N(0, I)\big)$$
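For the usual diagonal-Gaussian encoder output, that closed form reduces to a simple sum over latent dimensions. A small sketch (assuming the encoder returns a mean `mu` and a log-variance `log_var`, a common parameterization rather than the thesis' exact notation):

```python
import torch

def kl_to_standard_normal(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the latent dimensions."""
    return 0.5 * torch.sum(log_var.exp() + mu ** 2 - 1.0 - log_var, dim=-1)
```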

NorbertZheng commented 2 years ago

Intuitions about the regularisation

The regularity that is expected from the latent space in order to make the generative process possible can be expressed through two main properties: continuity (two close points in the latent space should not give two completely different contents once decoded) and completeness (for a chosen distribution, a point sampled from the latent space should give "meaningful" content once decoded).

The mere fact that VAEs encode inputs as distributions instead of simple points is not sufficient to ensure continuity and completeness. Without a well-defined regularization term, the model can learn, in order to minimize its reconstruction error, to "ignore" the fact that distributions are returned and behave almost like a classic autoencoder (leading to overfitting). To do so, the encoder can either return distributions with tiny variances (which would then be almost punctual distributions) or return distributions whose means are very far apart from each other in the latent space. In both cases, distributions are used the wrong way (cancelling the expected benefit) and continuity and/or completeness are not satisfied.

So, in order to avoid these effects, we have to regularize both the covariance matrix and the mean of the distributions returned by the encoder. In practice, this regularization is done by enforcing distributions to be close to a standard normal distribution (centred and reduced). This way, we require the covariance matrices to be close to the identity, preventing punctual distributions, and the means to be close to 0, preventing encoded distributions from being too far apart from each other.

With this regularization term, we prevent the model from encoding data far apart in the latent space and encourage the returned distributions to "overlap" as much as possible, satisfying in this way the expected continuity and completeness conditions. Naturally, as for any regularization term, this comes at the price of a higher reconstruction error on the training data. The trade-off between the reconstruction error and the KL divergence can however be adjusted, and we will see in the next section how the expression of this balance naturally emerges from our formal derivation.

To conclude, we can observe that the continuity and completeness obtained with regularization tend to create a "gradient" over the information encoded in the latent space. For example, a point of the latent space that lies halfway between the means of two encoded distributions coming from different training data should be decoded into something that is somewhere between the data that gave the first distribution and the data that gave the second distribution, as it may be sampled by the autoencoder in both cases.

This reminds me of a continuous attractor neural network (CANN): there is a continuous set of attractors in the latent space, which corresponds to the gradient over the information encoded in the latent space, satisfying the continuity property of VAEs.
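To make this "gradient" concrete, one can decode points along the segment joining the means of two encoded distributions; below is a hypothetical sketch, assuming a trained `decoder` network and two latent means `mu_a` and `mu_b` returned by the encoder:

```python
import torch

def interpolate(decoder, mu_a: torch.Tensor, mu_b: torch.Tensor, steps: int = 8):
    """Decode evenly spaced points between two latent means."""
    outputs = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * mu_a + t * mu_b   # halfway (and elsewhere) between the two means
        outputs.append(decoder(z))
    return torch.stack(outputs)           # should vary smoothly if the latent space is regular
```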

Note. As a side note, we can mention that the second potential problem we mentioned (the network putting distributions far from each other) is in fact almost equivalent to the first one (the network tending to return punctual distributions) up to a change of scale: in both cases, the variances of the distributions become small relative to the distance between their means.

NorbertZheng commented 2 years ago

Mathematical details of VAEs

In the previous section we gave the following intuitive overview: VAEs are autoencoders that encode inputs as distributions instead of points, and whose latent space "organization" is regularized by constraining the distributions returned by the encoder to be close to a standard Gaussian. In this section we give a more mathematical view of VAEs. To do so, we first set up a probabilistic framework: we denote by x the variable that represents our data and assume that x is generated from a latent variable z (the encoded representation) that is not directly observed; for each data point, z is first sampled from a prior distribution p(z), and then x is sampled from the conditional distribution p(x|z).

With such a probabilistic model in mind, we can redefine our notions of encoder and decoder. Indeed, contrary to a simple autoencoder that considers a deterministic encoder and decoder, we are now going to consider probabilistic versions of these two objects: the "probabilistic decoder" is naturally defined by p(x|z), which describes the distribution of the decoded variable given the encoded one, whereas the "probabilistic encoder" is defined by p(z|x), which describes the distribution of the encoded variable given the decoded one.

At this point, we can already notice that the regularization of the latent space that we lacked in simple autoencoders naturally appears here in the definition of the data generation process: the encoded representations z in the latent space are indeed assumed to follow the prior distribution p(z). We can also recall the well-known Bayes theorem that makes the link between the prior p(z), the likelihood p(x|z), and the posterior p(z|x):

$$p(z|x) = \frac{p(x|z)\,p(z)}{p(x)} = \frac{p(x|z)\,p(z)}{\int p(x|u)\,p(u)\,du}$$

Let's now make the following assumptions (which act as a regularization): p(z) is a standard Gaussian distribution, and p(x|z) is a Gaussian distribution whose mean is defined by a deterministic function f of the variable z and whose covariance matrix is a positive constant c multiplying the identity matrix I. The function f is assumed to belong to a family of functions denoted F that is left unspecified for the moment and that will be chosen later. Thus, we have

$$p(z) = N(0, I), \qquad p(x|z) = N\big(f(z),\, c\,I\big), \qquad f \in F,\ c > 0.$$
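Under these assumptions, sampling from the generative model is straightforward; here is a minimal sketch in which an untrained linear layer stands in for the unknown function f, and the dimensions are arbitrary assumptions:

```python
import torch

latent_dim, data_dim, c = 2, 784, 1.0            # assumed sizes and variance constant
f = torch.nn.Linear(latent_dim, data_dim)        # placeholder for some f in F

z = torch.randn(latent_dim)                      # z ~ p(z) = N(0, I)
x = f(z) + (c ** 0.5) * torch.randn(data_dim)    # x ~ p(x|z) = N(f(z), c I)
```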

Let's consider, for now, that f is well defined and fixed. In theory, since we know p(z) and p(x|z), we can use Bayes theorem to compute p(z|x): this is a classical Bayesian inference problem. However, as we discussed in our previous article, this kind of computation is often intractable (because of the integral in the denominator) and requires the use of approximation techniques such as variational inference.

Note. Here we can mention that p(z) and p(x|z) are both Gaussian distributions. So, if we had E(x|z) = f(z) = z, it would imply that p(z|x) should also follow a Gaussian distribution and, in theory, we could "only" try to express the mean and the covariance matrix of p(z|x) with respect to the means and the covariance matrices of p(z) and p(x|z). However, in practice this condition (i.e. E(x|z) = f(z) = z) is not met, and we need to use an approximation technique like variational inference, which makes the approach pretty general and more robust to changes in the hypotheses of the model (the generative model).

NorbertZheng commented 2 years ago

Variational inference formulation

In statistics, variational inference (VI) is a technique to approximate complex distributions. The idea is to set a parameterized family of distributions (for example the family of Gaussians, whose parameters are the mean and the covariance) and to look for the best approximation of our target distribution within this family. The best element in the family is the one that minimizes a given approximation error measurement (most of the time the Kullback-Leibler divergence between the approximation and the target) and is found by gradient descent over the parameters that describe the family. Please refer to Variational Inference for more details.

Here we are going to approximate p(z|x) by a Gaussian distribution q_x(z) whose mean and covariance are defined by two functions, g and h, of the parameter x. These two functions are supposed to belong, respectively, to families of functions G and H that will be specified later but that are supposed to be parameterized. Thus we can denote

$$q_x(z) \equiv N\big(g(x),\, h(x)\big), \qquad g \in G,\ h \in H.$$

So, we have defined in this way a family of candidates for variational inference, and we now need to find the best approximation among this family by optimizing the functions g and h (in fact, their parameters) to minimize the Kullback-Leibler divergence between the approximation and the target p(z|x). In other words, we are looking for the optimal g* and h* such that

$$(g^*, h^*) = \underset{(g,h)\,\in\, G\times H}{\arg\min}\ \mathrm{KL}\big(q_x(z)\,\|\,p(z|x)\big) = \underset{(g,h)\,\in\, G\times H}{\arg\max}\ \Big( \mathbb{E}_{z\sim q_x}\big[\log p(x|z)\big] - \mathrm{KL}\big(q_x(z)\,\|\,p(z)\big) \Big)$$

In the last expression, we can observe the trade-off there exists, when approximating the posterior p(z|x), between maximizing the likelihood of the "observations" (maximization of the expected log-likelihood, the first term) and staying close to the prior distribution (minimization of the KL divergence between q_x(z) and p(z), the second term). This trade-off is natural for Bayesian inference problems and expresses the balance that needs to be found between the confidence we have in the data and the confidence we have in the prior.

Up to now, we have assumed the function f known and fixed, and we have shown that, under such assumptions, we can approximate the posterior p(z|x) using the variational inference technique. However, in practice this function f, which defines the decoder, is not known and also needs to be chosen. To do so, let's remember that our initial goal is to find a performant encoding-decoding scheme whose latent space is regular enough to be used for generative purposes. If the regularity is mostly ruled by the prior distribution assumed over the latent space, the performance of the overall encoding-decoding scheme highly depends on the choice of the function f. Indeed, as p(z|x) can be approximated (by variational inference) from p(z) and p(x|z), and as p(z) is a simple standard Gaussian, the only two levers we have at our disposal in our model to make optimizations are the constant c (which defines the variance of the likelihood) and the function f (which defines its mean).

So, let's consider that, as we discussed earlier, we can get for any function f in F (each defining a different probabilistic decoder p(x|z)) the best approximation of p(z|x), denoted q*_x(z). Despite its probabilistic nature, we are looking for an encoding-decoding scheme that is as efficient as possible, and we therefore want to choose the function f that maximizes the expected log-likelihood of x given z when z is sampled from q*_x(z). In other words, for a given input x, we want to maximize the probability of having x̂ = x when we sample z from the distribution q*_x(z) and then sample x̂ from the distribution p(x|z). Thus, we are looking for the optimal f* such that

$$f^* = \underset{f\,\in\, F}{\arg\max}\ \mathbb{E}_{z\sim q^*_x}\big[\log p(x|z)\big] = \underset{f\,\in\, F}{\arg\max}\ \mathbb{E}_{z\sim q^*_x}\Big[-\frac{\|x - f(z)\|^2}{2c}\Big]$$

where q*_x(z) depends on the function f and is obtained as described before. Gathering all the pieces together, we are looking for the optimal f*, g* and h* such that

$$(f^*, g^*, h^*) = \underset{(f,g,h)\,\in\, F\times G\times H}{\arg\max}\ \Big( \mathbb{E}_{z\sim q_x}\Big[-\frac{\|x - f(z)\|^2}{2c}\Big] - \mathrm{KL}\big(q_x(z)\,\|\,p(z)\big) \Big)$$

We can identify in this objective function the elements introduced in the intuitive description of VAEs given in the previous section: the reconstruction error between x and f(z) (the first term) and the regularization term given by the KL divergence between q_x(z) and the prior p(z) (the second term). We can also notice the constant c that rules the balance between these two terms: the higher c is, the more we assume a high variance around f(z) for the probabilistic decoder in our model and, so, the more we favour the regularization term over the reconstruction term (and the opposite stands if c is low).

NorbertZheng commented 2 years ago

Bringing neural networks into the model

Up to now, we have set up a probabilistic model that depends on three functions, f, g and h, and have expressed, using variational inference, the optimization problem to solve in order to get the f*, g* and h* that give the optimal encoding-decoding scheme for this model. As we can't easily optimize over the entire space of functions, we constrain the optimization domain and decide to express f, g and h as neural networks. Thus, F, G and H correspond respectively to the families of functions defined by the network architectures, and the optimization is done over the parameters of these networks.

In practice, g and h are not defined by two completely independent networks but share a part of their architecture and their weights, so that we have

$$g(x) = g_2\big(g_1(x)\big), \qquad h(x) = h_2\big(h_1(x)\big), \qquad g_1 = h_1$$

As it defines the covariance matrix of q_x(z), h(x) is in principle a square matrix. However, in order to simplify the computation and reduce the number of parameters, we make the additional assumption that our approximation of p(z|x), q_x(z), is a multidimensional Gaussian distribution with a diagonal covariance matrix (a variable-independence assumption). With this assumption, h(x) is simply the vector (just like TEM #16 does) of the diagonal elements of the covariance matrix, and it then has the same size as g(x). However, in this way we reduce the family of distributions we consider for variational inference, and so the approximation of p(z|x) obtained can be less accurate.
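A minimal sketch of such an encoder in PyTorch (sizes are assumed): the shared trunk plays the role of the common part of g and h, and two heads output the mean and the log of the diagonal covariance:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps x to the mean and (log) diagonal variance of q_x(z)."""
    def __init__(self, data_dim: int = 784, latent_dim: int = 2):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU())  # shared layers
        self.mean_head = nn.Linear(256, latent_dim)       # outputs g(x)
        self.log_var_head = nn.Linear(256, latent_dim)    # outputs log of h(x)'s diagonal

    def forward(self, x):
        shared = self.trunk(x)
        return self.mean_head(shared), self.log_var_head(shared)
```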

Contrary to the encoder part, which models p(z|x) and for which we considered a Gaussian with both mean and covariance being functions of x (g and h), our model assumes for p(x|z) a Gaussian with fixed covariance. The function f of the variable z that defines the mean of that Gaussian is modelled by a neural network.

The overall architecture is then obtained by concatenating the encoder and the decoder parts. However, we still need to be very careful about the way we sample from the distribution returned by the encoder during training: the sampling process has to be expressed in a way that allows the error to be backpropagated through the network. A simple trick, called the reparameterization trick, is used to make gradient descent possible despite the random sampling that occurs halfway through the architecture. It consists in using the fact that if z is a random variable following a Gaussian distribution with mean g(x) and covariance H(x) = h(x).h^t(x), then it can be expressed as

$$z = h(x)\,\zeta + g(x), \qquad \zeta \sim N(0, I)$$
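In code, the trick amounts to drawing the noise separately and shifting/scaling it with the encoder outputs, so gradients flow through g(x) and h(x) (a sketch using the log-variance parameterization assumed above):

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """z = g(x) + h(x) * zeta with zeta ~ N(0, I); differentiable w.r.t. mu and log_var."""
    zeta = torch.randn_like(mu)                  # randomness, independent of the network parameters
    return mu + torch.exp(0.5 * log_var) * zeta
```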

Finally, the objective function of the variational autoencoder architecture obtained this way is given by the last equation of the previous subsection, in which the theoretical expectation is replaced by a more or less accurate Monte-Carlo approximation that consists, most of the time, of a single draw. So, considering this approximation and denoting C = 1/(2c), we recover the loss function derived intuitively in the previous section, composed of a reconstruction term, a regularization term, and a constant that defines the relative weights of these two terms:

$$\text{loss} = C\,\|x - \hat{x}\|^2 + \mathrm{KL}\big(N(\mu_x, \sigma_x)\,\|\,N(0, I)\big) = C\,\|x - f(z)\|^2 + \mathrm{KL}\big(N(g(x), h(x))\,\|\,N(0, I)\big)$$
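Putting the pieces together, one training step could look like the following sketch; the `encoder` and `decoder` networks, the optimizer, and the weight `C` are assumptions standing in for a concrete implementation, and the expectation is approximated with a single draw as described above:

```python
import torch

def vae_training_step(x, encoder, decoder, optimizer, C: float = 1.0):
    mu, log_var = encoder(x)                                      # parameters of q_x(z)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)      # one-sample Monte-Carlo draw
    x_hat = decoder(z)                                            # f(z), the mean of p(x|z)
    recon = torch.sum((x - x_hat) ** 2, dim=-1)                   # reconstruction term
    kl = 0.5 * torch.sum(log_var.exp() + mu ** 2 - 1.0 - log_var, dim=-1)  # regularization term
    loss = (C * recon + kl).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```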

NorbertZheng commented 2 years ago

Takeaways

The main takeaways of this article are:

- dimensionality reduction is the process of reducing the number of features that describe some data, and it can be seen as an encoding process;
- autoencoders are encoder-decoder architectures that create a bottleneck for the data and are trained by gradient descent to lose as little information as possible during the encoding-decoding process (reconstruction error);
- due to overfitting, the latent space of an autoencoder can be extremely irregular, so we cannot simply sample a point from the latent space and decode it to generate new content;
- variational autoencoders (VAEs) tackle this irregularity by making the encoder return a distribution over the latent space instead of a single point and by adding, to the reconstruction loss, a regularization term (KL divergence to a standard Gaussian) over the returned distributions;
- assuming a simple probabilistic model underlying our data, the VAE loss function, composed of a reconstruction term and a regularization term, can be carefully derived using variational inference.

To conclude, we can note that, during the last years, GANs have benefited from many more scientific contributions than VAEs. Among other reasons, the higher interest the community has shown for GANs can be partly explained by the higher degree of complexity of the theoretical basis of VAEs (probabilistic modelling and variational inference) compared to the simplicity of the adversarial training concept that rules GANs.