Relying on huge amounts of data, well-designed network architectures and smart training techniques, deep generative models have shown an incredible ability to produce highly realistic pieces of content of various kinds, such as images, texts and sounds. Among these deep generative models, two major families stand out and deserve special attention: Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). In a nutshell, a VAE is an autoencoder whose encoding distribution is regularised during training in order to ensure that its latent space has good properties that allow us to generate new data. Moreover, the term "variational" comes from the close relation between this regularisation and the variational inference method in statistics. The purpose of this post is not only to discuss the fundamental notions Variational Autoencoders rely on, but also to build, step by step and starting from the very beginning, the reasoning that leads to these notions.
In machine learning, dimensionality reduction is the process of reducing the number of features that describe some data. This reduction is done either by selection (only some existing features are kept) or by extraction (a reduced number of new features are created from the old ones).
This reduction can be useful in many situations that require low dimensional data (data visualisation, data storage, heavy computation…). Although there exist many different methods of dimensionality reduction, we can set a global framework that is matched by most (if not all!) of these methods.
First, let's call the encoder the process that produces the "new features" representation from the "old features" representation (by selection or extraction), and the decoder the reverse process. Dimensionality reduction can then be interpreted as data compression, where the encoder compresses the data (from the initial space to the encoded space, also called the latent space) and the decoder decompresses it.
The main purpose of a dimensionality reduction method is to find the best encoder/decoder pair among a given family, i.e. to look, among a given set E of possible encoders and a given set D of possible decoders, for the pair (e*, d*) that minimises the reconstruction error:

(e*, d*) = argmin_{(e,d) in ExD} ε(x, d(e(x)))

where ε(x, d(e(x))) defines the reconstruction error measure between the input data x and the encoded-decoded data d(e(x)).
The idea of PCA is to build n_e new independent features that are linear combinations of the n_d old features (in other words, to look for the best linear subspace of the initial space, described by an orthogonal basis of new features) such that the projections of the data onto this subspace are as close as possible to the initial data (in terms of Euclidean distance).

Translated into our global framework, we are looking for an encoder in the family E of n_e by n_d matrices (linear transformations) whose rows are orthonormal (feature independence) and for the associated decoder among the family D of n_d by n_e matrices.
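To make this concrete, here is a minimal numpy sketch (not from the original post) of the PCA encoder/decoder pair described above; the toy data and the names (`X`, `n_e`, `components`) are illustrative choices of ours.

```python
import numpy as np

# Toy data: 1000 points with n_d = 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X_centered = X - X.mean(axis=0)

n_e = 2  # dimension of the latent space

# PCA via SVD: the rows of `components` form an orthonormal basis of the best linear subspace.
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:n_e]                    # encoder: an n_e by n_d matrix with orthonormal rows

encode = lambda x: x @ components.T      # e(x): project onto the subspace
decode = lambda z: z @ components        # d(z): map back to the initial space

reconstruction_error = np.mean((X_centered - decode(encode(X_centered))) ** 2)
print(reconstruction_error)
```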
The general idea of autoencoders is pretty simple and consists in setting an encoder and a decoder as neural networks and learning the best encoding-decoding scheme through an iterative optimisation process.
Thus, intuitively, the overall autoencoder architecture (encoder+decoder) creates a bottleneck for data that ensures only the main structured part of the information can go through and be reconstructed.
Let's first suppose that both our encoder and decoder architectures have only one layer without non-linearity (linear autoencoder). Such an encoder and decoder are then simple linear transformations that can be expressed as matrices. In such a situation, we can see a clear link with PCA in the sense that, just like PCA does, we are looking for the best linear subspace to project data on with as little information loss as possible when doing so. The encoding and decoding matrices obtained with PCA naturally define one of the solutions we would be satisfied to reach by gradient descent, but we should outline that this is not the only one. Indeed, several bases can be chosen to describe the same optimal subspace and, so, several encoder/decoder pairs can give the optimal reconstruction error.
Now, let's assume that both the encoder and the decoder are deep and non-linear. In such a case, the more complex the architecture is, the higher the dimensionality reduction the autoencoder can achieve while keeping the reconstruction loss low. Intuitively, if our encoder and our decoder have enough degrees of freedom, we can reduce any initial dimensionality to 1. Indeed, an encoder with "infinite power" could theoretically take our N initial data points and encode them as 1, 2, 3, … up to N (or, more generally, as N integers on the real axis) and the associated decoder could make the reverse transformation, with no loss during the process.
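To make the bottleneck idea concrete, here is a minimal PyTorch sketch of a deep, non-linear autoencoder trained on the reconstruction error; the architecture, layer sizes and names (`Autoencoder`, `latent_dim`) are illustrative choices of ours, not prescribed by the post.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        # Encoder: compresses the input down to the latent (bottleneck) representation.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: reconstructs the input from the latent representation.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)                      # dummy batch standing in for real data
loss = ((model(x) - x) ** 2).mean()          # reconstruction error
loss.backward()
optimizer.step()
```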
Here, we should however keep two things in mind. First, an important dimensionality reduction with no reconstruction loss often comes at a price: the lack of interpretable and exploitable structure in the latent space (lack of regularity). Second, most of the time the final purpose of dimensionality reduction is not only to reduce the number of dimensions of the data but to do so while keeping the major part of the data structure information in the reduced representations.
For these two reasons, the dimension of the latent space and the "depth" of autoencoders (which define the degree and quality of the compression) have to be carefully controlled and adjusted depending on the final purpose of the dimensionality reduction.
If the latent space is regular enough (well "organised" by the encoder during the training process), we could take a point randomly from that latent space and decode it to get new content. The decoder would then act more or less like the generator of a Generative Adversarial Network. However, as we discussed in the previous section, the regularity of the latent space of an autoencoder is a difficult point that depends on the distribution of the data in the initial space, the dimension of the latent space and the architecture of the encoder. So, it is pretty difficult (if not impossible) to ensure, a priori, that the encoder will organise the latent space in a smart way compatible with the generative process we just described.

To illustrate this point, let's consider the example we gave previously, in which we described an encoder and a decoder powerful enough to put any N initial training data points onto the real axis (each data point being encoded as a real value) and decode them without any reconstruction loss. In such a case, the high degree of freedom of the autoencoder that makes it possible to encode and decode with no information loss (despite the low dimensionality of the latent space) leads to severe overfitting, implying that some points of the latent space will give meaningless content once decoded. While this one-dimensional example has been deliberately chosen to be quite extreme, the problem of the regularity of the autoencoder's latent space is much more general than that and deserves special attention.

When thinking about it for a minute, this lack of structure among the encoded data in the latent space is pretty normal. Indeed, nothing in the task the autoencoder is trained for enforces such an organisation: the autoencoder is solely trained to encode and decode with as little loss as possible, no matter how the latent space is organised. Thus, if we are not careful about the definition of the architecture, it is natural that, during training, the network takes advantage of any overfitting possibility to achieve its task as well as it can… unless we explicitly regularise it!
So, in order to be able to use the decoder of our autoencoder for generative purposes, we have to be sure that the latent space is regular enough. One possible solution to obtain such regularity is to introduce explicit regularisation during the training process. A variational autoencoder can thus be defined as an autoencoder whose training is regularised to avoid overfitting and to ensure that the latent space has good properties that enable a generative process. Just as a standard autoencoder, a variational autoencoder is an architecture composed of both an encoder and a decoder that is trained to minimise the reconstruction error between the encoded-decoded data and the initial data. However, in order to introduce some regularisation of the latent space, we proceed to a slight modification of the encoding-decoding process: instead of encoding an input as a single point, we encode it as a distribution over the latent space. The model is then trained as follows: first, the input is encoded as a distribution over the latent space; second, a point of the latent space is sampled from that distribution; third, the sampled point is decoded and the reconstruction error is computed; finally, the reconstruction error is backpropagated through the network. The loss that is minimised is therefore composed of a reconstruction term (which tends to make the encoding-decoding scheme as performant as possible) and a regularisation term (which tends to organise the latent space by making the distributions returned by the encoder close to a standard normal distribution).
That regularisation term is expressed as the Kullback-Leibler divergence between the returned distribution and a standard Gaussian. We can notice that the Kullback-Leibler divergence between two Gaussian distributions has a closed form that can be directly expressed in terms of the means and the covariance matrices of the two distributions.
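For reference, here is a minimal sketch of that closed form in the diagonal-covariance case that VAEs typically use (the function name and arguments are ours): KL(N(mu, diag(sigma^2)) || N(0, I)) = 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2).

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I), summed over dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# Example: a 2-dimensional encoded distribution.
print(kl_to_standard_normal(np.array([0.5, -0.3]), np.array([0.0, -1.0])))
```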
The regularity that is expected from the latent space in order to make the generative process possible can be expressed through two main properties: continuity (two close points in the latent space should not give two completely different contents once decoded) and completeness (for a chosen distribution, a point sampled from the latent space should give "meaningful" content once decoded).
The mere fact that VAEs encode inputs as distributions instead of simple points is not sufficient to ensure continuity and completeness. Without a well-defined regularisation term, the model can learn, in order to minimise its reconstruction error, to "ignore" the fact that distributions are returned and behave almost like a classic autoencoder (leading to overfitting). To do so, the encoder can either return distributions with tiny variances (that would then be almost punctual distributions) or return distributions with very different means (that would then be very far apart from each other in the latent space).
In both cases, distributions are used the wrong way (cancelling the expected benefit) and continuity and/or completeness are not satisfied.
So, in order to avoid these effects we have to regularise both the covariance matrix and the mean of the distributions returned by the encoder. In practice, this regularisation is done by enforcing the distributions to be close to a standard normal distribution (centred and reduced). This way, we require the covariance matrices to be close to the identity, preventing punctual distributions, and the means to be close to 0, preventing the encoded distributions from being too far apart from each other.
With this regularisation term, we prevent the model from encoding data far apart in the latent space and encourage the returned distributions to "overlap" as much as possible, satisfying in this way the expected continuity and completeness conditions. Naturally, as for any regularisation term, this comes at the price of a higher reconstruction error on the training data. The trade-off between the reconstruction error and the KL divergence can however be adjusted, and we will see in the next section how the expression of this balance naturally emerges from our formal derivation.
To conclude, we can observe that the continuity and completeness obtained with regularisation tend to create a "gradient" over the information encoded in the latent space. For example, a point of the latent space that lies halfway between the means of two encoded distributions coming from different training data should be decoded into something that is somewhere between the data that gave the first distribution and the data that gave the second distribution, as it may be sampled by the autoencoder in both cases.
This reminds me of a continuous attractor neural network (CANN): there is a continuous set of attractors in the latent space, which corresponds to the gradient over the information encoded in the latent space, satisfying the continuity property of VAEs.
Note. As a side note, we can mention that the second potential problem we have mentioned (the network putting distributions far from each other) is in fact almost equivalent to the first one (the network tending to return punctual distributions) up to a change of scale: in both cases the variances of the distributions become small relative to the distance between their means.
In the previous section we gave the following intuitive overview: VAEs are autoencoders that encode inputs as distributions instead of points, and whose latent space "organisation" is regularised by constraining the distributions returned by the encoder to be close to a standard Gaussian. In this section we give a more mathematical view of VAEs that allows us to justify the regularisation term more rigorously.
Let's begin by defining a probabilistic graphical model to describe our data. We denote by x the variable that represents our data and assume that x is generated from a latent variable z (the encoded representation) that is not directly observed. Thus, for each data point, the following two-step generative process is assumed: first, z is sampled from the prior distribution p(z); then, x is sampled from the conditional likelihood distribution p(x|z).

With such a probabilistic model in mind, we can redefine our notions of encoder and decoder. Indeed, contrarily to a simple autoencoder that considers a deterministic encoder and decoder, we are now going to consider probabilistic versions of these two objects. The "probabilistic decoder" is naturally defined by p(x|z), which describes the distribution of the decoded variable given the encoded one, whereas the "probabilistic encoder" is defined by p(z|x), which describes the distribution of the encoded variable given the decoded one.

At this point, we can already notice that the regularisation of the latent space that we lacked in simple autoencoders naturally appears here in the definition of the data generation process: encoded representations z in the latent space are indeed assumed to follow the prior distribution p(z). We can also recall the well-known Bayes theorem that makes the link between the prior p(z), the likelihood p(x|z), and the posterior p(z|x):

p(z|x) = p(x|z) p(z) / p(x) = p(x|z) p(z) / ∫ p(x|u) p(u) du
Let's now make the following assumptions (which act as a regularisation): p(z) is a standard Gaussian distribution, and p(x|z) is a Gaussian distribution whose mean is defined by a deterministic function f of the variable z and whose covariance matrix is a positive constant c that multiplies the identity matrix I. The function f is assumed to belong to a family of functions denoted F that is left unspecified for the moment and that will be chosen later. Thus, we have

p(z) = N(0, I)
p(x|z) = N(f(z), cI),  f in F,  c > 0
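Under these assumptions, the generative process itself is easy to write down. Below is a tiny numpy sketch (ours), in which `f` is an arbitrary toy function standing in for the unknown decoding function and `c` is the assumed likelihood variance.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim, c = 2, 4, 0.1

# Toy stand-in for the (unknown) decoding function f: z -> mean of p(x|z).
W = rng.normal(size=(latent_dim, data_dim))
f = lambda z: np.tanh(z @ W)

# Two-step generative process: z ~ p(z) = N(0, I), then x ~ p(x|z) = N(f(z), c I).
z = rng.normal(size=latent_dim)
x = f(z) + np.sqrt(c) * rng.normal(size=data_dim)
print(z, x)
```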
Let's consider, for now, that f is well defined and fixed. In theory, as we know p(z) and p(x|z), we can use the Bayes theorem to compute p(z|x): this is a classical Bayesian inference problem. However, as we discussed in our previous article, this kind of computation is often intractable (because of the integral in the denominator) and requires the use of approximation techniques such as variational inference.
Note. Here we can mention that p(z) and p(x|z) are both Gaussian distributions. So, if we had E(x|z)=f(z)=z, it would imply that p(z|x) should also follow a Gaussian distribution and, in theory, we could "only" try to express the mean and the covariance matrix of p(z|x) with respect to the means and the covariance matrices of p(z) and p(x|z). However, in practice this condition (i.e. E(x|z)=f(z)=z) is not met and we need to use an approximation technique such as variational inference, which makes the approach pretty general and more robust to changes in the hypotheses of the (generative) model.
In statistics, variational inference (VI) is a technique to approximate complex distributions. The idea is to set a parameterised family of distributions (for example the family of Gaussians, whose parameters are the mean and the covariance) and to look for the best approximation of our target distribution within this family. The best element in the family is the one that minimises a given approximation error measurement (most of the time the Kullback-Leibler divergence between the approximation and the target) and is found by gradient descent over the parameters that describe the family. Please refer to Variational Inference for more details.
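Purely as an illustration of the idea (not part of the original post), here is a small PyTorch sketch that approximates a toy 1D target distribution with a parameterised Gaussian by gradient descent on a Monte Carlo estimate of the Kullback-Leibler divergence; the target and all names are our own choices.

```python
import torch

def log_target(z):
    # Unnormalised log-density of the target: a 1D Gaussian N(2.0, 0.5^2) (a toy choice of ours).
    return -0.5 * ((z - 2.0) / 0.5) ** 2

# Parameterised family: 1D Gaussians N(mu, exp(log_sigma)^2).
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.Adam([mu, log_sigma], lr=1e-2)

for step in range(2000):
    eps = torch.randn(256)                  # reparameterised samples z = mu + sigma * eps
    z = mu + torch.exp(log_sigma) * eps
    log_q = -0.5 * eps ** 2 - log_sigma     # log q(z) up to an additive constant
    loss = (log_q - log_target(z)).mean()   # Monte Carlo estimate of KL(q || target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(mu.item(), torch.exp(log_sigma).item())  # should approach 2.0 and 0.5
```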
Here we are going to approximate p(z|x) by a Gaussian distribution q_x(z) whose mean and covariance are defined by two functions, g and h, of the parameter x. These two functions are supposed to belong, respectively, to the families of functions G and H that will be specified later but that are supposed to be parameterised. Thus we can denote

q_x(z) = N(g(x), h(x)),  g in G,  h in H
So, we have defined in this way a family of candidates for variational inference and now need to find the best approximation within this family by optimising the functions g and h (in fact, their parameters) to minimise the Kullback-Leibler divergence between the approximation and the target p(z|x). In other words, we are looking for the optimal g* and h* such that

(g*, h*) = argmin_{(g,h) in GxH} KL(q_x(z), p(z|x))
         = argmax_{(g,h) in GxH} ( E_{z~q_x(z)}[log p(x|z)] - KL(q_x(z), p(z)) )
         = argmax_{(g,h) in GxH} ( E_{z~q_x(z)}[ -||x - f(z)||^2 / (2c) ] - KL(q_x(z), p(z)) )
In the second to last equation, we can observe the trade-off that exists, when approximating the posterior p(z|x), between maximising the likelihood of the "observations" (maximisation of the expected log-likelihood, for the first term) and staying close to the prior distribution (minimisation of the KL divergence between q_x(z) and p(z), for the second term). This trade-off is natural for a Bayesian inference problem and expresses the balance that needs to be found between the confidence we have in the data and the confidence we have in the prior.
Up to now, we have assumed the function f known and fixed, and we have shown that, under such assumptions, we can approximate the posterior p(z|x) using a variational inference technique. However, in practice this function f, which defines the decoder, is not known and also needs to be chosen. To do so, let's remember that our initial goal is to find a performant encoding-decoding scheme whose latent space is regular enough to be used for generative purposes. If the regularity is mostly ruled by the prior distribution assumed over the latent space, the performance of the overall encoding-decoding scheme highly depends on the choice of the function f. Indeed, as p(z|x) can be approximated (by variational inference) from p(z) and p(x|z), and as p(z) is a simple standard Gaussian, the only two levers we have at our disposal in our model to make optimisations are the constant c, which defines the variance of the likelihood, and the function f, which defines the mean of the likelihood.

So, let's consider that, as we discussed earlier, we can get for any function f in F (each defining a different probabilistic decoder p(x|z)) the best approximation of p(z|x), denoted q*_x(z). Despite its probabilistic nature, we are looking for an encoding-decoding scheme that is as efficient as possible and, therefore, we want to choose the function f that maximises the expected log-likelihood of x given z when z is sampled from q*_x(z). In other words, for a given input x, we want to maximise the probability of having x^ = x when we sample z from the distribution q*_x(z) and then sample x^ from the distribution p(x|z). Thus, we are looking for the optimal f* such that

f* = argmax_{f in F} E_{z~q*_x(z)}[log p(x|z)]
   = argmax_{f in F} E_{z~q*_x(z)}[ -||x - f(z)||^2 / (2c) ]
where q*_x(z) depends on the function f and is obtained as described before. Gathering all the pieces together, we are looking for the optimal f*, g* and h* such that

(f*, g*, h*) = argmax_{(f,g,h) in FxGxH} ( E_{z~q_x(z)}[ -||x - f(z)||^2 / (2c) ] - KL(q_x(z), p(z)) )
We can identify in this objective function the elements introduced in the intuitive description of VAEs given in the previous section: the reconstruction error between x and f(z), and the regularisation term given by the KL divergence between q_x(z) and p(z) (which is a standard Gaussian). We can also notice the constant c that rules the balance between these two terms. The higher c is, the more we assume a high variance around f(z) for the probabilistic decoder in our model and, so, the more we favour the regularisation term over the reconstruction term (and the opposite stands if c is low).
Up to now, we have set a probabilistic model that depends on three functions, f, g and h, and expressed, using variational inference, the optimisation problem to solve in order to get f*, g* and h* that give the optimal encoding-decoding scheme with this model. As we can't easily optimise over the entire space of functions, we constrain the optimisation domain and decide to express f, g and h as neural networks. Thus, F, G and H correspond respectively to the families of functions defined by the network architectures, and the optimisation is done over the parameters of these networks.
In practice, g and h are not defined by two completely independent networks but share a part of their architecture and their weights, so that we have

g(x) = g_2(g_1(x)),  h(x) = h_2(h_1(x)),  with g_1 = h_1
As it defines the covariance matrix of q_x(z), h(x) is supposed to be a square matrix. However, in order to simplify the computation and reduce the number of parameters, we make the additional assumption that our approximation of p(z|x), q_x(z), is a multidimensional Gaussian distribution with a diagonal covariance matrix (variable independence assumption). With this assumption, h(x) is simply the vector (just like TEM #16 does) of the diagonal elements of the covariance matrix, and it then has the same size as g(x). However, in this way we reduce the family of distributions we consider for variational inference and, so, the approximation of p(z|x) obtained can be less accurate.
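As an illustration, here is a minimal PyTorch sketch (ours, with illustrative layer sizes and names) of such a probabilistic encoder: a shared trunk g_1 = h_1 followed by two heads returning the mean g(x) and the log of the diagonal variance h(x).

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        # Shared part of the architecture (g_1 = h_1).
        self.shared = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        # Two heads: one returns the mean g(x), the other the log of the diagonal variance h(x).
        self.mean_head = nn.Linear(256, latent_dim)       # g(x)
        self.log_var_head = nn.Linear(256, latent_dim)    # log of h(x)'s diagonal

    def forward(self, x):
        hidden = self.shared(x)
        return self.mean_head(hidden), self.log_var_head(hidden)
```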
Contrarily to the encoder part, which models p(z|x) and for which we considered a Gaussian with both mean and covariance given by functions of x (g and h), our model assumes for p(x|z) a Gaussian with fixed covariance. The function f of the variable z that defines the mean of that Gaussian is modelled by another neural network, as sketched below.
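A minimal PyTorch sketch of this decoder (layer sizes and names are ours): it returns f(z), the mean of p(x|z), while the covariance stays fixed at cI.

```python
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, latent_dim=16, output_dim=784):
        super().__init__()
        # f: maps a latent point z to the mean of p(x|z); the covariance is the fixed cI.
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, output_dim),
        )

    def forward(self, z):
        return self.net(z)   # f(z)
```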
The overall architecture is then obtained by concatenating the encoder and the decoder parts. However, we still need to be very careful about the way we sample from the distribution returned by the encoder during training. The sampling process has to be expressed in a way that allows the error to be backpropagated through the network. A simple trick, called the reparameterisation trick, is used to make gradient descent possible despite the random sampling that occurs halfway through the architecture: it relies on the fact that, if z is a random variable following a Gaussian distribution with mean g(x) and covariance H(x) = h(x).h^t(x), then it can be expressed as

z = h(x) ζ + g(x),  with ζ ~ N(0, I)
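In code, with the diagonal-covariance encoder sketched above (and using a log-variance convention, which is a choice of ours), the trick looks like this minimal sketch:

```python
import torch

def reparameterize(mean, log_var):
    # z = sigma * zeta + mean, with zeta ~ N(0, I): the randomness is moved into zeta,
    # so gradients can flow through mean and log_var.
    zeta = torch.randn_like(mean)
    return mean + torch.exp(0.5 * log_var) * zeta
```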
Finally, the objective function of the variational autoencoder architecture obtained this way is given by the last equation of the previous subsection, in which the theoretical expectation is replaced by a more or less accurate Monte Carlo approximation that consists, most of the time, of a single draw. So, considering this approximation and denoting C = 1/(2c), we recover the loss function derived intuitively in the previous section, composed of a reconstruction term, a regularisation term and a constant that defines the relative weights of these two terms.
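Putting the pieces together, here is a minimal sketch (ours) of the resulting loss and one training step, with a single Monte Carlo draw and a weighting constant `C` standing in for 1/(2c); it reuses the hypothetical `Encoder`, `Decoder` and `reparameterize` sketched above.

```python
import torch

encoder, decoder = Encoder(), Decoder()
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
C = 1.0                                    # plays the role of 1/(2c)

x = torch.rand(32, 784)                    # dummy batch standing in for real data
mean, log_var = encoder(x)                 # q_x(z) = N(g(x), diag(exp(log_var)))
z = reparameterize(mean, log_var)          # single Monte Carlo draw
x_hat = decoder(z)                         # f(z)

reconstruction = C * ((x - x_hat) ** 2).sum(dim=1)
kl = 0.5 * (torch.exp(log_var) + mean ** 2 - 1.0 - log_var).sum(dim=1)
loss = (reconstruction + kl).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

A larger `C` corresponds to a smaller assumed variance c and so puts more weight on the reconstruction term, while a smaller `C` favours the regularisation term, consistent with the discussion above.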
The main takeaways of this article are: first, dimensionality reduction can be framed as learning an encoder/decoder pair that minimises a reconstruction error, with PCA and autoencoders as two instances of this framework; second, plain autoencoders give no guarantee about the regularity of their latent space, which makes them ill-suited for generating new content; third, variational autoencoders address this by encoding inputs as distributions and regularising those distributions towards a standard Gaussian with a KL divergence term; and finally, the resulting loss function, intuited in the first part, can be derived rigorously from a probabilistic model and variational inference.
To conclude, we can outline that, over the last few years, GANs have benefited from many more scientific contributions than VAEs. Among other reasons, the higher interest shown by the community for GANs can be partly explained by the higher degree of complexity of the theoretical basis of VAEs (probabilistic model and variational inference) compared to the simplicity of the adversarial training concept that rules GANs.