Rocca, Joseph. Understanding Generative Adversarial Networks (GANs).
See here for the cheatsheet of GAN.
In this post, we will see that adversarial training is an enlightening idea, beautiful in its simplicity, that represents real conceptual progress for Machine Learning, and more especially for generative models (in the same way that backpropagation is a simple but really smart trick that made the founding idea of neural networks so popular and efficient). In the following parts, we will overcome the apparent magic of GANs in order to dive into the ideas, maths, and modelling behind these models.
We first recall some existing methods, and more especially the inverse transform method, which allows us to generate complex random variables from simple uniform random variables (as well as MCMC-style methods such as Metropolis-Hastings).
Computers are fundamentally deterministic, so it is, in theory, impossible to generate numbers that are truly random. However, a computer is able, using a pseudorandom number generator, to generate a sequence of numbers that approximately follows a uniform random distribution between 0 and 1.
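As an illustration, here is a minimal sketch of such a pseudorandom number generator, a linear congruential generator (LCG); the constants below are classic textbook values, chosen purely for illustration:

```python
# A minimal sketch of a linear congruential generator (LCG), one of the
# simplest pseudorandom number generators; the constants are illustrative.
def lcg(seed, n, a=1664525, c=1013904223, m=2**32):
    """Yield n pseudorandom numbers approximately uniform on [0, 1)."""
    state = seed
    for _ in range(n):
        state = (a * state + c) % m  # deterministic update rule
        yield state / m              # rescale to [0, 1)

samples = list(lcg(seed=42, n=5))
print(samples)  # same seed -> same "random" sequence: fully deterministic
```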
There exist different techniques aimed at generating more complex random variables. Among them we can find, for example, rejection sampling, the Metropolis-Hastings algorithm, and the inverse transform method.
All these methods rely on different mathematical tricks that mainly consist in representing the random variable we want to generate as the result of an operation (over simpler random variables) or a process.
Rejection sampling expresses the random variable as the result of a process that consists in sampling not from the complex distribution but from a well-known simple distribution, and accepting or rejecting the sampled value depending on some condition. Repeating this process until the sampled value is accepted, we can show that, with the right condition of acceptance, the values that are effectively sampled will follow the right distribution.
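As an illustration, here is a minimal rejection sampling sketch in Python; the Gaussian target and uniform proposal are illustrative choices, not part of the original post:

```python
import numpy as np

# Rejection sampling: draw from a simple proposal (uniform on [-5, 5]) and
# accept each draw with probability target_pdf(x) / (M * proposal_pdf).
def target_pdf(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard Gaussian

def rejection_sample(n, M=4.1):  # M must satisfy target <= M * proposal
    samples = []
    proposal_pdf = 1 / 10  # uniform density on [-5, 5]
    while len(samples) < n:
        x = np.random.uniform(-5, 5)  # draw from the proposal
        u = np.random.uniform(0, 1)   # uniform acceptance threshold
        if u < target_pdf(x) / (M * proposal_pdf):  # acceptance condition
            samples.append(x)         # accepted values follow the target
    return np.array(samples)

print(rejection_sample(5))
```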
In the Metropolis-Hastings algorithm, the idea is to find a Markov Chain (MC) whose stationary distribution corresponds to the distribution from which we would like to sample our random variable. Once this MC is found, we can simulate a long enough trajectory over it to consider that we have reached a steady state, and the last value obtained this way can then be considered as having been drawn from the distribution of interest.
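A minimal Metropolis-Hastings sketch, again with an illustrative Gaussian target and a symmetric random-walk proposal:

```python
import numpy as np

# Metropolis-Hastings with a symmetric Gaussian random-walk proposal. The
# chain's stationary distribution is the (unnormalised) target density; we
# discard early samples as burn-in so the chain can reach its steady state.
def unnormalised_target(x):
    return np.exp(-x**2 / 2)  # standard Gaussian, up to a constant

def metropolis_hastings(n_steps, burn_in=1000, step=1.0):
    x = 0.0
    chain = []
    for _ in range(n_steps):
        proposal = x + np.random.normal(0, step)  # random-walk proposal
        # Acceptance ratio (the symmetric proposal density cancels out).
        if np.random.uniform() < unnormalised_target(proposal) / unnormalised_target(x):
            x = proposal
        chain.append(x)
    return np.array(chain[burn_in:])  # keep only post-burn-in samples

samples = metropolis_hastings(10_000)
print(samples.mean(), samples.std())  # should be close to 0 and 1
```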
The idea of the inverse transform method is simply to represent our complex random variable as the result of a function applied to a uniform random variable that we know how to generate.
We consider in what follows a one dimensional example. Let X be a complex random variable we want to sample from and U be a uniform random variable over [0,1] we know how to sample from. We remind that a random variable is fully defined by its Cumulative Distribution Function (CDF). The CDF of a random variable is a function from the domain of definition of the random variable to the interval [0,1] and is defined, in one dimension, such that

$$CDF_X(x) = \mathbb{P}(X \le x)$$

In the particular case of our uniform random variable U, we have

$$CDF_U(u) = \mathbb{P}(U \le u) = u \quad \forall u \in [0,1]$$
For simplicity, we will suppose here that the function CDF_X is invertible and that its inverse is denoted

$$CDF_X^{-1}$$

(the method can easily be extended to the non-invertible case by using the generalized inverse of the function, but it is really not the main point we want to focus on here). Then if we define

$$Y = CDF_X^{-1}(U)$$

we have

$$CDF_Y(y) = \mathbb{P}(Y \le y) = \mathbb{P}(CDF_X^{-1}(U) \le y) = \mathbb{P}(U \le CDF_X(y)) = CDF_X(y)$$
As we can see, Y and X have the same CDF and therefore define the same random variable. So, by defining Y as above (as a function of a uniform random variable), we have managed to define a random variable with the targeted distribution.
To summarize, the inverse transform method is a way to generate a random variable that follows a given distribution by making a uniform random variable go through a well designed "transform function" (the inverse CDF). This notion of "inverse transform method" can, in fact, be extended to the notion of "transform method" that consists, more generally, in generating random variables as functions of some simpler random variables (not necessarily uniform, in which case the transform function is no longer the inverse CDF). Conceptually, the purpose of the "transform function" is to deform/reshape the initial probability distribution.
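As an illustration, here is a minimal sketch of the inverse transform method in Python; the exponential distribution is an illustrative choice, since its inverse CDF has a simple closed form:

```python
import numpy as np

# Inverse transform method: generating exponential random variables (rate
# lambda) from uniforms. The exponential CDF is F(x) = 1 - exp(-lambda * x),
# so its inverse is F^{-1}(u) = -ln(1 - u) / lambda.
def sample_exponential(n, lam=1.0):
    u = np.random.uniform(0, 1, size=n)  # uniform random variable on [0, 1]
    return -np.log(1 - u) / lam          # push it through the inverse CDF

samples = sample_exponential(100_000)
print(samples.mean())  # should be close to 1 / lam = 1.0
```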
The problem of generating a new image of dog is equivalent to the problem of generating a new vector following the desired probability distribution over the N dimensional vector space. So we are, in fact, facing a problem of generating a random variable with respect to a specific probability distribution.
At this point, we can mention two important things. First, the "dog probability distribution" is a very complex distribution over a very large space. Second, even if we can assume that such an underlying distribution exists, we obviously don't know how to express it explicitly. Both previous points make the process of generating random variables from this distribution pretty difficult.
How to directly generate complex random variables? As we know pretty well how to generate N uncorrelated uniform random variables, we could make use of the transform method. To do so, we need to express our N dimensional random variable as the result of a very complex function applied to a simple N dimensional random variable!
As is most often the case in such situations, a very complex function naturally implies neural network modelling. Then, the idea is to model the transform function by a neural network that:

- takes as input a simple N dimensional uniform random variable
- returns as output another N dimensional random variable that should follow the right probability distribution, after training

Once the architecture of the network has been designed (just like the different architectures of VAEs), we still need to train it.
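As an illustration, here is a minimal sketch of such a generator network; PyTorch, the dimension N, and the layer sizes are illustrative assumptions, not prescribed by the post:

```python
import torch
import torch.nn as nn

# A minimal sketch of a generator network modelling the transform function;
# the dimension N and the architecture below are illustrative choices.
N = 64  # dimension of both the input noise and the output vector

generator = nn.Sequential(
    nn.Linear(N, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, N),  # output: an N dimensional random variable
)

z = torch.rand(16, N)       # batch of N dimensional uniform random variables
x_generated = generator(z)  # after training, should follow the target distribution
print(x_generated.shape)    # torch.Size([16, 64])
```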
So far, we have shown that our problem of generating a new image of dog can be rephrased into a problem of generating a random vector in the N dimensional vector space that follows the "dog probability distribution", and we have suggested using a transform method, with a neural network to model the transform function.
Now, we still need to train (optimize) the network to express the right transform function. To this purpose, we can suggest two different training methods: a direct one, in which we compare the true and generated probability distributions and backpropagate the difference (the error) through the network, and an indirect one, described further below, that relies on a downstream task.
However, we do not know how to express explicitly the true "dog probability distribution" and we can also say that the generated distribution is far too complex to be expressed explicitly. So, comparisons based on explicit expressions are not possible. However, if we have a way to compare probability distributions based on samples, we can use it to train the network. Indeed, we have a sample of true data and we can, at each iteration of the training process, produce a sample of generated data.
Although, in theory, any distance (or similarity measure) able to compare effectively two distributions based on samples can be used, we can mention in particular the Maximum Mean Discrepancy (MMD) approach. The MMD defines a distance between two probability distributions that can be computed (estimated) based on samples of these distributions.
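As an illustration, here is a minimal (biased) empirical MMD estimator in Python; the Gaussian kernel and its bandwidth are illustrative choices:

```python
import numpy as np

# A minimal sketch of a (biased) empirical estimate of the squared MMD
# between two samples, using a Gaussian kernel with illustrative bandwidth.
def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise squared distances between rows of a and rows of b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Squared MMD between two samples x and y (rows are observations)."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean())

x = np.random.normal(0, 1, (500, 2))  # sample from a "true" distribution
y = np.random.normal(3, 1, (500, 2))  # sample from a "generated" one
print(mmd2(x, y), mmd2(x, x[:250]))   # large when distributions differ, near 0 otherwise
```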
So, once we have defined a way to compare two distributions based on samples, we can define the training process of the generative network in Generative Matching Networks (GMNs). The idea of GMNs is to optimize the network by repeating the following steps:

- generate some uniform inputs and make them go through the network to collect the generated outputs
- compare the true and the generated distributions based on the available samples (for example, compute the MMD distance between the sample of true data and the sample of generated data)
- use backpropagation to take one gradient descent step that lowers this distance
When following these steps we are applying a gradient descent over the network with a loss function that is the distance between the true and the generated distributions at the current iteration.
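Putting the pieces together, here is a minimal sketch of such a GMN-style training loop; the toy true distribution, network sizes, and hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn

# GMN-style training loop: at each iteration we generate samples, estimate
# the MMD against a batch of true data, and take one gradient step to lower
# it. All sizes and the toy true distribution are illustrative.
def mmd2(x, y, sigma=1.0):
    d2 = lambda a, b: ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    k = lambda a, b: torch.exp(-d2(a, b) / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

N = 2
generator = nn.Sequential(nn.Linear(N, 64), nn.ReLU(), nn.Linear(64, N))
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-3)

true_data = torch.randn(10_000, N) * 0.5 + 2.0  # stand-in "true" distribution

for step in range(1_000):
    x_true = true_data[torch.randint(len(true_data), (256,))]  # true sample
    z = torch.rand(256, N)      # uniform inputs
    x_gen = generator(z)        # generated sample
    loss = mmd2(x_true, x_gen)  # distance between the two samples
    optimizer.zero_grad()
    loss.backward()             # backpropagate the distance
    optimizer.step()            # one gradient descent step
```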
The brilliant idea that rules GANs consists in replacing this direct comparison by an indirect one that takes the form of a downstream task over these two distributions. The training of the generative network is then done with respect to this task such that it forces the generated distribution to get closer and closer to the true distribution.
The downstream task of GANs is a discrimination task between true and generated samples. Or we could say a "non-discrimination" task, as we want the discrimination to fail as much as possible. So, in a GAN architecture, we have a discriminator that takes samples of true and generated data and tries to classify them as well as possible. Let's see on a simple example why the direct and indirect approaches we mentioned should, in theory, lead to the same optimal generator.
In order to better understand why training a generator to fool a discriminator will lead to the same result as training directly the generator to match the target distribution, let’s take a simple one dimensional example.
Suppose that we have a true distribution, for example a one dimensional gaussian, and that we want a generator that samples from this probability distribution. What we called the "direct" training method would then consist in adjusting the generator iteratively (through gradient descent iterations) to correct the measured difference/error between the true and generated distributions. Finally, assuming the optimisation process is perfect, we should end up with a generated distribution that matches the true distribution exactly.
For the “indirect” approach, we also have to consider a discriminator. We assume for now that this discriminator is a kind of oracle that knows exactly what the true and generated distributions are and that is able, based on this information, to predict a class (“true” or “generated”) for any given point. If the two distributions are far apart, the discriminator will be able to classify easily, and with a high level of confidence, most of the points we present to it. If we want to fool the discriminator, we have to bring the generated distribution close to the true one. The discriminator will have the most difficulty predicting the class when the two distributions are equal at every point: in this case, for each point there are equal chances for it to be “true” or “generated”, and the discriminator can’t do better than being right in one case out of two on average.
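A small numerical illustration of this intuition (the Gaussian setting is an illustrative assumption): the oracle's best achievable accuracy falls towards 1/2 as the generated distribution approaches the true one:

```python
import numpy as np

# The "oracle" discriminator predicts "true" wherever p_t(x) >= p_g(x). As
# the generated Gaussian approaches the true one, its accuracy drops to 1/2.
def gaussian_pdf(x, mu, sigma=1.0):
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def oracle_accuracy(mu_gen, n=100_000):
    x_true = np.random.normal(0, 1, n)      # true distribution: N(0, 1)
    x_gen = np.random.normal(mu_gen, 1, n)  # generated distribution: N(mu_gen, 1)
    correct = (gaussian_pdf(x_true, 0) >= gaussian_pdf(x_true, mu_gen)).sum()
    correct += (gaussian_pdf(x_gen, mu_gen) > gaussian_pdf(x_gen, 0)).sum()
    return correct / (2 * n)

for mu in [4.0, 1.0, 0.1, 0.0]:
    print(mu, oracle_accuracy(mu))  # accuracy falls towards 0.5 as mu -> 0
```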
At this point, it seems legitimate to wonder whether this indirect method is really a good idea. Indeed, it seems more complicated (we have to optimise the generator based on a downstream task instead of directly based on the distributions) and it requires a discriminator that we consider here as a given oracle but that is, in reality, neither known nor perfect.
Let’s now describe the specific form that the generator and the discriminator take in the GANs architecture.
- the generator is a neural network that models a transform function: it takes a simple random variable as input and, once trained, returns a random variable that should follow the targeted distribution
- the discriminator is a neural network that takes a point (an N dimensional vector) as input and returns as output the probability of this point being a “true” one

Notice that the fact that we now impose a parametrised model to express both the generator and the discriminator (instead of the idealised versions of the previous subsection) has, in practice, not a huge impact on the theoretical argument/intuition given above.
Once defined, the two networks can then be trained jointly (at the same time) with opposite goals: the generator tries to fool the discriminator, while the discriminator tries to detect generated data.
So, at each iteration of the training process, the weights of the generative network are updated in order to increase the classification error (error gradient ascent over the generator’s parameters), whereas the weights of the discriminative network are updated so as to decrease this error (error gradient descent over the discriminator’s parameters).
These opposite goals and the implied notion of adversarial training of the two networks explain the name “adversarial networks”: both networks try to beat each other and, in doing so, they both become better and better. The competition between them makes these two networks “progress” with respect to their respective goals. From a game theory point of view, we can think of this setting as a minimax two-player game where the equilibrium state corresponds to the situation where the generator produces data from the exact targeted distribution and where the discriminator predicts “true” or “generated” with probability 1/2 for any point it receives.
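As an illustration, here is a minimal sketch of this joint adversarial training; it uses the "absolute error" of the discriminator derived later in this post (the original GAN paper uses a log-loss instead), and the toy data and hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Joint adversarial training: the discriminator descends its expected
# classification error while the generator ascends it (descends its
# negation). Objective follows the "absolute error" form derived below.
N = 2
G = nn.Sequential(nn.Linear(N, 64), nn.ReLU(), nn.Linear(64, N))
D = nn.Sequential(nn.Linear(N, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)

true_data = torch.randn(10_000, N) * 0.5 + 2.0  # stand-in "true" distribution

for step in range(1_000):
    x_true = true_data[torch.randint(len(true_data), (256,))]
    z = torch.rand(256, N)

    # Discriminator step: gradient descent on its expected error.
    err_D = (1 - D(x_true)).mean() + D(G(z).detach()).mean()
    opt_D.zero_grad(); err_D.backward(); opt_D.step()

    # Generator step: gradient ascent on the same error, i.e. descent on its
    # negation (only the generated term depends on G).
    err_G = -D(G(z)).mean()
    opt_G.zero_grad(); err_G.backward(); opt_G.step()
```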
Neural networks modelling essentially requires defining two things: an architecture and a loss function. We have already described the architecture of Generative Adversarial Networks. It consists in two networks:
- a generative network G(.) that takes a random input z with density p_z and returns an output x_g = G(z) that should follow (after training) the targeted probability distribution
- a discriminative network D(.) that takes an input x that can be a “true” one (x_t, whose density is denoted p_t) or a “generated” one (x_g, whose density p_g is the density induced by the density p_z going through G) and that returns the probability D(x) of x being a “true” data point

Let’s now take a closer look at the “theoretical” loss function of GANs. If we send the discriminator “true” and “generated” data in the same proportions, the expected absolute error of the discriminator can then be expressed as

$$E(G, D) = \frac{1}{2}\,\mathbb{E}_{x \sim p_t}\big[1 - D(x)\big] + \frac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[D(G(z))\big] = \frac{1}{2}\Big(\mathbb{E}_{x \sim p_t}\big[1 - D(x)\big] + \mathbb{E}_{x \sim p_g}\big[D(x)\big]\Big)$$
The goal of the generator is to fool the discriminator, whose goal is to be able to distinguish between true and generated data. So, when training the generator, we want to maximise this error while we try to minimise it for the discriminator. It gives us

$$\max_G \Big( \min_D \, E(G, D) \Big)$$
For any given generator G (along with the induced probability density p_g), the best possible discriminator is the one that minimises

$$\mathbb{E}_{x \sim p_t}\big[1 - D(x)\big] + \mathbb{E}_{x \sim p_g}\big[D(x)\big] = \int_x \Big( \big(1 - D(x)\big)\,p_t(x) + D(x)\,p_g(x) \Big)\,dx$$

In order to minimise (with respect to D) this integral, we can minimise the function inside the integral for each value of x. It then defines the best possible discriminator for a given generator:

$$D^*_G(x) = \begin{cases} 1 & \text{if } p_t(x) \ge p_g(x) \\ 0 & \text{otherwise} \end{cases}$$
(in fact, one of the best, because x values such that p_t(x) = p_g(x) could be handled in another way, but it doesn’t matter for what follows). We then search for the G that maximises

$$\min_D E(G, D) = \frac{1}{2}\int_x \min\big(p_t(x), p_g(x)\big)\,dx$$
Again, in order to maximise (with respect to G) this integral, we can maximise the function inside the integral for each value of x. As the density p_t is independent of the generator G, we can’t do better than setting G such that

$$p_g(x) \ge p_t(x) \quad \text{for all } x$$

so that the integrand equals p_t(x) everywhere.
Of course, as p_g is a probability density that should integrate to 1, we necessarily have, for the best G,

$$p_g(x) = p_t(x) \quad \text{for all } x$$
So, we have shown that, in an ideal case with a generator and a discriminator of unlimited capacity, the optimal point of the adversarial setting is such that the generator produces the same density as the true density and the discriminator can’t do better than being right in one case out of two, just like the intuition told us. Finally, notice also that G maximises

$$\mathbb{E}_{z \sim p_z}\big[D(G(z))\big]$$
In this form, we see better that G wants to maximise the expected probability of the discriminator being wrong.
The main takeaways of this article are:

- computers can basically generate simple pseudo-random variables (for example, variables that approximately follow a uniform distribution)
- there exist different ways to generate more complex random variables, including the notion of "transform method" that expresses a random variable as a function of some simpler random variable(s)
- in machine learning, generative models try to generate data from a given (complicated) probability distribution
- deep learning generative models model the transform function with a neural network that takes a simple random variable as input and returns a random variable that should follow the targeted distribution
- GANs train this generative network indirectly, through a discriminative network that tries to classify true and generated data: the generator is trained to fool the discriminator, and at equilibrium the generated distribution matches the true one
Goodfellow I., et al. Generative Adversarial Nets. NIPS 2014.