NeurIPS '14 | Generative adversarial nets.

NorbertZheng / read-papers

My paper reading notes.

MIT License

8 stars 0 forks source link

NeurIPS '14 | Generative adversarial nets. #22

Closed NorbertZheng closed 2 years ago

NorbertZheng commented 2 years ago

Goodfellow I, et. al. Generative adversarial nets.

NorbertZheng commented 2 years ago

Reference

Rocca, Joseph. Understanding Generative Adversarial Networks (GANs).

Resources

See here for the cheetsheet of GAN.

NorbertZheng commented 2 years ago

Introduction

In this post, we will see that adversarial training is an enlightening idea, beautiful by its simplicity, that represents real conceptual progress for Machine Learning and more especially for generative models (in the same way as backpropagation is a simple but really smart trick that made the ground idea of neural networks became so popular and efficient). In the following parts, we will overcome the apparent magic of GANs in order to dive into ideas, maths, and modeling behind these models.

Outline

In the first following section, we will discuss the process of generating random variables from a given distribution.
In section 2, we will show, through an example, that the problems GANs try to tackle can be expressed as random variable generation problems.
In section 3 we will discuss matching-based generative networks and show how they answer problems described in section 2.
In section 4 we will introduce GANs. More especially, we will present the general architecture with its loss function and we will make the link with all the previous parts.

NorbertZheng commented 2 years ago

Generating random variables

We remind some existing methods and more especially the inverse transform method that allows to generate complex random variables from simple uniform random variables (MCMC?).

Uniform random variables can be pseudo-randomly generated

Computers are fundamentally determinisitic. So, it is, in theory, impossible to generate numbers that are really random. However, a computer is able, using a pseudorandom number generator, to generate a sequence of numbers that approximately follows a uniform random distribution between 0 and 1.

The uniform case is a very simple one upon which more complex random variables can be built in different ways.

NorbertZheng commented 2 years ago

Random variables expressed as the result of an operation or a process

There exist different techniques that are aimed at generating more complex random variables. Among them we can find,

Inverse Transform Method.
Rejection Sampling.
Metropolis-Hasting (MH) algorithm.
...

All these methods rely on different mathematical tricks that mainly consist in representing the random variable we want to generate as the result of an operation (over simpler random variables) or a process.

Rejection Sampling expresses the random variables as the result of a process that

Sample not from the complex distribution but from a well-known simple distribution.
Accept or reject the sampled value depending on some condition.

Repeating this process until the sampled value is accepted, we can show that with the right condition of acceptance the value that will be efficiently sampled will follow the right distribution.

In the Metropolis-Hasting algorithm, the idea is to find a Markov Chain (MC) such that the stationary distribution of this MC corresponds to the distribution from which we would like to sample our random variable. Once this MC found, we can simulate a long enough trajectory over this MC to consider that we have reach a steady state and then the last value we obtain this way can be considered as having been drawn from the distribution of interest.

NorbertZheng commented 2 years ago

The inverse transform method

The idea of inverse transform method is simply to represent our complex random variables as the result of a function applied to a uniform random variable we know how to generate.

We consider in what follows a one dimensional example. Let X be a complex random variable we want to sample from and U be a uniform random variable over [0,1] we know how to sample from. We remind that a random variable is fully defined by its Cumulative Distribution Function (CDF). The CDF of a random variable is a function from the domain of definition of the random variable to the interval [0,1] and defined, in one dimension, such that

In the particular case of our uniform random variable U, we have

For simplicity, we will suppose here that the function CDF_X is invertible and its inverse is denoted

(the method can easily be extended to the non-invertible case by using the generalized inverse of the function but it is really not the main point we want to focus on here). Then if we define

we have

As we can see, Y and X have the same CDF and then define the same random variable. So, by defining Y as above (as a function of a uniform random variable) we have managed to define a random variable with the targeted distribution.

To summarize, inverse transform method is a way to generate a random variable that follows a given distribution by making a uniform random variable goes through a well designed "transform function" (inverse CDF). This notion of "inverse transform method" can, infact, be extended to the notion of "transform method" that consists, more generally, in generating random variables as function of some simpler random variables (not necessarily uniform and then the transform function is no longer the inverse CDF). Conceptually, the purpose of the "transform function" is to deform/reshape the initialial probability distribution:

The transform function takes from where the initial distribution is too high compared to the targeted distribution and puts it where it is too low.

NorbertZheng commented 2 years ago

Generative models

We try to generate very complex random variables...

The problem of generating a new image of dog is equivalent to the problem of generating a new vector following the desired probability distribution over the N dimensional vector space. So we are, in fact, facing a problem of generating a random variable with respect to a specific probability distribution.

At this point, we can mention two important things.

The "dog probability distribution" we mentioned is a very complex distribution over a very large space.
Even if we can assume the existence of such underlying distribution (there actually exists images that looks like dog and others that doesn’t) we obviously don’t know how to express explicitly this distribution.

Both previous points make the process of generating random variables from this distribution pretty difficult.

...so let's use transform method with a neural network as function!

How to directly generate complex random variables? As we know pretty well how to generate N uncorrelated uniform random variables, we could make use of the transform method. To do so, we need to express our N dimensional random variable as the result of a very complex function applied to a simple N dimensional random variable!

The transform function can’t be explicitly expressed and, then, we have to learn it from data.

As most of the time in these cases, very complex function naturally implies neural network modelling. Then, the idea is to model the transform function by a neural network that

Take as input a simple N dimensional uniform random variable.
Return as output another N dimensional random variable that should follow the right probability distribution, after training.

Once the architecture of the network has been designed (just like different architectures of VAEs), we still need to train it.

NorbertZheng commented 2 years ago

Generative Matching Networks

Training generative models

So far, we have shown that our problem of generating a new image of dog can be rephrased into a problem of generating a random vector in the N dimensional vector space that follows the "dog probability distribution" and we have suggested to use a transform method, with a neural network to model the transform function.

Now, we still need to train (optimize) the network to express the right transform function. To this purpose, we can suggest two different training methods:

The direct training method consists in comparing the true and the generated probability distributions and backpropagating the difference (the error) through the network. This is the idea that rules Generative Matching Networks (GMNs).
For the indirect training method, we do not directly compare the true and generated distributions. Instead, we train the generative network by making these two distributions go through a downstream task chosen such that the optimization process of the generative network with respect to the downstream task will enforce the generated distribution to be close to the true distribution. This last ide is the one behind Generative Adversarial Networks (GANs).

NorbertZheng commented 2 years ago

Comparing two probability distributions based on samples

However, we do not know how to express explicitly the true "dog probability distribution" and we can also say that the generated distribution is far too complex to be expressed explicitly. So, comparisons based on explicit expressions are not possible. However, if we have a way to compare probability distributions based on samples, we can use it to train the network. Indeed, we have a sample of true data and we can, at each iteration of the training process, produce a sample of generated data.

Although, in theory, any distance (or similarity measure) able to compare effectively two distributions based on samples can be used, we can mention in particular the Maximum Mean Discrepancy (MMD) approach. The MMD defines a distance between two probability distributions that can be computed (estimated) based on samples of these distributions.

NorbertZheng commented 2 years ago

Backpropagation of the distribution matching error

So, once we have defined a way to compare two distributions based on samples, we can define the training process of the generative network in GMNs. The idea of GMNs is to optimize the network by repeating the following steps:

Generate some uniform inputs.
Make these inputs go through the network and collect the generated outputs.
Compare the true "dog probability distribution" and the generated one based on the available samples (for example compute the MMD distance between the sample of true dog images and the sample of generated ones).
Use backpropagation to make one step of gradient descent to lower the distance (for example MMD) between true and generated distributions.

When following these steps we are applying a gradient descent over the network with a loss function that is the distance between the true and the generated distributions at the current iteration.

NorbertZheng commented 2 years ago

Generative Adversarial Networks

The "indirect" training method

The brilliant idea that rules GANs consists in replacing this direct comparison by an indirect one that takes the form of a downstream task over these two distributions. The training of the generative network is then done with respect to this task such that it forces the generated distribution to get closer and closer to the true distribution.

The downstream task of GANs is a discrimination task between true and generated samples. Or we could say a "non-discrimination" tasks as we want the discrimination to fail as much as possible. So, in a GAN architecture, we have a discriminator, that takes samples of true and generated data and that try to classify them as well as possible. Let's see on a simple example why the direct and indirect approaches we mentioned should, in theory, lead to the same optimal generator.

NorbertZheng commented 2 years ago

The ideal case: perfect generator and discriminator

In order to better understand why training a generator to fool a discriminator will lead to the same result as training directly the generator to match the target distribution, let’s take a simple one dimensional example.

Suppose that we have a true distribution, for example a one dimensional gaussian, and that we want a generator that samples from this probability distribution. What we called “direct” training method would then consist in adjusting iteratively the generator (gradient descent iterations) to correct the measured difference/error between true and generated distributions. Finally, assuming the optimisation process perfect, we should end up with the generated distribution that matches exactly the true distribution. test

For the “indirect” approach, we have to consider also a discriminator. We assume for now that this discriminator is a kind of oracle that knows exactly what are the true and generated distribution and that is able, based on this information, to predict a class (“true” or “generated”) for any given point. If the two distributions are far appart, the discriminator will be able to classify easily and with a high level of confidence most of the points we present to it. If we want to fool the discriminator, we have to bring the generated distribution close to the true one. The discriminator will have the most difficulty to predict the class when the two distributions will be equal in all points: in this case, for each point there are equal chances for it to be “true” or “generated” and then the discriminator can’t do better than being true in one case out of two in average. test

NorbertZheng commented 2 years ago

At this point, it seems legit to wonder whether this indirect method is really a good idea. Indeed, it seems to be more complicated (we have to optimise the generator based on a downstream task instead of directly based on the distributions) and it requires a discriminator that we consider here as a given oracle but that is, in reality, neither known nor perfect.

The difficulty of directly comparing two probability distributions based on samples counterbalances the apparent higher complexity of indirect method.
It is obvious that the discriminator is not known. However, it can be learned!

NorbertZheng commented 2 years ago

The approximation: adversarial neural networks

Let’s now describe the specific form that take the generator and the discriminator in the GANs architecture.

The generator is a neural network that models a transform function. It takes as input a simple random variable and must return, once trained, a random variable that follows the targeted distribution.
As the generator is very complicated and unknown, we decide to model the discriminator with another neural network. This neural network models a discriminative function. It takes as input a point (in our dog example a N dimensional vector) and returns as output the probability of this point to be a “true” one.

Notice that the fact that we impose now a parametrised model to express both the generator and the discriminator (instead of the idealised versions in the previous subsection) has, in practice, not a huge impact on the theoretical argument/intuition given above:

we just then work in some parametrised spaces instead of ideal full spaces and, so, the optimal points that we should reach in the ideal case can then be seen as “rounded” by the precision capacity of the parametrised models.

Once defined, the two networks can then be trained jointly (at the same time) with opposite goals :

The goal of the generator is to fool the discriminator, so the generative neural network is trained to maximise the final classification error (between true and generated data).
The goal of the discriminator is to detect fake generated data, so the discriminative neural network is trained to minimise the final classification error.

So, at each iteration of the training process, the weights of the generative network are updated in order to increase the classification error (error gradient ascent over the generator’s parameters) whereas the weights of the discriminative network are updated so that to decrease this error (error gradient descent over the discriminator’s parameters).

These opposite goals and the implied notion of adversarial training of the two networks explains the name of “adversarial networks”: both networks try to beat each other and, doing so, they are both becoming better and better. The competition between them makes these two networks “progress” with respect to their respective goals. From a game theory point of view, we can think of this setting as a minimax two-players game where the equilibrium state corresponds to the situation where the generator produces data from the exact targeted distribution and where the discriminator predicts “true” or “generated” with probability 1/2 for any point it receives.

NorbertZheng commented 2 years ago

Mathematical details about GANs

Neural networks modelling essentially requires to define two things: an architecture and a loss function. We have already described the architecture of Generative Adversarial Networks. It consists in two networks:

A generative network G(.) that takes a random input z with density p_z and returns an output x_g = G(z) that should follow (after training) the targeted probability distribution.
A discriminative network D(.) that takes an input x that can be a “true” one (x_t, whose density is denoted p_t) or a “generated” one (x_g, whose density p_g is the density induced by the density p_z going through G) and that returns the probability D(x) of x to be a “true” data.

Let’s take now a closer look at the “theoretical” loss function of GANs. If we send to the discriminator “true” and “generated” data in the same proportions, the expected absolute error of the discriminator can then be expressed as

The goal of the generator is to fool the discriminator whose goal is to be able to distinguish between true and generated data. So, when training the generator, we want to maximise this error while we try to minimise it for the discriminator. It gives us

For any given generator G (along with the induced probability density p_g), the best possible discriminator is the one that minimises

In order to minimise (with respect to D) this integral, we can minimise the function inside the integral for each value of x. It then defines the best possible discriminator for a given generator

(in fact, one of the best because x values such that p_t(x)=p_g(x) could be handled in another way but it doesn’t matter for what follows). We then search G that maximises

Again, in order to maximise (with respect to G) this integral, we can maximise the function inside the integral for each value of x. As the density p_t is independent of the generator G, we can’t do better than setting G such that

Of course, as p_g is a probability density that should integrate to 1, we necessarily have for the best G

So, we have shown that, in an ideal case with unlimited capacities generator and discriminator, the optimal point of the adversarial setting is such that the generator produces the same density as the true density and the discriminator can’t do better than being true in one case out of two, just like the intuition told us. Finally, notice also that G maximises

Under this form, we better see that G wants to maximise the expected probability of the discriminator to be wrong.

NorbertZheng commented 2 years ago

Takeaways

The main takeaways of this article are:

Computers can basically generate simple pseudo-random variables (for example they can generate variables that follow very closely a uniform distribution).
There exists different ways to generate more complex random variables including the notion of “transform method” that consists in expressing a random variable as a function of some simpler random variable(s).
In machine learning, the generative models try to generate data from a given (complex) probability distribution.
Deep learning generative models are modelled as neural networks (very complex functions) that take as input a simple random variable and that return a random variable that follows the targeted distribution (“transform method” like).
These generative networks can be trained “directly” (by comparing the distribution of generated data to the true distribution): this is the idea of Generative Matching Networks.
These generative networks can also be trained “indirectly” (by trying to fool another network that is trained at the same time to distinguish “generated” data from “true” data): this is the idea of Generative Adversarial Networks.