NorbertZheng / read-papers

My paper reading notes.

NeurIPS '19 | Generative modeling by estimating gradients of the data distribution. #25

Closed NorbertZheng closed 2 years ago

NorbertZheng commented 2 years ago

Song Y, Ermon S. Generative modeling by estimating gradients of the data distribution. NeurIPS 2019.

NorbertZheng commented 2 years ago

Related Reference

NorbertZheng commented 2 years ago

Introduction

Diffusion models took off like a rocket at the end of 2019, after the publication of Song & Ermon’s seminal paper. In this paper-reading, I highlight a connection to another type of model: the venerable autoencoder.

NorbertZheng commented 2 years ago

Diffusion Models

Diffusion models are fast becoming the go-to model for any task that requires producing perceptual signals, such as images and sound. They provide similar fidelity as alternatives based on generative adversarial nets (GANs) #22 or autoregressive models, but with better mode coverage than the former and a faster, more flexible sampling procedure than the latter.

In a nutshell, diffusion models are constructed by first describing a procedure for gradually turning data into noise, and then training a neural network that learns to invert this procedure step-by-step.

If you start from pure noise and do this enough times, it turns out you can generate data this way!
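To make this concrete, here is a minimal numpy sketch of both halves: the forward corruption step, and an annealed Langevin-style sampling loop in the spirit of Song & Ermon. The `score_model` callable is an assumption (a stand-in for a trained network), and the step sizes and schedule are illustrative only:

```python
import numpy as np

def corrupt(x, sigma_t, rng):
    """Forward process: turn data into noise by adding Gaussian noise of std sigma_t."""
    return x + sigma_t * rng.standard_normal(x.shape)

def sample(score_model, shape, sigmas, eps=2e-5, n_steps=10, rng=None):
    """Annealed Langevin sampling in the spirit of Song & Ermon (2019).

    `score_model(x, sigma)` is an assumed stand-in for a trained network that
    estimates the score grad_x log p(x) at noise level sigma.
    `sigmas` runs from the largest to the smallest noise level.
    """
    rng = rng or np.random.default_rng(0)
    x = sigmas[0] * rng.standard_normal(shape)       # start from pure noise
    for sigma in sigmas:                             # anneal the noise level down
        alpha = eps * (sigma / sigmas[-1]) ** 2      # step size scaled per noise level
        for _ in range(n_steps):                     # a few Langevin steps per level
            z = rng.standard_normal(shape)
            x = x + 0.5 * alpha * score_model(x, sigma) + np.sqrt(alpha) * z
    return x
```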

Diffusion models have been around for a while, but really took off at the end of 2019. The ideas are young enough that the field hasn't really settled on one particular convention or paradigm to describe them, which means almost every paper uses a slightly different framing, and often a different notation as well. This can make it quite challenging to see the bigger picture when trawling through the literature, of which there is already a lot! Diffusion models go by many names:

Some people just call them energy-based models (EBMs), of which they technically are a special case.

My personal favorite perspective starts from the idea of score matching, Hyvärinen et al., and uses a formalism based on stochastic differential equations (SDEs), Sohl-Dickstein et al.. For an in-depth treatment of diffusion models from this perspective, I strongly recommend Yang Song's richly illustrated blog post (which also comes with code and colabs). It is especially enlightening with regard to the connection between all these different perspectives. If you are familiar with variational autoencoders, you may find Lilian Weng or Jakub Tomczak's takes on this model family more approachable.

NorbertZheng commented 2 years ago

Denoising Autoencoders

Autoencoders are neural networks that are trained to predict their input. In and of itself, this is a trivial and meaningless task, but it becomes more interesting when the network architecture is restricted in some way, or when the input is corrupted and the network has to learn to undo this corruption.

A typical architectural restriction is to introduce some sort of bottleneck, which limits the amount of information that can pass through. This implies that the network must learn to encode the most important information efficiently, so that it can pass through the bottleneck and still allow accurate reconstruction of the input. Such a bottleneck can be created by reducing the capacity of a particular layer of the network, by introducing quantisation, or by applying some form of regularisation to it.

The internal representation used in this bottleneck (often referred to as the latent representation) is what we are really after. It should capture the essence of the input, while discarding a lot of irrelevant detail.

Corrupting the input is another viable strategy to make autoencoders learn useful representations. One could argue that models with corrupted input are not autoencoders in the strictest sense, because the input and target output differ, but this is really a semantic discussion - one could just as well consider the corruption procedure part of the model itself. In practice, such models are typically referred to as denoising autoencoders.
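As an illustration, here is a minimal PyTorch sketch of a denoising autoencoder: a bottlenecked MLP whose input is corrupted with Gaussian noise, while the reconstruction target remains the clean input. The layer sizes and noise level are arbitrary choices for the sketch:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """A small MLP autoencoder with a bottleneck, trained to undo input corruption."""
    def __init__(self, dim=784, hidden=256, latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, latent))   # the bottleneck
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                     nn.Linear(hidden, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def training_step(model, x, noise_std=0.5):
    x_noisy = x + noise_std * torch.randn_like(x)   # corrupt the input
    x_recon = model(x_noisy)                        # reconstruct from the corrupted input
    return ((x_recon - x) ** 2).mean()              # the target is the *clean* input
```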

Denoising autoencoders were actually some of the first true “deep learning” models: back when we hadn’t yet figured out how to reliably train neural networks deeper than a few layers with simple gradient descent, the prevalent approach was to pre-train networks layer by layer, and denoising autoencoders were frequently used for this purpose, Vincent et al. (especially by Yoshua Bengio and colleagues at MILA – restricted Boltzmann machines were another option, favoured by Geoffrey Hinton and colleagues).

NorbertZheng commented 2 years ago

One and the same?

So what is the link between modern diffusion models and these - by deep learning standards - ancient autoencoders? I was inspired to ponder this connection a bit more after seeing some recent tweets speculating about autoencoders making a comeback. As far as I'm concerned, the autoencoder comeback is already in full swing, it's just that we call them diffusion models now! Let's unpack this.

The neural network that makes diffusion models tick is trained to estimate the so-called score function, $\nabla_{x} \log p(x)$, the gradient of the log-likelihood w.r.t. the input (a vector-valued function): $s_{\theta}(x) = \nabla_{x} \log p_{\theta}(x)$. Note that this is different from $\nabla_{\theta} \log p_{\theta}(x)$, the gradient w.r.t. the model parameters $\theta$, which is the one you would use for training if this were a likelihood-based model. The latter tells you how to change the model parameters to increase the likelihood of the input under the model, whereas the former tells you how to change the input itself to increase its likelihood. (This is the same gradient you would use for DeepDream-style manipulation of images.)
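The distinction between the two gradients is easy to demonstrate with autograd. In this sketch, a simple Gaussian stands in for $\log p_{\theta}(x)$ (a hypothetical tractable model, purely for illustration):

```python
import torch

mu = torch.zeros(2, requires_grad=True)        # model parameter theta

def log_p(x):
    return -0.5 * ((x - mu) ** 2).sum()        # unnormalised Gaussian log-density

x = torch.randn(2, requires_grad=True)
logp = log_p(x)

# grad_x log p: how to change the *input* to increase its likelihood (the score)
score = torch.autograd.grad(logp, x, retain_graph=True)[0]
# grad_theta log p: how to change the *parameters* (used for likelihood training)
param_grad = torch.autograd.grad(logp, mu)[0]
```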

In practice, we want to use the same network at every point in the gradual denoising process, i.e. at every noise level (from pure noise all the way to clean data). To account for this, it takes an additional input $t \in [0,1]$ which indicates how far along we are in the denoising process (position encoding?): $s_{\theta}(x_{t}, t) = \nabla_{x_{t}} \log p_{\theta}(x_{t})$. By convention, $t=0$ corresponds to clean data and $t=1$ corresponds to pure noise, so we actually "go back in time" when denoising.

The way you train this network is by taking inputs $x$ and corrupting them with additive noise $\varepsilon_{t} \sim \mathcal{N}(0, \sigma_{t}^{2})$, and then predicting $\varepsilon_{t}$ from $x_{t} = x + \varepsilon_{t}$. The reason why this works is not entirely obvious. I recommend reading Pascal Vincent's 2010 tech report on the subject for an in-depth explanation of why you can do this.

Note that the variance depends on $t$, because it corresponds to the specific noise level at time $t$. The loss function is typically just the mean squared error, sometimes weighted by a scale factor $\lambda(t)$, so that some noise levels are prioritised over others:

$$\arg\min_{\theta} \mathcal{L}_{\theta} = \arg\min_{\theta} \mathbb{E}_{t, p(x_{t})} \left[ \lambda(t) || s_{\theta}(x + \varepsilon_{t}, t) - \varepsilon_{t} ||^{2}_{2} \right]$$

Going forward, let's assume $\lambda(t) = 1$, which is usually what is done in practice anyway (though other choices have their uses as well).
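In code, one unweighted training step might look as follows. This is a sketch under assumptions: `score_model(x_t, t)` is a stand-in for the noise-conditional network, and the geometric schedule mapping $t$ to $\sigma_{t}$ is one common choice, not the only one:

```python
import torch

def denoising_score_matching_step(score_model, x, sigma_min=0.01, sigma_max=1.0):
    """One training step with lambda(t) = 1; `score_model(x_t, t)` is an
    assumed noise-conditional network trained to predict eps_t."""
    t = torch.rand(x.shape[0], device=x.device)             # sample a noise level per example
    sigma_t = sigma_min * (sigma_max / sigma_min) ** t      # geometric schedule (one common choice)
    sigma_t = sigma_t.view(-1, *([1] * (x.dim() - 1)))      # broadcast over the data dimensions
    eps_t = sigma_t * torch.randn_like(x)                   # eps_t ~ N(0, sigma_t^2)
    x_t = x + eps_t                                         # corrupt the input
    return ((score_model(x_t, t) - eps_t) ** 2).mean()      # mean squared error against the noise
```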

One key observation is that predicting $\varepsilon_{t}$ and predicting $x$ are equivalent, so instead, we could just use

$$\arg\min_{\theta} \mathbb{E}_{t, p(x_{t})} \left[ || s'_{\theta}(x + \varepsilon_{t}, t) - x ||^{2}_{2} \right]$$

To see that they are equivalent, consider taking a trained model $s_{\theta}$ that predicts $\varepsilon_{t}$ and adding a new residual connection to it, going all the way from the input to the output, with a scale factor of $-1$. This modified model then predicts:

$$\varepsilon_{t} - x_{t} = \varepsilon_{t} - (x + \varepsilon_{t}) = -x$$

In other words, we obtain a denoising autoencoder (up to a minus sign). This might seem surprising, but intuitively, it actually makes sense that to increase the likelihood of a noisy input, you should probably just try to remove the noise, because noise is inherently unpredictable. Indeed, it turns out that these two things are equivalent.
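The residual-connection argument is easy to express in code. Here is a sketch wrapping a hypothetical trained noise predictor `eps_model` to obtain the sign-flipped clean-data predictor, i.e. a denoising autoencoder up to a minus sign:

```python
import torch.nn as nn

class SignFlippedDenoiser(nn.Module):
    """Adds a residual connection with scale -1 from input to output of a
    hypothetical trained noise predictor `eps_model`, turning it into a
    clean-data predictor up to a minus sign."""
    def __init__(self, eps_model):
        super().__init__()
        self.eps_model = eps_model

    def forward(self, x_t, t):
        # eps_model(x_t, t) predicts eps_t, so the output is eps_t - x_t = -x;
        # negating it recovers the clean input x.
        return self.eps_model(x_t, t) - x_t
```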

NorbertZheng commented 2 years ago

A tenuous connection?

Of course, the title of this blog post is intentionally a bit facetious: while there is a deeper connection between diffusion models and autoencoders than many people realize, the models have completely different purposes and so are not interchangeable, e.g. denoising autoencoders are not equivalent to diffusion models.

There are two key differences with the denoising autoencoders of yore:

- the additional input $t$, which allows a single model with a single set of shared parameters to handle many different noise levels;
- the absence of a bottleneck: modern diffusion models are typically parameterised as UNet-style architectures with skip connections, so nothing forces them to learn a compact latent representation.

In the strictest sense, both of these differences have no bearing on whether the model can be considered an autoencoder or not. In practice, however, the point of an autoencoder is usually understood to be to learn a useful representation, so saying that diffusion models are autoencoders could perhaps be considered a bit... pedantic. Nevertheless, I wanted to highlight this connection because I think many more people know the ins and outs of autoencoders than of diffusion models at this point. I believe that appreciating the link between the two can make the latter less daunting to understand.

The link is not merely a curiosity, by the way; it has also been the subject of several papers, which constitute an early exploration of the ideas that power modern diffusion models. Apart from the work by Pascal Vincent mentioned earlier, there is also a series of papers by Guillaume Alain and colleagues that are worth checking out.

[Note that there is another way to connect diffusion models to autoencoders, by viewing them as (potentially infinitely) deep latent variable models. I am personally less interested in that connection because it doesn't provide me with much additional insight, but it is just as valid. Here's a blog post by Angus Turner that explores this interpretation in detail.]

NorbertZheng commented 2 years ago

Noise and Scale

[Header image: noisy mountains, with the noise level decreasing from left to right.] I believe the idea of training a single model to handle many different noise levels with shared parameters is ultimately the key ingredient that made diffusion models really take off. Song & Ermon called them noise-conditional score networks (NCSNs) and provide a very lucid explanation of why this is important, which I won't repeat here.

The idea of using different noise levels in a single denoising autoencoder had previously been explored for representation learning, but not for generative modelling. Several works suggest gradually decreasing the level of noise over the course of training to improve the learnt representations (Geras et al., Chandra et al., Zhang et al.). Composite denoising autoencoders have multiple subnetworks that handle different noise levels, which is a step closer to the score networks that we use in diffusion models, though still missing the parameter sharing.

A particularly interesting observation stemming from these works, which is also highly relevant to diffusion models, is that the level of noise determines the scale of the features the denoiser learns to capture: the higher the noise level, the larger-scale the features.

I think this connection is worth investigating further: it implies that diffusion models fill in missing parts of the input at progressively smaller scales, as the noise level decreases step by step. This does seem to be the case in practice, and it is potentially a useful feature. Concretely, it means that the weighting $\lambda(t)$ of the noise levels in the loss gives us a handle on how much the model focuses on large-scale structure versus fine-grained detail.

This is great, because excessive attention to detail is actually a major problem with likelihood-based models.

This connection between noise levels and feature scales was initially baffling to me: the noise $\varepsilon_{t}$ that we add to the input during training is isotropic Gaussian, so we are effectively adding noise to each input element (e.g. pixel) independently. If that is the case, how can the level of noise (i.e. the variance) possibly impact the scale of the features that are learnt? I found it helpful to think of it this way:

Concretely, if an image contains a human face and we add a lot of noise to it, we will probably no longer be able to discern the face if it is far away from the camera (i.e. covers fewer pixels in the image), whereas if it is close to the camera, we might still see a faint outline. The header image of this section provides another example: the level of noise decreases from left to right. On the very left, we can still see the rough outline of a mountain despite very high levels of noise.

This is completely handwavy, but it provides some intuition for why there is a direct correspondence between the variance of the noise and the scale of features captured by denoising autoencoders and score networks.
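A toy numpy experiment (my own illustration, not from the paper) makes the intuition more tangible: averaging over $k$ neighbouring samples reduces the noise standard deviation by roughly $\sqrt{k}$, so large-scale features survive smoothing at noise levels that completely drown out small-scale, low-amplitude ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
x = np.linspace(0, 1, n)
coarse = np.sin(2 * np.pi * x)            # large-scale feature
fine = 0.2 * np.sin(2 * np.pi * 64 * x)   # small-scale, low-amplitude feature
signal = coarse + fine

for sigma in [0.1, 0.5, 2.0]:
    noisy = signal + sigma * rng.standard_normal(n)
    # Local averaging over k samples cuts the noise std by ~sqrt(k), so the
    # coarse feature re-emerges while the fine one is drowned out.
    k = 32
    smoothed = np.convolve(noisy, np.ones(k) / k, mode="same")
    corr_coarse = np.corrcoef(smoothed, coarse)[0, 1]
    corr_fine = np.corrcoef(noisy - smoothed, fine)[0, 1]
    print(f"sigma={sigma}: coarse corr={corr_coarse:.2f}, fine corr={corr_fine:.2f}")
```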

NorbertZheng commented 2 years ago

Closing thoughts

So there you have it: diffusion models are autoencoders. Sort of. When you squint a bit. Here are some key takeaways, to wrap up:

- Diffusion models are trained to invert a gradual noising process, and the network at their core estimates the score function $\nabla_{x} \log p(x)$.
- That network is, up to a sign and a residual connection, a denoising autoencoder conditioned on the noise level, without a bottleneck.
- The level of noise corresponds to the scale of the features the model captures, which is a big part of why training a single network shared across all noise levels works so well.