NorbertZheng / read-papers

My paper reading notes.

NeurIPS '19 | Generative modeling by estimating gradients of the data distribution. #25

Closed NorbertZheng closed 2 years ago

NorbertZheng commented 2 years ago

Song Y, Ermon S. Generative modeling by estimating gradients of the data distribution. NeurIPS 2019.

NorbertZheng commented 2 years ago

Related Reference

NorbertZheng commented 2 years ago

Introduction

Diffusion models took off like a rocket at the end of 2019, after the publication of Song & Ermon’s seminal paper. In this paper-reading, I highlight a connection to another type of model: the venerable autoencoder.

NorbertZheng commented 2 years ago

Diffusion Models

Diffusion models are fast becoming the go-to model for any task that requires producing perceptual signals, such as images and sound. They provide similar fidelity as alternatives based on generative adversarial nets (GANs) #22 or autoregressive models, but with better mode coverage than the former and a faster, more flexible sampling procedure than the latter.

In a nutshell, diffusion models are constructed by first describing a procedure for gradually turning data into noise, and then training a neural network that learns to invert this procedure step-by-step.

If you start from pure noise and do this enough times, it turns out you can generate data this way!
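To make this concrete, here is a minimal numpy sketch of both halves: the forward corruption step, and an annealed Langevin-style sampling loop in the spirit of Song & Ermon. The `score_model` callable is an assumption (a stand-in for a trained network), and the step sizes and schedule are illustrative only:

```python
import numpy as np

def corrupt(x, sigma_t, rng):
    """Forward process: turn data into noise by adding Gaussian noise of std sigma_t."""
    return x + sigma_t * rng.standard_normal(x.shape)

def sample(score_model, shape, sigmas, eps=2e-5, n_steps=10, rng=None):
    """Annealed Langevin sampling in the spirit of Song & Ermon (2019).

    `score_model(x, sigma)` is an assumed stand-in for a trained network that
    estimates the score grad_x log p(x) at noise level sigma.
    `sigmas` runs from the largest to the smallest noise level.
    """
    rng = rng or np.random.default_rng(0)
    x = sigmas[0] * rng.standard_normal(shape)       # start from pure noise
    for sigma in sigmas:                             # anneal the noise level down
        alpha = eps * (sigma / sigmas[-1]) ** 2      # step size scaled per noise level
        for _ in range(n_steps):                     # a few Langevin steps per level
            z = rng.standard_normal(shape)
            x = x + 0.5 * alpha * score_model(x, sigma) + np.sqrt(alpha) * z
    return x
```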

Diffusion models have been around for a while, but really took off at the end of 2019. The ideas are young enough that the field hasn't really settled on one particular convention or paradigm to describe them, which means almost every paper uses a slightly different framing, and often a different notation as well. This can make it quite challenging to see the bigger picture when trawling through the literature, of which there is already a lot! Diffusion models go by many names:

Some people just call them energy-based models (EBMs), of which they technically are a special case.

My personal favorite perspective starts from the idea of score matching, Hyvärinen et al., and uses a formalism based on stochastic differential equations (SDEs), Sohl-Dickstein et al.. For an in-depth treatment of diffusion models from this perspective, I strongly recommend Yang Song's richly illustrated blog post (which also comes with code and colabs). It is especially enlightening with regard to the connection between all these different perspectives. If you are familiar with variational autoencoders, you may find Lilian Weng or Jakub Tomczak's takes on this model family more approachable.

NorbertZheng commented 2 years ago

Denoising Autoencoders

Autoencoders are neural networks that are trained to predict their input. In and of itself, this is a trivial and meaningless task, but it becomes more interesting when the network architecture is restricted in some way, or when the input is corrupted and the network has to learn to undo this corruption.

A typical architectural restriction is to introduce some sort of bottleneck, which limits the amount of information that can pass through. This implies that the network must learn to encode the most important information efficiently, so that it can pass through the bottleneck and still allow accurate reconstruction of the input. Such a bottleneck can be created by reducing the capacity of a particular layer of the network, by introducing quantisation, or by applying some form of regularisation to it.

The internal representation used in this bottleneck (often referred to as the latent representation) is what we are really after. It should capture the essence of the input, while discarding a lot of irrelevant detail.

Corrupting the input is another viable strategy to make autoencoders learn useful representations. One could argue that models with corrupted input are not autoencoders in the strictest sense, because the input and target output differ, but this is really a semantic discussion - one could just as well consider the corruption procedure part of the model itself. In practice, such models are typically referred to as denoising autoencoders.
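As an illustration, here is a minimal PyTorch sketch of a denoising autoencoder: a bottlenecked MLP whose input is corrupted with Gaussian noise, while the reconstruction target remains the clean input. The layer sizes and noise level are arbitrary choices for the sketch:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """A small MLP autoencoder with a bottleneck, trained to undo input corruption."""
    def __init__(self, dim=784, hidden=256, latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, latent))   # the bottleneck
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                     nn.Linear(hidden, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def training_step(model, x, noise_std=0.5):
    x_noisy = x + noise_std * torch.randn_like(x)   # corrupt the input
    x_recon = model(x_noisy)                        # reconstruct from the corrupted input
    return ((x_recon - x) ** 2).mean()              # the target is the *clean* input
```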

Denoising autoencoders were actually some of the first true “deep learning” models: back when we hadn’t yet figured out how to reliably train neural networks deeper than a few layers with simple gradient descent, the prevalent approach was to pre-train networks layer by layer, and denoising autoencoders were frequently used for this purpose, Vincent et al. (especially by Yoshua Bengio and colleagues at MILA – restricted Boltzmann machines were another option, favoured by Geoffrey Hinton and colleagues).

NorbertZheng commented 2 years ago

One and the same?

So what is the link between modern diffusion models and these - by deep learning standards - ancient autoencoders? I was inspired to ponder this connection a bit more after seeing some recent tweets speculating about autoencoders making a comeback. As far as I'm concerned, the autoencoder comeback is already in full swing, it's just that we call them diffusion models now! Let's unpack this.

The neural network that makes diffusion models tick is trained to estimate the so-called score function, $\nabla_{x} \log p(x)$, the gradient of the log-likelihood w.r.t. the input (a vector-valued function): $s_{\theta}(x) = \nabla_{x} \log p_{\theta}(x)$. Note that this is different from $\nabla_{\theta} \log p_{\theta}(x)$, the gradient w.r.t. the model parameters $\theta$, which is the one you would use for training if this were a likelihood-based model. The latter tells you how to change the model parameters to increase the likelihood of the input under the model, whereas the former tells you how to change the input itself to increase its likelihood. (This is the same gradient you would use for DeepDream-style manipulation of images.)
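The distinction between the two gradients is easy to demonstrate with autograd. In this sketch, a simple Gaussian stands in for $\log p_{\theta}(x)$ (a hypothetical tractable model, purely for illustration):

```python
import torch

mu = torch.zeros(2, requires_grad=True)        # model parameter theta

def log_p(x):
    return -0.5 * ((x - mu) ** 2).sum()        # unnormalised Gaussian log-density

x = torch.randn(2, requires_grad=True)
logp = log_p(x)

# grad_x log p: how to change the *input* to increase its likelihood (the score)
score = torch.autograd.grad(logp, x, retain_graph=True)[0]
# grad_theta log p: how to change the *parameters* (used for likelihood training)
param_grad = torch.autograd.grad(logp, mu)[0]
```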

In practice, we want to use the same network at every point in the gradual denoising process, i.e. at every noise level (from pure noise all the way to clean data). To account for this, it takes an additional input $t \in [0,1]$ which indicates how far along we are in the denoising process (position encoding?): $s_{\theta}(x_{t}, t) = \nabla_{x_{t}} \log p_{\theta}(x_{t})$. By convention, $t=0$ corresponds to clean data and $t=1$ corresponds to pure noise, so we actually "go back in time" when denoising.

The way you train this network is by taking inputs $x$ and corrupting them with additive noise $\varepsilon_{t} \sim \mathcal{N}(0, \sigma_{t}^{2})$, and then predicting $\varepsilon_{t}$ from $x_{t} = x + \varepsilon_{t}$. The reason why this works is not entirely obvious. I recommend reading Pascal Vincent's 2010 tech report on the subject for an in-depth explanation of why you can do this.

Note that the variance depends on $t$, because it corresponds to the specific noise level at time $t$. The loss function is typically just the mean squared error, sometimes weighted by a scale factor $\lambda(t)$, so that some noise levels are prioritised over others:

$$\arg\min_{\theta} \mathcal{L}_{\theta} = \arg\min_{\theta} \mathbb{E}_{t, p(x_{t})} \left[ \lambda(t) || s_{\theta}(x + \varepsilon_{t}, t) - \varepsilon_{t} ||^{2}_{2} \right]$$

Going forward, let's assume $\lambda(t) = 1$, which is usually what is done in practice anyway (though other choices have their uses as well).
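In code, one unweighted training step might look as follows. This is a sketch under assumptions: `score_model(x_t, t)` is a stand-in for the noise-conditional network, and the geometric schedule mapping $t$ to $\sigma_{t}$ is one common choice, not the only one:

```python
import torch

def denoising_score_matching_step(score_model, x, sigma_min=0.01, sigma_max=1.0):
    """One training step with lambda(t) = 1; `score_model(x_t, t)` is an
    assumed noise-conditional network trained to predict eps_t."""
    t = torch.rand(x.shape[0], device=x.device)             # sample a noise level per example
    sigma_t = sigma_min * (sigma_max / sigma_min) ** t      # geometric schedule (one common choice)
    sigma_t = sigma_t.view(-1, *([1] * (x.dim() - 1)))      # broadcast over the data dimensions
    eps_t = sigma_t * torch.randn_like(x)                   # eps_t ~ N(0, sigma_t^2)
    x_t = x + eps_t                                         # corrupt the input
    return ((score_model(x_t, t) - eps_t) ** 2).mean()      # mean squared error against the noise
```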

One key observation is that predicting $\varepsilon_{t}$ and predicting $x$ are equivalent, so instead, we could just use

$$\arg\min_{\theta} \mathbb{E}_{t, p(x_{t})} \left[ || s'_{\theta}(x + \varepsilon_{t}, t) - x ||^{2}_{2} \right]$$

To see that they are equivalent, consider taking a trained model $s_{\theta}$ that predicts $\varepsilon_{t}$ and adding a new residual connection to it, going all the way from the input to the output, with a scale factor of $-1$. This modified model then predicts:

$$\varepsilon_{t} - x_{t} = \varepsilon_{t} - (x + \varepsilon_{t}) = -x$$

In other words, we obtain a denoising autoencoder (up to a minus sign). This might seem surprising, but intuitively, it actually makes sense that to increase the likelihood of a noisy input, you should probably just try to remove the noise, because noise is inherently unpredictable. Indeed, it turns out that these two things are equivalent.
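The residual-connection argument is easy to express in code. Here is a sketch wrapping a hypothetical trained noise predictor `eps_model` to obtain the sign-flipped clean-data predictor, i.e. a denoising autoencoder up to a minus sign:

```python
import torch.nn as nn

class SignFlippedDenoiser(nn.Module):
    """Adds a residual connection with scale -1 from input to output of a
    hypothetical trained noise predictor `eps_model`, turning it into a
    clean-data predictor up to a minus sign."""
    def __init__(self, eps_model):
        super().__init__()
        self.eps_model = eps_model

    def forward(self, x_t, t):
        # eps_model(x_t, t) predicts eps_t, so the output is eps_t - x_t = -x;
        # negating it recovers the clean input x.
        return self.eps_model(x_t, t) - x_t
```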

NorbertZheng commented 2 years ago

A tenuous connection?

Of course, the title of this blog post is intentionally a bit facetious: while there is a deeper connection between diffusion models and autoencoders than many people realize, the models have completely different purposes and so are not interchangeable, e.g. denoising autoencoders are not equivalent to diffusion models.

There are two key differences with the denoising autoencoders of yore:

- the additional input $t$, which allows a single model with a single set of shared parameters to handle many different noise levels;
- the absence of a bottleneck: modern diffusion models are typically parameterised as UNet-style architectures with skip connections, so nothing forces them to learn a compact latent representation.

In the strictest sense, both of these differences have no bearing on whether the model can be considered an autoencoder or not. In practice, however, the point of an autoencoder is usually understood to be to learn a useful representation, so saying that diffusion models are autoencoders could perhaps be considered a bit... pedantic. Nevertheless, I wanted to highlight this connection because I think many more people know the ins and outs of autoencoders than of diffusion models at this point. I believe that appreciating the link between the two can make the latter less daunting to understand.

The link is not merely a curiosity, by the way; it has also been the subject of several papers, which constitute an early exploration of the ideas that power modern diffusion models. Apart from the work by Pascal Vincent mentioned earlier, there is also a series of papers by Guillaume Alain and colleagues that are worth checking out.

[Note that there is another way to connect diffusion models to autoencoders, by viewing them as (potentially infinitely) deep latent variable models. I am personally less interested in that connection because it doesn't provide me with much additional insight, but it is just as valid. Here's a blog post by Angus Turner that explores this interpretation in detail.]

NorbertZheng commented 2 years ago

Noise and Scale

[Header image: noisy mountains, with the noise level decreasing from left to right.] I believe the idea of training a single model to handle many different noise levels with shared parameters is ultimately the key ingredient that made diffusion models really take off. Song & Ermon called them noise-conditional score networks (NCSNs) and provide a very lucid explanation of why this is important, which I won't repeat here.

The idea of using different noise levels in a single denoising autoencoder had previously been explored for representation learning, but not for generative modelling. Several works suggest gradually decreasing the level of noise over the course of training to improve the learnt representations (Geras et al., Chandra et al., Zhang et al.). Composite denoising autoencoders have multiple subnetworks that handle different noise levels, which is a step closer to the score networks that we use in diffusion models, though still missing the parameter sharing.

A particularly interesting observation stemming from these works, which is also highly relevant to diffusion models, is that the level of noise determines the scale of the features the denoiser learns to capture: the higher the noise level, the larger-scale the features.

I think this connection is worth investigating further: it implies that diffusion models fill in missing parts of the input at progressively smaller scales, as the noise level decreases step by step. This does seem to be the case in practice, and it is potentially a useful feature. Concretely, it means that the weighting $\lambda(t)$ of the noise levels in the loss gives us a handle on how much the model focuses on large-scale structure versus fine-grained detail.

This is great, because excessive attention to detail is actually a major problem with likelihood-based models.

This connection between noise levels and feature scales was initially baffling to me: the noise $\varepsilon_{t}$ that we add to the input during training is isotropic Gaussian, so we are effectively adding noise to each input element (e.g. pixel) independently. If that is the case, how can the level of noise (i.e. the variance) possibly impact the scale of the features that are learnt? I found it helpful to think of it this way:

Concretely, if an image contains a human face and we add a lot of noise to it, we will probably no longer be able to discern the face if it is far away from the camera (i.e. covers fewer pixels in the image), whereas if it is close to the camera, we might still see a faint outline. The header image of this section provides another example: the level of noise decreases from left to right. On the very left, we can still see the rough outline of a mountain despite very high levels of noise.

This is completely handwavy, but it provides some intuition for why there is a direct correspondence between the variance of the noise and the scale of features captured by denoising autoencoders and score networks.
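A toy numpy experiment (my own illustration, not from the paper) makes the intuition more tangible: averaging over $k$ neighbouring samples reduces the noise standard deviation by roughly $\sqrt{k}$, so large-scale features survive smoothing at noise levels that completely drown out small-scale, low-amplitude ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
x = np.linspace(0, 1, n)
coarse = np.sin(2 * np.pi * x)            # large-scale feature
fine = 0.2 * np.sin(2 * np.pi * 64 * x)   # small-scale, low-amplitude feature
signal = coarse + fine

for sigma in [0.1, 0.5, 2.0]:
    noisy = signal + sigma * rng.standard_normal(n)
    # Local averaging over k samples cuts the noise std by ~sqrt(k), so the
    # coarse feature re-emerges while the fine one is drowned out.
    k = 32
    smoothed = np.convolve(noisy, np.ones(k) / k, mode="same")
    corr_coarse = np.corrcoef(smoothed, coarse)[0, 1]
    corr_fine = np.corrcoef(noisy - smoothed, fine)[0, 1]
    print(f"sigma={sigma}: coarse corr={corr_coarse:.2f}, fine corr={corr_fine:.2f}")
```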

NorbertZheng commented 2 years ago

Closing thoughts

So there you have it: diffusion models are autoencoders. Sort of. When you squint a bit. Here are some key takeaways, to wrap up:

- Diffusion models are trained to invert a gradual noising process, and the network at their core estimates the score function $\nabla_{x} \log p(x)$.
- That network is, up to a sign and a residual connection, a denoising autoencoder conditioned on the noise level, without a bottleneck.
- The level of noise corresponds to the scale of the features the model captures, which is a big part of why training a single network shared across all noise levels works so well.