NeurIPS '19 | Generative modeling by estimating gradients of the data distribution.

NorbertZheng / read-papers

My paper reading notes.

MIT License

8 stars 0 forks source link

NeurIPS '19 | Generative modeling by estimating gradients of the data distribution. #33

Closed NorbertZheng closed 2 years ago

NorbertZheng commented 2 years ago

Song Y, Ermon S. Generative modeling by estimating gradients of the data distribution. Yang Song. Generative Modeling by Estimating Gradients of the Data Distribution. Sayantan Das. Inject Noise to Remove Noise: A Deep Dive into Score-Based Generative Modeling Techniques.

NorbertZheng commented 2 years ago

Related Reference

Larochelle H, Murray I. The neural autoregressive distribution estimator.
Germain M, Gregor K, Murray I, et al. Made: Masked autoencoder for distribution estimation.
Van Den Oord A, Kalchbrenner N, Kavukcuoglu K. Pixel recurrent neural networks.
Dinh L, Krueger D, Bengio Y. Nice: Non-linear independent components estimation.
Dinh L, Sohl-Dickstein J, Bengio S. Density estimation using real nvp.
LeCun Y, Chopra S, Hadsell R, et al. A tutorial on energy-based learning.
Song Y, Kingma D P. How to train your energy-based models.
Kingma D P, Welling M. Auto-encoding variational bayes.
Rezende D J, Mohamed S, Wierstra D. Stochastic backpropagation and approximate inference in deep generative models.
Mohamed S, Lakshminarayanan B. Learning in implicit generative models.
Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks.
Salimans T, Goodfellow I, Zaremba W, et al. Improved techniques for training gans.
Metz L, Poole B, Pfau D, et al. Unrolled generative adversarial networks.
Liu Q, Lee J, Jordan M. A kernelized Stein discrepancy for goodness-of-fit tests.
Hyvärinen A, Dayan P. Estimation of non-normalized statistical models by score matching.
Vincent P. A connection between score matching and denoising autoencoders.
Song Y, Ermon S. Generative modeling by estimating gradients of the data distribution.
Song Y, Ermon S. Improved techniques for training score-based generative models.
Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models.
Song Y, Sohl-Dickstein J, Kingma D P, et al. Score-based generative modeling through stochastic differential equations.
Dhariwal P, Nichol A. Diffusion models beat gans on image synthesis.
Ho J, Saharia C, Chan W, et al. Cascaded Diffusion Models for High Fidelity Image Generation.
Chen N, Zhang Y, Zen H, et al. WaveGrad: Estimating gradients for waveform generation.
Kong Z, Ping W, Huang J, et al. Diffwave: A versatile diffusion model for audio synthesis.
Popov V, Vovk I, Gogoryan V, et al. Grad-tts: A diffusion probabilistic model for text-to-speech.
Cai R, Yang G, Averbuch-Elor H, et al. Learning gradient fields for shape generation.
Mittal G, Engel J, Hawthorne C, et al. Symbolic music generation with diffusion models.
Jalal A, Arvinte M, Daras G, et al. Robust compressed sensing mri with deep generative priors.
Hinton G E. Training products of experts by minimizing contrastive divergence.
Song Y, Garg S, Shi J, et al. Sliced score matching: A scalable approach to density and score estimation.
Parisi G. Correlation functions and computer simulations.
Grenander U, Miller M I. Representations of knowledge in complex systems.
Jolicoeur-Martineau A, Piché-Taillefer R, Combes R T, et al. Adversarial score matching and improved sampling for image generation.
Anderson B D O. Reverse-time diffusion equation models.
Song Y, Durkan C, Murray I, et al. Maximum likelihood training of score-based diffusion models.
Jolicoeur-Martineau A, Li K, Piché-Taillefer R, et al. Gotta go fast when generating data with score-based models.
Heusel M, Ramsauer H, Unterthiner T, et al. Gans trained by a two time-scale update rule converge to a local nash equilibrium.
Karras T, Aittala M, Hellsten J, et al. Training generative adversarial networks with limited data.
Chen R T Q, Rubanova Y, Bettencourt J, et al. Neural ordinary differential equations.
Grathwohl W, Chen R T Q, Bettencourt J, et al. Scalable reversible generative models with free-form continuous dynamics.
Song Y, Shen L, Xing L, et al. Solving inverse problems in medical imaging with score-based generative models.
Neal R M. Annealed importance sampling.
Sohl-Dickstein J, Weiss E, Maheswaranathan N, et al. Deep unsupervised learning using nonequilibrium thermodynamics.
Bordes F, Honari S, Vincent P. Learning to generate samples from noise through infusion training.
ALIAS PARTH GOYAL A G, Ke N R, Ganguli S, et al. Variational walkback: Learning a transition operator as a stochastic recurrent net.
Alain G, Bengio Y, Yao L, et al. GSNs: generative stochastic networks.
De Bortoli V, Thornton J, Heng J, et al. Diffusion Schrödinger bridge with applications to score-based generative modeling.
Song J, Meng C, Ermon S. Denoising diffusion implicit models.
Luhman E, Luhman T. Knowledge distillation in iterative generative models for improved sampling speed.
Vahdat A, Kreis K, Kautz J. Score-based generative modeling in latent space.

NorbertZheng commented 2 years ago

Overview

This blog post focuses on a promising new direction for generative modeling. We can learn score functions $\nabla{x}logp{\theta}(x)$ (gradients of log probability density functions) on a large number of noise-perturbed data distributions, then generate samples with Langevin-type sampling. The resulting generative models, often called score-based generative models, has several important advantages over existing model families: GAN-level sample quality without adversarial training, flexible model architectures, exact log-likelihood computation, and inverse problem solving without re-training models. In this blog post, we will show you in more detail the intuition, basic concepts, and potential applications of score-based generative models.

NorbertZheng commented 2 years ago

Introduction

Existing generative modeling techniques can largely be grouped into two categories based on how they represent probability distributions.

Likelihood-based models

Likelihood-based models directly learn the distribution's probability density (or mass) function via (approximate) maximum likelihood. Typical likelihood-based models include

autoregressive models [1, 2, 3],
normalizing flow models [4, 5],
energy-based models (EBMs) [6, 7],
variational auto-encoders (VAEs) [8, 9].

Bayesian networks, Markov random fields (MRF), autoregressive models, and normalizing flow models are all examples of likelihood-based models. All these models represent the probability density or mass function of a distribution.

Implicit generative models

In implicit generative models, the probability distribution is implicitly represented by a model of its sampling process. The most prominent example is generative adversarial networks (GANs), where new samples from the data distribution are synthesized by transforming a random Gaussian vector with a neural network.

GAN is an example of implicit models. Diffusion models should also be implicit generative models. It implicitly represents a distribution over all objects that can be produced by the generator network.

NorbertZheng commented 2 years ago

Likelihood-based models and implicit generative models, however, both have significant limitations.

Likelihood-based models either require strong restrictions on the model architecture to ensure a tractable normalizing constant for likelihood computation or must rely on surrogate objectives to approximate maximum likelihood training.
Implicit generative models, on the other hand, often require adversarial training, which is notoriously unstable and can lead to mode collapse.

In this blog post, I will introduce another way to represent probability distributions that may circumvent several of these limitations. The key idea is to

model the gradient of the log probability density function, a quantity often known as the (Stein) score function $\nabla{x}p{\theta}(x)$.

Such score-based models are not required to have a tractable normalizing constant, and can be directly learned by score matching [15, 16].

Score function (the vector field) and density function (contours) of a mixture of two Gaussians. score_contour

Score-based models have achieved state-of-the-art performance on many downstream tasks and applications. These tasks include, among others, image generation [17, 18, 19, 20, 21, 22] (Yes, better than GANs!), audio synthesis [23, 24. 25], shape generation, and music generation.

Moreover, score-based models have connections to normalizing flow models, therefore allowing exact likelihood computation and representation learning.

Additionally, modeling and estimating scores facilitates inverse problem solving, with applications such as image inpainting [17, 20], image colorization, compressive sensing, and medical image reconstruction (e.g., CT, MRI).

1024 x 1024 samples generated from score-based models. ffhq_samples

This post aims to show you the motivation and intuition of score-based generative modeling, as well as its basic concepts, properties and applications.

NorbertZheng commented 2 years ago

The score function, score-based models, and score matching

Suppose we are given a dataset { $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N$ }, where each point is drawn independently from an underlying data distribution $p(x)$. Given this dataset, the goal of generative modeling is to fit a model to the data distribution such that we can synthesize new data points at will by sampling from the distribution.

In order to build such a generative model, we first need to way to represent a probability distrtibution. One such way, as in likelihood-based models, is to directly model the probability density function (p.d.f.) or probability mass function (p.m.f). Let $f_{\theta}(x) \in \mathbb{R}$ be a real-valued function parameterized by a learnable parameter $\theta$. Hereafter we only consider probability density functions. Probability mass functions are similar. We can define a p.d.f. via

$$ \begin{align} p\theta(\mathbf{x}) = \frac{e^{-f\theta(\mathbf{x})}}{Z_\theta}, \end{align} $$

, where $Z{\theta} > 0$ is a normalizing constant dependent on $\theta$, such that $\int p{\theta}(x)dx=1$. Here the function $f_{\theta}(x)$ is often called an unnormalized probabilistic model, or energy-based model.

We can train $p_{\theta}(x)$ by maximizing the log-likelihood of the data

$$ \begin{align} \max\theta \sum{i=1}^N \log p_\theta(\mathbf{x}_i). \end{align} $$

However, the above equation requires $p{\theta}(x)$ to be a normalized probability density function. This is undesirable because in order to compute $p{\theta}(x)$, we must evaluate the normalizing constant $Z{\theta}$ - a typically intractable quantity for any general $f{\theta}(x)$. Thus to make maximum likelihood training feasible, likelihood-based models must

either restrict their model architectures (e.g. causal convolutions in autoregressive models, invertible networks in normalizing flow models) to make $Z_{\theta}$ tractable,
or approximate the normalizing constant (e.g. variational inference in VAEs, or MCMC sampling used in constrastive divergence) which may be computationally expensive.

NorbertZheng commented 2 years ago

By modeling the score function instead of the density function, we can sidestep the difficulty of intractable normalizing constants. The score function of a distribution $p(x)$ is defined as

$$ \nabla_{x}logp(x) $$

and a model for the score function is called a score-based model, which we denote as $s_{\theta}(x)$. The score-based model is learned such that

$$ s{\theta}(x)\approx\nabla{x}logp(x) $$

, and can be parameterized without worrying about the normalizing constant. For example, we can easily parameterize a score-based model with the energy-based model defined in

$$ \begin{align} p\theta(\mathbf{x}) = \frac{e^{-f\theta(\mathbf{x})}}{Z_\theta}, \end{align} $$

, via

$$ \begin{equation} \mathbf{s}_\theta (\mathbf{x}) = \nabla_{\mathbf{x}} \log p\theta (\mathbf{x} ) = -\nabla_{\mathbf{x}} f\theta (\mathbf{x}) - \underbrace{\nabla\mathbf{x} \log Z\theta}_{=0} = -\nabla\mathbf{x} f\theta(\mathbf{x}). \end{equation} $$

Note that the score-based model $s{\theta}(x)$ is independent of the normalizing constant $Z{\theta}$!

This significantly expands the family of models that we can tractably use, since we don't need any special architectures to make the normalizing constant tractable.

Parameterizing probability density functions. No matter how you change the model family and parameters, it has to be normalized (area under the curve must integrate to one). ebm

Parameterizing score functions. No need to worry about normalization. score

Similar to likelihood-based models, we can train score-based models by minimizing the Fisher divergence between the model and the data distributions (Here we slightly abuse the term as the name of a closely related expression for score-based models.), defined as

$$ \begin{equation} \mathbb{E}_{p(\mathbf{x})}[| \nabla\mathbf{x} \log p(\mathbf{x}) - \mathbf{s}\theta(\mathbf{x}) |_2^2] \end{equation} $$

Intuitively, the Fisher divergence compares the squared $\ell2$ distance between the ground-truth data score and the score-based model. Directly computing this divergence, however, is infeasible because it requires access to the unknown data score $\nabla{x}logp(x)$. Fortunately, there exists a family of methods called score-matching, i.e. denoising score matching, sliced score matching that minimize the Fisher divergence without knowledge of the ground-truth data score. Score matching objectives can directly be estimated on a dataset and optimized with stochastic gradient descent, analogous to the log-likelihood objective for training likelihood-based models (with known normalizing constants).

We can train the score-based model by minimizing a score-matching objective, without requiring adversarial optimization.

Additionally, using the score matching objective gives us a considerable amount of modeling flexibility. The Fisher divergence itself does not require $s_{\theta}(x)$ to be an actual score function of any normalized distribution - it simply compares the $\ell2$ distance between the ground-truth data score and the score-based model, with no additional assumptions on the form $s{\theta}(x)$.

In fact, the only requirement on the score-based model is that it should be a vector-valued function with the same input and output dimensionality, which is easy to satisfy in practice.

As a brief summary, we can represent a distribution by modeling its score function, which can be estimated by training a score-based model of free-form architectures with score matching.

NorbertZheng commented 2 years ago

Langevin dynamics

Once we have trained a score-based model $s{\theta}(x)\approx\nabla{x}logp(x)$, we can use an iterative procedure called Langevin dynamics [[31](), [32]()] to draw samples from it.

Langevin dynamics provides an MCMC procedure to sample from a distribution $p(x)$ using only its score function $\nabla{x}logp(x)$. Specifically, it initializes the chain from an arbitrary prior distribution $x{0} \sim \pi(x)$, and then iterates the following

$$ \begin{align} \mathbf{x}_{i+1} \gets \mathbf{x}i + \epsilon \nabla\mathbf{x} \log p(\mathbf{x}) + \sqrt{2\epsilon}~ \mathbf{z}_i, \quad i=0,1,\cdots, K, \end{align} $$

, where $z_i \sim \mathcal{N}(0, I)$.

When $\epsilon \to 0$ and $K \to \infty$, $x_{K}$ obtained from the procedure in the above equation converges to a sample from $p(x)$ under some regularity conditions.

In practice, the error is negligible when $\epsilon$ is sufficiently small and $K$ is sufficiently large.

Using Langevin dynamics to sample from a mixture of two Gaussians.

Note that Langevin dynamics accesses $p(x)$ only through $\nabla{x}logp(x)$. Since $s{\theta}(x) \approx \nabla{x}logp(x)$, we can produce samples from our score-based model $s{\theta}(x)$ by plugging it into the above equation.

NorbertZheng commented 2 years ago

Naive score-based generative modeling and its pitfalls

So far, we've discussed how to train a score-based model with score matching, and then produce samples via Langevin dynamics. However, this naive approach has had limited success in practice - we'll talk about some pitfalls of score matching that received little attention in prior works [17].

Score-based generative modeling with score matching + Langevin dynamics. smld

The key challenge is the fact that

the estimated score functions are inaccurate in low-density regions, where few data points are available for computing the score matching objective.

This is expected as score matching minimizes the Fisher divergence

$$ \mathbb{E}_{p(\mathbf{x})}[| \nabla\mathbf{x} \log p(\mathbf{x}) - \mathbf{s}\theta(\mathbf{x}) |2^2] = \int p(\mathbf{x}) | \nabla\mathbf{x} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}) |_2^2 \mathrm{d}\mathbf{x}. $$

Since the $\ell_2$ differences between the true data score function and score-based model are weighted by $p(x)$, they are largely ignored in low-density regions where $p(x)$ is small. This behavior can lead to subpar results, as illustrated by the figure below, estimated scores are only accurate in high-density regions. pitfalls

When sampling with Langevin dynamics, our initial sample is highly likely in low-density regions when data reside in a high-dimensional space. Therefore, having an inaccurate score-based model will derail Langevin dynamics from the very beginning of the procedure, preventing it from generating high quality samples that are representative of the data.

NorbertZheng commented 2 years ago

Score-based generative modeling with multiple noise perturbations

How can we bypass the difficulty of accurate score estimation in regions of low data density? Our solution is to

perturb data points with noise and train score-based models on the noisy data points instead.

When the noise magnitude is sufficiently large, it can populate low data density regions to improve the accuracy of estimated scores. For example, here is what happens when we perturb a mixture of two Gaussians perturbed by addtional Gaussian noise.

Estimated scores are accurate everywhere for the noise-perturbed data distribution due to reduced low data density regions. It seems that we modify the sample process on the distribution.

We pour more sample opportunities into low probability density regions, distorting the original sampling probability density.

Yet another question remains: how do we choose an appropriate noise scale for the perturbation process?

Larger noise can obviously cover more low-density regions for better score estimation, but it over-corrupts the data and alters it significantly from the original distribution. Seems that sample density is bound to distribution density? After all, the model assumes that all samples are from the original distribution.
Small noise, on the other hand, causes less corruption of the original data distribution, but does not cover the low-density regions as well as we would like.

To achieve the best of both worlds, we use multiple scales of noise perturbations simultaneously [17, 18]. Suppose we always perturb the data with isotropic Gaussian noise, and let there be a total of $L$ increasing standard deviations $\sigma_1 < \sigma_2 < \cdots < \sigma_L$. We first perturb the data distribution $p(x)$ with each of the Gaussian noise $\mathcal{N}(0, \sigma_i^2 I), i=1,2,\cdots,L$ to obtain a noise-perturbed distribution

$$ p_{\sigma_i}(\mathbf{x}) = \int p(\mathbf{y}) \mathcal{N}(\mathbf{x}; \mathbf{y}, \sigma_i^2 I) \mathrm{d} \mathbf{y}. $$

Note that we can easily draw samples from $p_{\sigma_i}(\mathbf{x})$ by sampling $\mathbf{x} \sim p(\mathbf{x})$ and computing $\mathbf{x} + \sigma_i \mathbf{z}$, with $\mathbf{z} \sim \mathcal{N}(0, I)$, which does not influence the sampling process over the original data distribution.

Next, we estimate the score function of each noise-perturbed distribution,

$$ \nabla\mathbf{x} \log p{\sigma_i}(\mathbf{x}) $$

, by training a Noise Conditioned Score-Based Model $\mathbf{s}_\theta(\mathbf{x}, i)$ (also called a Noise Conditional Score Network, or NCSN [17, 18, 20], when parameterized with a neural network) with score mathcing, such that $\mathbf{s}_\theta(\mathbf{x}, i) \approx \nabla\mathbf{x} \log p{\sigma_i}(\mathbf{x})$ for all $i= 1, 2, \cdots, L$.

We apply multiple scales of Gaussian noise to perturb the data distribution (first row), and jointly estimate the score functions for all of them (second row). multi_scale

Perturbing an image with multiple scales of Gaussian noise. duoduo

NorbertZheng commented 2 years ago

The training objective for $\mathbf{s}_\theta(\mathbf{x}, i)$ is a weighted sum of Fisher divergences for all noise scales. In particular, we use the objective below:

$$ \begin{equation} \sum{i=1}^L \lambda(i) \mathbb{E}_{p{\sigmai}(\mathbf{x})}[| \nabla\mathbf{x} \log p_{\sigmai}(\mathbf{x}) - \mathbf{s}\theta(\mathbf{x}, i) |_2^2], \end{equation} $$

, where $\lambda(i) \in \mathbb{R}_{>0}$ is a positive weighting function, often chosen to be $\lambda(i) = \sigmai^2$. The above objective can be optimized with score matching, exactly as in optimizing the naive (unconditional) score-based model $\mathbf{s}\theta(\mathbf{x})$.

After training our noise-conditional score-based model $\mathbf{s}_\theta(\mathbf{x}, i)$, we can produce samples from it by running Langevin dynamics for $i = L, L-1, \cdots, 1$ in sequence. This method is called annealed Langevin dynamics (defined by Algorithm 1 in [17], and improved by [18, 33]), since the noise scale $\sigma_{i}$ decreases (anneals) gradually over time.

Annealed Langevin dynamics combine a sequence of Langevin chains with gradually decreasing noise scales. ald

Annealed Langevin dynamics for the Noise Conditional Score Network (NCSN) model (from ref. [17]) trained on CelebA (left) and CIFAR-10 (right). We can start from unstructured noise, modify images according to the scores, and generate nice samples. The method achieved state-of-the-art Inception score on CIFAR-10 at its time. cifar10_large

NorbertZheng commented 2 years ago

Here are some practical recommendations for tuning score-based generative models with multiple noise scales:

Choose $\sigma_1 < \sigma_2 < \cdots < \sigma_L$ as a geometric progression, with $\sigma_1$ being sufficiently small and $\sigma_L$ comparable to the maximum distance between all training data points [18]. $L$ is typically on the order of hundreds ot thousands.
Parameterize the score-based model $\mathbf{s}_\theta(\mathbf{x}, i)$ with U-Net skip connections [17, 19].
Apply exponential moving average on the weights of the score-based model when used at test time [18, 19].

With such best practices, we are able to generate high quality image samples with comparable quality to GANs on various datasets, such as below: samples from the NCSNv2 [18] model. From the left to right: FFHQ 256x256, LSUN bedroom 128x128, LSUN tower 128x128, LSUN church_outdoor 96x96, and CelebA 64x64. ncsnv2

NorbertZheng commented 2 years ago

Score-based generative modeling with stochastic differential equations (SDEs)

As we already discussed, adding multiple noise scales is critical to the success of score-based generative models. By

generalizing the number of noise scales to infinity [20],

we obtain

higher quality samples,
exact log-likelihood computation,
controllable generation for inverse problem solving.

In addition to this introduction, we have tutorials written in Google Colab to provide a step-by-step guide for training a toy model on MNIST. We also have more advanced code repositories that provide full-fledged implementations for large-scale applications.	Link	Description
	Tutorial of score-based generative modeling with SDEs in JAX + FLAX
	Load our pretrained checkpoints and play with sampling, likelihood computation, and controllable synthesis (JAX + FLAX)
	Tutorial of score-based generative modeling with SDEs in PyTorch
	Load our pretrained checkpoints and play with sampling, likelihood computation, and controllable synthesis (PyTorch)
Code in JAX	Score SDE codebase in JAX + FLAX
Code in PyTorch	Score SDE codebase in PyTorch

NorbertZheng commented 2 years ago

Perturbing data with an SDE

When the number of noise scales approaches infinity, we essentially perturb the data distribution with continuously growing levels of noise. In this case, the noise perturbation procedure is a continuous-time stochastic process, as demonstrated below, perturbing data to noise with a continuous-time stochastic process. perturb_vp1

How can we represent a stochastic process in a concise way? Many stochastic processes (diffusion processes in particular) are solutions of stochastic differential equations (SDEs). In general, an SDE processes the following form:

$$ \begin{align} \mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t) \mathrm{d}t + g(t) \mathrm{d} \mathbf{w}, \end{align} $$

, where $\mathbf{f}(\cdot, t): \mathbb{R}^d \to \mathbb{R}^d$ is a vector-valued function called the drift coefficient, $g(t)\in \mathbb{R}$ is a real-valued function called the diffusion coefficient, $\mathbf{w}$ denotes a standard Brownian motion, and $\mathrm{d} \mathbf{w}$ can be viewed as infiniteimal white noise. The solution of a stochastic differential equation is a continuous collection of random variables { $\mathbf{x}(t)$ } $_{t \in [0, T]}$. These random variables trace stochastic trajectories as the time index $t$ grows from the start time $0$ to the end time $T$. Let $p{t}(x)$ denotethe (marginal) probability density function of $\mathbf{x}(t)$. Here $t \in [0, T]$ is analogous to $i = 1, 2, \cdots, L$ when we had a finite number of noise scales, and $p{t}(x)$ is analogous to $p_{\sigma_i}(\mathbf{x})$. Clearly, $p_0(\mathbf{x}) = p(\mathbf{x})$ is the data distribution since no perturbation is applied to data at $t=0$.

After perturbing $p(x)$ with the stochastic process for a sufficiently long time $T$, e.g. sufficiently large noise scale, $p_{T}(x)$ becomes close to a tractable noise distribution $\pi(x)$, called a prior distribution.

We note that $p{T}(x)$ is analogous to $p{\sigmaL}(\mathbf{x})$ in the case of finite noise scales, which corresponds to applying the largest noise perturbation $\sigma{L}$ to the data.

The above SDE is hand designed, similarly to how we hand-designed $\sigma_1 < \sigma_2 < \cdots < \sigma_L$ in the case of finite noise scales. There are numerous ways to add noise perturbations, and the choice of SDEs is not unique. For example, the following SDE

$$ \begin{align} \mathrm{d}\mathbf{x} = e^{t} \mathrm{d} \mathbf{w} \end{align} $$

perturbs data with Gaussian noise of mean zero and exponentially growing variance, which is analogous to perturbing data with $\mathcal{N}(0, \sigma_1^2 I), \mathcal{N}(0, \sigma_2^2 I), \cdots, \mathcal{N}(0, \sigma_L^2 I)$ when $\sigma_1 < \sigma_2 < \cdots < \sigma_L$ is a geometric progression. Therefore, the SDE should be viewed as part of the model, much like { $\sigma_1, \sigma_2, \cdots, \sigma_L$ }. in [20], we provide three SDEs that generally work well for images:

the Variance Exploding SDE (VE SDE),
the Variance Preserving SDE (VP SDE),
the sub-VP SDE.

NorbertZheng commented 2 years ago

Reversing the SDE for sample generation

Recall that with a finite number of noise scales, we can generate samples by reversing the perturbation process with annealed Langevin dynamics, i.e. sequentially sampling from each noise-perturbed distribution using Langevin dynamics. For infinite noise scales, we can analogously reverse the perturbation for sampling generation by using the reverse SDE.

Generate data from noise by reversing the perturbation procedure. denoise_vp1

Importantly, any SDE has a corresponding reverse SDE [[34]()], whose closed form is given by

$$ \begin{equation} \mathrm{d}\mathbf{x} = [\mathbf{f}(\mathbf{x}, t) - g^2(t) \nabla_\mathbf{x} \log p_t(\mathbf{x})]\mathrm{d}t + g(t) \mathrm{d} \mathbf{w}. \end{equation} $$

Here $dt$ represents a negative infinitesimal time step, since the above SDE needs to be solved backwards in time (from $t=T$ to $t=0$).

In order to compute the reverse SDE, we need to estimate $\nabla{x}logp{t}(x)$, which is exactly the score function of $p_{t}(x)$.

Solving a reverse SDE yields a score-based generative model. Transforming data to a simple noise distribution can be accomplished with an SDE. It can be reversed to generate samples from noise if we know the score of the distribution at each intermediate time step. sde_schematic

NorbertZheng commented 2 years ago

Estimating the reverse SDE with score-based models and score matching

Solving the reverse SDE requires us to know the terminal distribution $p{T}(x)$, and the score function $\nabla{x}logp{t}(x)$. By design, the former is close to the prior distribution $\pi(x)$ which is fully tractable. In order to estimate $\nabla{x}logp{t}(x)$, we train a Time-Dependent Score-Based Model $s{\theta}(x,t)$, such that $\mathbf{s}_\theta(\mathbf{x}, t) \approx \nabla_\mathbf{x} \log pt(\mathbf{x})$. This is analogous to the noise-conditional score-based model $s{\theta}(x,i)$ used for finite noise scales, trained such that $\mathbf{s}_\theta(\mathbf{x}, i) \approx \nabla\mathbf{x} \log p{\sigma_i}(\mathbf{x})$.

Our training objective for $\mathbf{s}_\theta(\mathbf{x}, t)$ is a continuous weighted combination of Fisher divergences, given by

$$ \begin{equation} \mathbb{E}_{t \in \mathcal{U}(0, T)}\mathbb{E}_{pt(\mathbf{x})}[\lambda(t) | \nabla\mathbf{x} \log pt(\mathbf{x}) - \mathbf{s}\theta(\mathbf{x}, t) |_2^2], \end{equation} $$

, where $\mathcal{U}(0, T)$ denotes a uniform distribution over the time interval $[0,T]$, and $\lambda: \mathbb{R} \to \mathbb{R}_{>0}$ is a positive weighting function. Typically we use $\lambda(t) \propto 1/ \mathbb{E}[| \nabla_{\mathbf{x}(t)} \log p(\mathbf{x}(t) \mid \mathbf{x}(0))|_2^2]$ to balance the magnitude of different score matching losses across time.

As before, our weighted combination of Fisher divergences can be efficiently optimized with score matching methods, such as denoising score matching, and sliced score matching. Once our score-based modeel $\mathbf{s}_\theta(\mathbf{x}, t)$ is trained to optimality, we can plug it into the expression of the reverse SDE to obtain an estimated reverse SDE.

$$ \begin{equation} \mathrm{d}\mathbf{x} = [\mathbf{f}(\mathbf{x}, t) - g^2(t) \mathbf{s}_\theta(\mathbf{x}, t)]\mathrm{d}t + g(t) \mathrm{d} \mathbf{w}. \end{equation} $$

We can start with $\mathbf{x}(T) \sim \pi$, and solve the above reverse SDE to obtain a sample $x(0)$. Let us denote the distribution of $x(0)$ obtained in such way as $p\theta$. When the score-based model $\mathbf{s}_\theta(\mathbf{x}, t)$ is well-trained, we have $p\theta \approx p0$, in which case $x(0)$ is an approximate sample from the data distribution $p\theta$.

When $\lambda(t) = g^2(t)$, we have an important connection between our weighted combination of Fisher divergences and the KL divergence from $p0$ to $p\theta$ under some regularity conditions [35]:

$$ \operatorname{KL}(p0(\mathbf{x})|p\theta(\mathbf{x})) \leq \frac{T}{2}\mathbb{E}_{t \in \mathcal{U}(0, T)}\mathbb{E}_{pt(\mathbf{x})}[\lambda(t) | \nabla\mathbf{x} \log pt(\mathbf{x}) - \mathbf{s}\theta(\mathbf{x}, t) |_2^2] + \operatorname{KL}(p_T \mathrel| \pi). $$

Due to this special connection to the KL divergence and the equivalence between minimizing KL divergences and maximizing likelihood for model training, we call $\lambda(t) = g(t)^2$ the likelihood weighting function. Using this likelihood weighting function, we can train score-based generative models to achieve very high likelihoods, comparable or even superior to state-of-the-art autoregressive models [35].

NorbertZheng commented 2 years ago

How to solve the reverse SDE

By solving the estimated reverse SDE with numerical SDE solvers, we can simulate the reverse stochastic process for sample generation. Perhaps the simplest numerical SDE solver is the Euler-Maruyama method. When applied to our estimated reverse SDE, it discretizes the SDE using finite time steps and small Gaussian noise. Specifically, it chooses a small negative time step $\Delta t \approx 0$, initializes $t \leftarrow T$, and iterates the following procedure until $t \approx 0$:

$$ \begin{aligned} \Delta \mathbf{x} &\gets [\mathbf{f}(\mathbf{x}, t) - g^2(t) \mathbf{s}_\theta(\mathbf{x}, t)]\Delta t + g(t) \sqrt{\vert \Delta t\vert }\mathbf{z}_t \ \mathbf{x} &\gets \mathbf{x} + \Delta \mathbf{x}\ t &\gets t + \Delta t, \end{aligned} $$

Here $\mathbf{z}_t \sim \mathcal{N}(0, I)$. The Euler-Maruyama method is qualitatively similar to Langevin dynamics - both update $\mathbf{x}$ by the following score functions perturbed with Gaussian noise.

Aside from the Euler-Maruyama method, other numerical SDE solvers can be directly employed to solve the reverse SDE for sample generation, including, for example, Milstein method, and stochastic Runge-Kutta methods. In [20], we provided a reverse diffusion solver similar to Euler-Maruyama, but more tailored for solving reverse-time SDEs. More recently, authors in [36] introduced adaptive step-size SDE solvers that can generate samples faster with better quality.

NorbertZheng commented 2 years ago

In addition, there are two special properties of our reverse SDE that allow for even more flexible sampling methods:

We have an estimate of $\nabla{x}logp{t}(x)$ via our time-dependent score-based model $\mathbf{s}_\theta(\mathbf{x}, t)$.
We only care about sampling from each marginal distribution $p_{t}(x)$. Samples obtained at different time steps can have arbitrary correlations and do not have to form a particular trajectory sampled from the reverse SDE, e.g. we only care about the distribution of each time step, instead of the process.

As a consequence of these two properties, we can apply MCMC approaches to fine-tune the trajectories obtained from numerical SDE solvers. After all, the second property enables the rationality of the fine-tuning, and the first property enables the execution of the fine-tuning. Specifically, we propose Predictor-Corrector samplers.

The predictor can be any numerical SDE solver that predicts $\mathbf{x}(t + \Delta t) \sim p_{t+\Delta t}(\mathbf{x})$ from an existing sample $\mathbf{x}(t) \sim p_t(\mathbf{x})$.
The corrector can be any MCMC procedure that solely relies on the score function, such as Langevin dynamics and Hamiltonian Monte Carlo.

At each step of the Predictor-Corrector sampler, we first use the predictor to choose a proper step size $\Delta t < 0$, and then predict $\mathbf{x}(t + \Delta t)$ based on the current sample $\mathbf{x}(t)$. Next, we run several corrector steps to improve the sample $\mathbf{x}(t + \Delta t)$ according to our score-based model $\mathbf{s}_\theta(\mathbf{x}, t + \Delta t)$, so that $\mathbf{x}(t + \Delta t)$ becomes a higher-quality sample from $p_{t+\Delta t}(\mathbf{x})$.

With Predictor-Corrector methods and better architectures of score-based models, we can achieve state-of-the-art sample quality on CIFAR-10 (measured in FID [37] and Inception scores [12]), outperforming the best GAN model to date (StyleGAN2 + ADA [38]).	Method	FID $\downarrow$	Inception score $\uparrow$
StyleGAN2 + ADA [38]	2.92	9.83
Ours [20]	2.20	9.89

The sampling methods are also scalable for extremely high-dimensional data. For example, it can successfully generate high-fidelity images of resolution 1024 x 1024. 1024 x 1024 samples from a score-based model trained on the FFHQ dataset. ffhq_1024

NorbertZheng commented 2 years ago

Probability flow ODE

Despite being capable of generating high-quality samples,

samplers based on Langevin MCMC and SDE solvers do not provide a way to compute the exact log-likelihood of score-based generative models.

Below, we introduce a sampler based on ordinary differential equations (ODEs) that allow for exact likelihood computation.

In [20], we show it is possible to convert any SDE into an ordinary differential equation (ODE) without changing its marginal distributions { $p{t}(x)$ } ${t \in [0,T]}$. Thus by solving this ODE, we can sample from the same distributions as the reverse SDE. The corresponding OSE of an SDE is named probability flow ODE [20], given by

$$ \begin{equation} \mathrm{d} \mathbf{x} = \bigg[\mathbf{f}(\mathbf{x}, t) - \frac{1}{2}g^2(t) \nabla_\mathbf{x} \log p_t(\mathbf{x})\bigg] \mathrm{d}t. \end{equation} $$

The following figure depicts trajectories of both SDEs and probability flow ODEs. Although ODE trajectories are noticeably smoother than SDE trajectories, they convert the same data distribution to the same prior distribution and vice versa, sharing the same set of marginal distributions { $p{t}(x)$ } ${t \in [0,T]}$. In other words, trajectories obtained by solving the probability flow ODE have the same marginal distributions as the SDE trajectories.

We can map data to a noise distribution (the prior) with an SDE, and reverse this SDE for generative modeling. We can also reverse the associated probability flow ODE, which yields a deterministic process that samples from the same distribution as the SDE. Both the reverse-time SDE and probability flow ODE can be obtained by estimating score functions. teaser

This probability flow ODE formulation has several unique advantages.

When $\nabla_\mathbf{x} \log pt(\mathbf{x})$ is replaced by its approximation $\mathbf{s}\\theta(\mathbf{x}, t)$, the probability flow ODE becomes a special case of a neural ODE. In particular, it is an example of continuous normalizing flows, since the probability flow ODE converts a data distribution $p_0(\mathbf{x})$ to a prior noise distribution $p_T(\mathbf{x})$ (since it shares the same marginal distributions as the SDE) and is fully invertible.

As such, the probability flow ODE inherits all properties of neural ODEs or continuous normalizing flows, including exact log-likelihood computation.

Specifically, we can leverage the instantaneous change-of-variable formula (Theorem 1 in [39], Equation (4) in [40]) to compute the unknown data density $p_0$ from the known prior density $p_T$ with numerical ODE solvers.

In fact, our model achieves the state-of-the-art log-likelihoods on uniformly dequantized CIFAR-10 images [20], even without maximum likelihood training. It is typical for normalizing flow models to convert discrete images to continuous ones by adding small uniform noise to them.	Method	Negative log-likelihood (bits/dim) $\downarrow$
RealNVP	3.49
iResNet	3.45
Glow	3.35
FFJORD	3.40
Flow++	3.29
Ours	2.99

When training score-based models with the likelihood weighting we discussed before, and using variational dequantization to obtain likelihoods on discrete images, we can achieve comparable or even superior likelihood to the state-of-the-art autoregressive models (all without any data augmentation) [35].	Method	Negative log-likelihood (bits/dim) $\downarrow$ on CIFAR-10
Sparse Transformer	2.80	-
Image Transformer	2.90	3.77
Ours	2.83	3.76

NorbertZheng commented 2 years ago

Controllable generation for inverse problem solving

Score-based generative models are particularly suitable for solving inverse problems. At its core,

inverse problems are same as Bayesian inference problems.

Let $\mathbf{x}$ and $\mathbf{y}$ be two random variables, and suppose we know the forward process of generating $\mathbf{y}$ from $\mathbf{x}$, represented by the transition probability distribution $p(\mathbf{y} \mid \mathbf{x})$. The inverse problem is to compute $p(\mathbf{x} \mid \mathbf{y})$. From Bayes’ rule, we have

$$ p(\mathbf{x} \mid \mathbf{y}) = p(\mathbf{x}) p(\mathbf{y} \mid \mathbf{x}) / \int p(\mathbf{x}) p(\mathbf{y} \mid \mathbf{x}) \mathrm{d} \mathbf{x} $$

This expression can be greatly simplified by taking gradients with respect to $\mathbf{x}$ on both sides, leading to the following Bayes’ rule for score functions:

$$ \begin{equation} \nabla\mathbf{x} \log p(\mathbf{x} \mid \mathbf{y}) = \nabla\mathbf{x} \log p(\mathbf{x}) + \nabla_\mathbf{x} \log p(\mathbf{y} \mid \mathbf{x}). \end{equation} $$

Through score matching, we can train a model to estimate the score function of the unconditional data distribution, i.e., $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla\mathbf{x} \log p(\mathbf{x})$. This will allow us to easily compute the posterior score function $\nabla\mathbf{x} \log p(\mathbf{x} \mid \mathbf{y})$ from the known forward process $p(\mathbf{y} \mid \mathbf{x})$ via the above equation, and sample from it with Langevin-type sampling [20].

A recent work from UT Austin [28] has emonstrated that score-based generative models can be applied to solving inverse problems in medical imaging, such as accelerating magnetic resonance imaging (MRI). Concurrently in [41], we demonstrated superior performance of score-based generative models not only on accelerated MRI, but also sparse-view computed tomography (CT). We were able to achieve comparable or even better performance than supervised or unrolled deep learning approaches, while being more robust to different measurement processes at test time.

Below we show some examples on solving inverse problems for computer vision.

Class-conditional generation with an unconditional time-dependent score-based model, and a pre-trained noise-conditional image classifier on CIFAR-10.
Image inpainting with a time-dependent score-based model trained on LSUN bedroom. The leftmost column is ground-truth. The second column shows masked images (y in our framework). The rest columns show different inpainted images, generated by solving the conditional reverse-time SDE.
Image colorization with a time-dependent score-based model trained on LSUN church_outdoor and bedroom. The leftmost column is ground-truth. The second column shows gray-scale images (y in our framework). The rest columns show different colorizedimages, generated by solving the conditional reverse-time SDE.
We can even colorize gray-scale portrays of famous people in history (Abraham Lincoln) with a time-dependent score-based model trained on FFHQ. The image resolution is 1024 x 1024.

NorbertZheng commented 2 years ago

Connection to diffusion models and others

I started working on score-based generative modeling since 2019, when I was trying hard to make score matching scalable for training deep energy-based models on high-dimensional datasets. My first attempt at this led to the method sliced score matching. Despite the scalability of sliced score matching for training energy-based models, I found to my surprise that Langevin sampling from those models fails to produce reasonable samples even on the MNIST dataset. I started investigating this issue and discovered three crucial improvements that can lead to extremely good samples:

perturbing data with multiple scales of noise, and training score-based models for each noise scale;
using a U-Net architecture (we used RefineNet since it is a modern version of U-Nets) for the score-based model;
applying Langevin MCMC to each noise scale and chaining them together.

With those methods, I was able to obtain the state-of-the-art Inception Score on CIFAR-10 in [17] (even better than the best GANs!), and generate high-fidelity image samples of resolution up to $256 × 256$ in [18].

The idea of perturbing data with multiple scales of noise is by no means unique to score-based generative models though. It has been previously used in, for example, simulated annealing, [annealed importance sampling](), diffusion probabilistic models, infusion training, and variational walkback for generative stochastic networks.

Out of all these works, diffusion probabilistic modeling is perhaps the closest to score-based generative modeling. Diffusion probabilistic models are hierachical latent variable models first proposed by Jascha and its colleagues [43] in 2015, which generate samples by learning a variational decoder to reverse a discrete diffusion process that perturbs data to noise. Without awareness of this work, score-based generative modeling was proposed and motivated independently from a very different perspective. Despite both perturbing data with multiple scales of noise, the connection between score-based generative modeling and diffusion probabilistic modeling seemed superficial at that time, since

the former is trained by score matching and sampled by Langevin dynamics,
the latter is trained by the evidence lower bound (ELBO) and sampled with a learned decoder.

In 2020, Jonathan Ho and colleagues [19] significantly improved the empirical performance of diffusion probabilistic models and first unveiled a deeper connection to score-based generative modeling. They showed that

the ELBO used for training diffusion probabilistic models is essentially equivalent to the weighted combination of score matching objectives used in score-based generative modeling.

Moreover, by parameterizing the decoder as a sequence of score-based models with a U-Net architecture, they demonstrated for the first time that diffusion probabilistic models can also generate high quality image samples comparable or superior to GANs.

Inspired by their work, we further investigated the relationship between diffusion models and score-based generative models in an ICLR 2021 paper [20]. We found that

the sampling method of diffusion probabilistic models can be integrated with annealed Langevin dynamics of score-based models to create a unified and more powerful sampler (the Predictor-Corrector sampler).

By generalizing the number of noise scales to infinity, we further proved that score-based generative models and diffusion probabilistic models can both be viewed as discretizations to stochastic differential equations determined by score functions. This work bridges both score-based generative modeling and diffusion probabilistic modeling into a unified framework.

Collectively, these latest developments seem to indicate that both score-based generative modeling with multiple noise perturbations and diffusion probabilistic models are different perspectives of the same model family, much like how wave machanics and matrix mechanics are equivalent formulations of quantum mechanics in the history of physics. Goes without saying that the significance of score-based generative models/diffusion probabilistic models is in no way comparable to quantum mechanics.

The perspective of score matching and score-based models allows one to

calculate log-likelihoods exactly,
solve inverse problems naturally,
is directly connected to energy-based models, Schrödinger bridges and optimal transport [47].

The perspective of diffusion models is naturally connected to VAEs, lossy compression, and can be directly incorporated with variational probabilistic inference.

This blog post focuses on the first perspective, but I highly recommend interested readers to learn about the alternative perspective of diffusion models as well (see a great blog by Lilian Weng).

Many recent works on score-based generative models or diffusion probabilistic models have been deeply influenced by knowledge from both sides of research (see a website curated by researchers at the University of Oxford). Despite this deep connection between score-based generative models and diffusion models, it is hard to come up with an umbrella term for the model family that they both belong to. Some colleagues in DeepMind propose to call them “Generative Diffusion Processes”. It remains to be seen if this will be adopted by the community in the future.

NorbertZheng commented 2 years ago

Concluding remarks

This blog post gives a detailed introduction to score-based generative models. We demonstrate that this new paradigm of generative modeling is able to

produce high-quality samples,
compute exact log-likelihoods,
perform controllable generation for inverse problem-solving.

It is a compilation of several papers we published in the past few years. Please visit them if you are interested in more details:

For a list of works that have been influenced by score-based generative modeling, researchers at the University of Oxford have built a very useful (but necessarily incomplete) website: https://scorebasedgenerativemodeling.github.io/.

There are two major challenges of score-based generative models. First, the sampling speed is slow since it involves a large number of Langevin-type iterations. Second, it is inconvenient to work with discrete data distributions since scores are only defined on continuous distributions.

The first challenge can be partially solved by using numerical ODE solvers for the probability flow ODE with lower precision (a similar method, denoising diffusion implicit modeling, has been proposed in [48]). It is also possible to learn a direct mapping from the latent space of probability flow ODEs to the image space, as shown in [49]. However, all such methods to date result in worse sample quality.

The second challenge can be addressed by learning an autoencoder on discrete data and performing score-based generative modeling on its continuous latent space [27, 50]. Jascha’s original work on diffusion models [43] also provides a discrete diffusion process for discrete data distributions, but its potential for large-scale applications remains yet to be proven.

It is my conviction that these challenges will soon be solved with the joint efforts of the research community, and score-based generative models/ diffusion-based models will become one of the most useful tools for data generation, density estimation, inverse problem solving, and many other downstream tasks in machine learning.