NorbertZheng / read-papers

My paper reading notes.
MIT License

UAI '19 | Sliced score matching: A scalable approach to density and score estimation. #32

Closed NorbertZheng closed 2 years ago

NorbertZheng commented 2 years ago

Song Y, Garg S, Shi J, et al. Sliced score matching: A scalable approach to density and score estimation. UAI 2019.

NorbertZheng commented 2 years ago

Related Reference

NorbertZheng commented 2 years ago

Overview

An overview of our UAI 2019 paper on Sliced Score Matching.

Theoretically, sliced score matching produces a consistent and asymptotically normal estimator under some regularity conditions. In practice, we apply sliced score matching to training deep energy-based models, learning VAEs with implicit encoders, and training Wasserstein Auto-Encoders (WAEs).

NorbertZheng commented 2 years ago

Unnormalized probability models

Let $f_{\theta}(x)$ be a real-valued function on $x \in \mathbb{R}^{D}$. We can define a probability model from $f_{\theta}(x)$, using the following formula

$$ p_{\theta}(x)=\frac{e^{-f_{\theta}(x)}}{Z_{\theta}}. $$

Here we assume that

$$ Z_{\theta}=\int e^{-f_{\theta}(x)}dx $$

exists and call it the partition function. Typically, it is intractable to evaluate the partition function, so $Z_{\theta}$ is an unknown quantity. The probability model $p_{\theta}(x)$ is called an unnormalized probability model, because the normalizing constant $Z_{\theta}$ is unknown.
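As a minimal sketch (not from the paper, in PyTorch), an unnormalized model of this form might look as follows; `EnergyNet` is a hypothetical network standing in for $f_{\theta}(x)$, and only the unnormalized part $-f_{\theta}(x)$ of $\log p_{\theta}(x)$ is ever computed:

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Hypothetical energy function f_theta: R^D -> R (one scalar per sample)."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # shape: (batch,)

def unnormalized_log_prob(f: EnergyNet, x: torch.Tensor) -> torch.Tensor:
    # log p_theta(x) = -f_theta(x) - log Z_theta; the partition function is
    # unknown, so we can only return the unnormalized part -f_theta(x).
    return -f(x)
```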

NorbertZheng commented 2 years ago

Learning unnormalized models with score matching

The de facto standard for learning a probability model is maximum likelihood estimation (MLE). However, MLE is not suitable for learning unnormalized probability models, because the log-likelihood

$$ \log p_{\theta}(\mathbf{x}) = -f_{\theta}(\mathbf{x}) - \log Z_{\theta} $$

depends on the intractable partition function $Z_{\theta}$.
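To see the difficulty concretely (a standard derivation, not spelled out in the note above), the gradient of the log-likelihood contains an expectation under the model itself:

$$ \nabla_{\theta}\log p_{\theta}(\mathbf{x}) = -\nabla_{\theta}f_{\theta}(\mathbf{x}) - \nabla_{\theta}\log Z_{\theta} = -\nabla_{\theta}f_{\theta}(\mathbf{x}) + \mathbb{E}_{p_{\theta}(\mathbf{x}')}\left[\nabla_{\theta}f_{\theta}(\mathbf{x}')\right], $$

so even gradient-based MLE requires (approximate) samples from $p_{\theta}$, e.g. via MCMC.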

Score matching bypasses the intractable $Z_{\theta}$ by only considering the score, which is defined as the gradient of the log-density with respect to the data, i.e., $\nabla_{x}\log p(x)$.

For example, the score of $p_{\theta}(x)$ is

$$ \nabla_{x}\log p_{\theta}(x)=-\nabla_{x}f_{\theta}(x). $$

Note that the score of an unnormalized probability model, $\nabla_{x}\log p_{\theta}(x)$, does not depend on the intractable partition function $Z_{\theta}$. Instead of using the likelihood $\log p_{\theta}(x)$ as the objective, score matching is based on the following objective

$$ \frac{1}{2}\mathbb{E}_{p_{\text{data}}}\left[\left\Vert\nabla_{x}\log p_{\text{data}}(x)-\nabla_{x}\log p_{\theta}(x)\right\Vert_{2}^{2}\right], $$

where $p_{\text{data}}(x)$ is the data distribution. The above objective is widely known as the Fisher divergence. Since it only involves $\nabla_{x}\log p_{\theta}(x)$ and does not require $Z_{\theta}$, it is ideal for training unnormalized models.
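As a quick sanity check (not from the paper), consider the 1-D Gaussians $p_{\text{data}}=\mathcal{N}(\mu_{1},1)$ and $p_{\theta}=\mathcal{N}(\mu_{2},1)$. Both scores are affine in $x$, so the Fisher divergence collapses to a squared distance between the means:

$$ \nabla_{x}\log\mathcal{N}(x;\mu,1)=\mu-x \quad\Longrightarrow\quad \frac{1}{2}\mathbb{E}_{p_{\text{data}}}\left[\left((\mu_{1}-x)-(\mu_{2}-x)\right)^{2}\right]=\frac{1}{2}(\mu_{1}-\mu_{2})^{2}, $$

which is zero exactly when the two distributions coincide.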

However, the Fisher divergence is not directly computable, because the data score $\nabla_{x}\log p_{\text{data}}(x)$ is unknown.

Score matching eliminates the data score using integration by parts. To simplify our discussion, we consider the Fisher divergence between distributions of 1-D random variables. We have

$$
\begin{aligned}
&\frac{1}{2}\mathbb{E}_{p_{\text{data}}}[(\nabla_{x} \log p_{\text{data}}(x) - \nabla_{x} \log p_{\theta}(x))^2] \\
=& \frac{1}{2} \int p_{\text{data}}(x) (\nabla_{x} \log p_{\text{data}}(x) - \nabla_{x} \log p_{\theta}(x))^2 \text{d}x \\
=& \underbrace{\frac{1}{2} \int p_{\text{data}}(x) (\nabla_{x} \log p_{\text{data}}(x))^2 \text{d}x}_{\text{const}} + \frac{1}{2} \int p_{\text{data}}(x) (\nabla_{x} \log p_{\theta}(x))^2 \text{d}x \\
&- \int p_{\text{data}}(x) \nabla_{x} \log p_{\theta}(x) \nabla_{x} \log p_{\text{data}}(x) \text{d}x.
\end{aligned}
$$

By integration by parts, we have

$$
\begin{aligned}
&- \int p_{\text{data}}(x) \nabla_{x} \log p_{\theta}(x) \nabla_{x} \log p_{\text{data}}(x) \text{d}x \\
=& - \int \nabla_{x} \log p_{\theta}(x) \nabla_{x} p_{\text{data}}(x) \text{d}x \\
=& - p_{\text{data}}(x) \nabla_{x} \log p_{\theta}(x)\bigg|^{\infty}_{-\infty} + \int p_{\text{data}}(x) \nabla_{x}^2 \log p_{\theta}(x) \text{d}x \\
\stackrel{(i)}{=}& \mathbb{E}_{p_{\text{data}}}[\nabla_{x}^2 \log p_{\theta}(x)],
\end{aligned}
$$

where $(i)$ holds if we assume $p_{\text{data}}(x) \to 0$ when $\vert x \vert \rightarrow \infty$. Now substituting the result of integration by parts into the 1-D Fisher divergence, we obtain

$$
\begin{aligned}
&\frac{1}{2}\mathbb{E}_{p_{\text{data}}}[(\nabla_{x} \log p_{\text{data}}(x) - \nabla_{x} \log p_{\theta}(x))^2] \\
=& \mathbb{E}_{p_{\text{data}}}[\nabla_{x}^2 \log p_{\theta}(x)] + \frac{1}{2} \mathbb{E}_{p_{\text{data}}}[(\nabla_{x} \log p_{\theta}(x))^2] + \text{const}.
\end{aligned}
$$

Generalizing the integration by parts argument to multi-dimensional data, we obtain the following objective, which is equivalent to the Fisher divergence up to a constant (see here for the details of the proof):

$$ \mathbb{E}_{p_{\text{data}}}\bigg[\operatorname{tr}(\nabla_{\mathbf{x}}^2 \log p_{\theta}(\mathbf{x})) + \frac{1}{2} \Vert \nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x})\Vert_2^2 \bigg] + \text{const}, $$

where $\nabla_{\mathbf{x}}^{2}$ denotes the Hessian with respect to $\mathbf{x}$. This objective is known as the score matching objective. Since it only involves functions of $\nabla_{\mathbf{x}}\log p_{\theta}(\mathbf{x})$, it does not depend on the intractable partition function $Z_{\theta}$ and is therefore ideal for learning unnormalized probability models.
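A minimal sketch (assuming `torch` and the hypothetical `EnergyNet` from above) of the plain score matching objective. Note the explicit loop over dimensions: obtaining the Hessian trace costs $D$ extra backward passes, which is what makes vanilla score matching expensive:

```python
import torch

def score_matching_loss(f, x):
    """Vanilla score matching: E[ tr(Hessian of log p_theta) + 0.5 * ||grad_x log p_theta||^2 ].

    `f` is the hypothetical EnergyNet above, so grad_x log p_theta(x) = -grad_x f(x).
    """
    x = x.requires_grad_(True)
    score = -torch.autograd.grad(f(x).sum(), x, create_graph=True)[0]  # (batch, D)
    norm_term = 0.5 * (score ** 2).sum(dim=-1)                         # 0.5 * ||score||_2^2
    trace = torch.zeros(x.shape[0], device=x.device)
    for d in range(x.shape[1]):                                        # D extra backward passes
        grad_d, = torch.autograd.grad(score[:, d].sum(), x, create_graph=True)
        trace = trace + grad_d[:, d]                                   # d^2 log p_theta / dx_d^2
    return (trace + norm_term).mean()
```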

NorbertZheng commented 2 years ago

Sliced score matching

So far, we know that score matching can be used to learn unnormalized models. We are particularly interested in learning deep energy-based models, a special kind of unnormalized model in which $f_{\theta}(x)$ is parameterized by a deep neural network. In order to use score matching for learning deep energy-based models, we have to compute

$$ \Vert \nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x})\Vert_2^2 = \Vert \nabla_{\mathbf{x}} f_{\theta}(\mathbf{x})\Vert_2^2 $$

and

$$ \operatorname{tr}(\nabla_{\mathbf{x}}^2 \log p_{\theta}(\mathbf{x})) = -\operatorname{tr}(\nabla_{\mathbf{x}}^2 f_{\theta}(\mathbf{x})). $$

The former can be computed with one simple backpropagation through $f_{\theta}(x)$. The latter, however, requires many more backpropagations. As studied in previous work, computing $\operatorname{tr}(\nabla_{\mathbf{x}}^2 f_{\theta}(\mathbf{x}))$ requires a number of backpropagations proportional to the data dimension $D$. Therefore, score matching is not scalable for learning deep energy-based models on high-dimensional data.

NorbertZheng commented 2 years ago

We propose sliced score matching to greatly scale up the computation of score matching. The motivating idea is that one-dimensional distributions are much easier to handle for score matching. We therefore propose to project the scores onto random directions, so that the vector fields of scores become scalar fields along those directions.

We then compare these scalar fields to determine how far the model distribution is from the data distribution. It is easy to see that the two vector fields are equivalent if and only if their scalar fields corresponding to projections onto all directions are the same. In practice, we compare projections along many directions, e.g., by sampling random directions from some distribution.

Mathematically, we denote by $\mathbf{v}$ the random projection direction, and by $p_{\mathbf{v}}$ its distribution. The randomly projected version of the Fisher divergence is

$$ \frac{1}{2}\mathbb{E}_{p_{\mathbf{v}}}\mathbb{E}_{p_{\text{data}}}[(\mathbf{v}^\intercal \nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x}) - \mathbf{v}^\intercal \nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x}))^2], $$

which we call the sliced Fisher divergence. Unfortunately, the sliced Fisher divergence has the same problem as the Fisher divergence, due to the unknown data score function $\nabla_{\mathbf{x}}\log p_{\text{data}}(\mathbf{x})$. We therefore play the same trick of integration by parts, as done in score matching, to obtain the following tractable alternative form

$$ \mathbb{E}_{p_{\mathbf{v}}}\mathbb{E}_{p_{\text{data}}}\bigg[\mathbf{v}^\intercal \nabla_{\mathbf{x}}^2 \log p_{\theta}(\mathbf{x})\mathbf{v} + \frac{1}{2} (\mathbf{v}^\intercal\nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x}))^2 \bigg] + \text{const}, $$

which is our sliced score matching objective. Again, $\mathbf{v}^\intercal\nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x})$ can be computed with one backpropagation for deep energy-based models. The first term of the objective still involves the Hessian, but only through Hessian-vector products, which can be computed with $\mathcal{O}(1)$ backpropagations. Therefore, the computation of sliced score matching does not depend on the dimension of the data, and it is much more scalable for training deep energy-based models on high-dimensional datasets.
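A minimal sketch of the sliced objective under the same assumptions (the hypothetical `EnergyNet` above, Gaussian projection directions $\mathbf{v}\sim\mathcal{N}(0,I)$). The Hessian-vector product $\mathbf{v}^{\intercal}\nabla_{\mathbf{x}}^{2}\log p_{\theta}(\mathbf{x})\mathbf{v}$ comes from one extra backward pass through $\mathbf{v}^{\intercal}\nabla_{\mathbf{x}}\log p_{\theta}(\mathbf{x})$, independent of the data dimension $D$:

```python
import torch

def sliced_score_matching_loss(f, x, n_projections: int = 1):
    """Sliced score matching with Gaussian projection directions v ~ N(0, I)."""
    x = x.requires_grad_(True)
    score = -torch.autograd.grad(f(x).sum(), x, create_graph=True)[0]    # grad_x log p_theta(x), (batch, D)
    loss = 0.0
    for _ in range(n_projections):                                       # M directions per sample
        v = torch.randn_like(x)
        v_score = (v * score).sum(dim=-1)                                 # v^T grad_x log p_theta(x)
        hvp, = torch.autograd.grad(v_score.sum(), x, create_graph=True)   # Hessian(log p_theta) @ v
        loss = loss + ((v * hvp).sum(dim=-1) + 0.5 * v_score ** 2).mean() # v^T H v + 0.5 (v^T score)^2
    return loss / n_projections
```

For Gaussian directions, the paper also discusses a variance-reduced variant that replaces $(\mathbf{v}^{\intercal}\nabla_{\mathbf{x}}\log p_{\theta}(\mathbf{x}))^{2}$ with $\Vert\nabla_{\mathbf{x}}\log p_{\theta}(\mathbf{x})\Vert_{2}^{2}$; the sketch above implements the basic objective only.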

NorbertZheng commented 2 years ago

Theoretical guarantees of learning with sliced score matching

Suppose our data points { $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N$ } are i.i.d. samples from the data distribution $p_{\text{data}}$. For each data point $\mathbf{x}_i$, we randomly draw $M$ projection directions { $\mathbf{v}_{i1}, \mathbf{v}_{i2}, \cdots, \mathbf{v}_{iM}$ } $\sim p_{\mathbf{v}}$. The sliced score matching objective can be estimated with empirical averages, giving rise to the following finite-sample estimator:

$$ \frac{1}{NM} \sum_{i=1}^N \sum_{j=1}^M \bigg[\mathbf{v}_{ij}^\intercal \nabla_{\mathbf{x}}^2 \log p_{\theta}(\mathbf{x}_i)\mathbf{v}_{ij} + \frac{1}{2} (\mathbf{v}_{ij}^\intercal\nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x}_i))^2 \bigg] $$

We denote by $\hat{\theta}_{N,M}$ the minimizer of the above empirical estimator, and let $\theta^{*}$ denote the true parameter corresponding to the data distribution, such that $p_{\theta^{*}}=p_{\text{data}}$. We can prove that under some regularity conditions, $\hat{\theta}_{N,M}$ is consistent and asymptotically normal. More rigorously, for any $M \in \mathbb{N}^{+}$, when $N \rightarrow \infty$, we have

$$ \hat{\theta}_{N,M} \stackrel{p}{\rightarrow} \theta^* $$

and

$$ \sqrt{N}(\hat{\theta}_{N,M} - \theta^*) \stackrel{d}{\rightarrow} \mathcal{N}(0,\Sigma), $$

where $\Sigma$ is some covariance matrix.
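For illustration only, a minibatch training step based on the finite-sample estimator above could reuse the hypothetical `sliced_score_matching_loss` sketch, with the minibatch playing the role of { $\mathbf{x}_1, \cdots, \mathbf{x}_N$ } and `n_projections` playing the role of $M$:

```python
import torch

# Hypothetical end-to-end usage of the sketches above.
f = EnergyNet(dim=2)
optimizer = torch.optim.Adam(f.parameters(), lr=1e-3)

x_batch = torch.randn(128, 2)                                   # stand-in for samples from p_data
loss = sliced_score_matching_loss(f, x_batch, n_projections=1)  # finite-sample estimator
optimizer.zero_grad()
loss.backward()
optimizer.step()
```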

NorbertZheng commented 2 years ago

Experiments

Sliced score matching has numerous applications, for example, training deep energy-based models, learning VAEs with implicit encoders, and training Wasserstein Auto-Encoders (WAEs).