NorbertZheng / read-papers

My paper reading notes.
MIT License

UAI '19 | Sliced score matching: A scalable approach to density and score estimation. #32

Closed NorbertZheng closed 2 years ago

NorbertZheng commented 2 years ago

Song Y, Garg S, Shi J, et al. Sliced score matching: A scalable approach to density and score estimation. UAI 2019.

NorbertZheng commented 2 years ago

Related Reference

NorbertZheng commented 2 years ago

Overview

An overview of our UAI 2019 paper on Sliced Score Matching.

Theoretically, sliced score matching produces a consistent and asymptotically normal estimator under some regularity conditions. In practice, we apply sliced score matching to training deep energy-based models, learning VAEs with implicit encoders, and training Wasserstein Auto-Encoders (WAEs).

NorbertZheng commented 2 years ago

Unnormalized probability models

Let $f_{\theta}(x)$ be a real-valued function on $x \in \mathbb{R}^{D}$. We can define a probability model from $f_{\theta}(x)$, using the following formula

$$ p_{\theta}(x)=\frac{e^{-f_{\theta}(x)}}{Z_{\theta}}. $$

Here we assume that

$$ Z_{\theta}=\int e^{-f_{\theta}(x)}dx $$

exists and call it the partition function. Typically, it is intractable to evaluate the partition function, so $Z_{\theta}$ is an unknown quantity. The probability model $p_{\theta}(x)$ is called an unnormalized probability model, because the normalizing constant $Z_{\theta}$ is unknown.
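As a minimal sketch (not from the paper, in PyTorch), an unnormalized model of this form might look as follows; `EnergyNet` is a hypothetical network standing in for $f_{\theta}(x)$, and only the unnormalized part $-f_{\theta}(x)$ of $\log p_{\theta}(x)$ is ever computed:

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Hypothetical energy function f_theta: R^D -> R (one scalar per sample)."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # shape: (batch,)

def unnormalized_log_prob(f: EnergyNet, x: torch.Tensor) -> torch.Tensor:
    # log p_theta(x) = -f_theta(x) - log Z_theta; the partition function is
    # unknown, so we can only return the unnormalized part -f_theta(x).
    return -f(x)
```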

NorbertZheng commented 2 years ago

Learning unnormalized models with score matching

The de facto standard for learning a probability model is maximum likelihood estimation (MLE). However, MLE is not suitable for learning unnormalized probability models, because the log-likelihood

$$ \log p_{\theta}(\mathbf{x}) = -f_{\theta}(\mathbf{x}) - \log Z_{\theta} $$

depends on the intractable partition function $Z_{\theta}$.
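To see the difficulty concretely (a standard derivation, not spelled out in the note above), the gradient of the log-likelihood contains an expectation under the model itself:

$$ \nabla_{\theta}\log p_{\theta}(\mathbf{x}) = -\nabla_{\theta}f_{\theta}(\mathbf{x}) - \nabla_{\theta}\log Z_{\theta} = -\nabla_{\theta}f_{\theta}(\mathbf{x}) + \mathbb{E}_{p_{\theta}(\mathbf{x}')}\left[\nabla_{\theta}f_{\theta}(\mathbf{x}')\right], $$

so even gradient-based MLE requires (approximate) samples from $p_{\theta}$, e.g. via MCMC.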

Score matching bypasses the intractable $Z_{\theta}$ by only considering the score, which is defined as the gradient of the log-density with respect to the data, i.e., $\nabla_{x}\log p(x)$.

For example, the score of $p_{\theta}(x)$ is

$$ \nabla_{x}\log p_{\theta}(x)=-\nabla_{x}f_{\theta}(x). $$

Note that the score of an unnormalized probability model, $\nabla_{x}\log p_{\theta}(x)$, does not depend on the intractable partition function $Z_{\theta}$. Instead of using the likelihood $\log p_{\theta}(x)$ as the objective, score matching is based on the following objective

$$ \frac{1}{2}\mathbb{E}_{p_{\text{data}}}\left[\left\Vert\nabla_{x}\log p_{\text{data}}(x)-\nabla_{x}\log p_{\theta}(x)\right\Vert_{2}^{2}\right], $$

where $p_{\text{data}}(x)$ is the data distribution. The above objective is widely known as the Fisher divergence. Since it only involves $\nabla_{x}\log p_{\theta}(x)$ and does not require $Z_{\theta}$, it is ideal for training unnormalized models.
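As a quick sanity check (not from the paper), consider the 1-D Gaussians $p_{\text{data}}=\mathcal{N}(\mu_{1},1)$ and $p_{\theta}=\mathcal{N}(\mu_{2},1)$. Both scores are affine in $x$, so the Fisher divergence collapses to a squared distance between the means:

$$ \nabla_{x}\log\mathcal{N}(x;\mu,1)=\mu-x \quad\Longrightarrow\quad \frac{1}{2}\mathbb{E}_{p_{\text{data}}}\left[\left((\mu_{1}-x)-(\mu_{2}-x)\right)^{2}\right]=\frac{1}{2}(\mu_{1}-\mu_{2})^{2}, $$

which is zero exactly when the two distributions coincide.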

However, the Fisher divergence is not directly computable, because the data score $\nabla_{x}\log p_{\text{data}}(x)$ is unknown.

Score matching eliminates the data score using integration by parts. To simplify our discussion, we consider the Fisher divergence between distributions of 1-D random variables. We have

$$
\begin{aligned}
&\frac{1}{2}\mathbb{E}_{p_{\text{data}}}[(\nabla_{x} \log p_{\text{data}}(x) - \nabla_{x} \log p_{\theta}(x))^2] \\
=& \frac{1}{2} \int p_{\text{data}}(x) (\nabla_{x} \log p_{\text{data}}(x) - \nabla_{x} \log p_{\theta}(x))^2 \text{d}x \\
=& \underbrace{\frac{1}{2} \int p_{\text{data}}(x) (\nabla_{x} \log p_{\text{data}}(x))^2 \text{d}x}_{\text{const}} + \frac{1}{2} \int p_{\text{data}}(x) (\nabla_{x} \log p_{\theta}(x))^2 \text{d}x \\
&- \int p_{\text{data}}(x) \nabla_{x} \log p_{\theta}(x) \nabla_{x} \log p_{\text{data}}(x) \text{d}x.
\end{aligned}
$$

By integration by parts, we have

$$
\begin{aligned}
&- \int p_{\text{data}}(x) \nabla_{x} \log p_{\theta}(x) \nabla_{x} \log p_{\text{data}}(x) \text{d}x \\
=& - \int \nabla_{x} \log p_{\theta}(x) \nabla_{x} p_{\text{data}}(x) \text{d}x \\
=& - p_{\text{data}}(x) \nabla_{x} \log p_{\theta}(x)\bigg|^{\infty}_{-\infty} + \int p_{\text{data}}(x) \nabla_{x}^2 \log p_{\theta}(x) \text{d}x \\
\stackrel{(i)}{=}& \mathbb{E}_{p_{\text{data}}}[\nabla_{x}^2 \log p_{\theta}(x)],
\end{aligned}
$$

where $(i)$ holds if we assume $p_{\text{data}}(x) \to 0$ when $\vert x \vert \rightarrow \infty$. Now substituting the result of integration by parts into the 1-D Fisher divergence, we obtain

$$
\begin{aligned}
&\frac{1}{2}\mathbb{E}_{p_{\text{data}}}[(\nabla_{x} \log p_{\text{data}}(x) - \nabla_{x} \log p_{\theta}(x))^2] \\
=& \mathbb{E}_{p_{\text{data}}}[\nabla_{x}^2 \log p_{\theta}(x)] + \frac{1}{2} \mathbb{E}_{p_{\text{data}}}[(\nabla_{x} \log p_{\theta}(x))^2] + \text{const}.
\end{aligned}
$$

Generalizing the integration by parts argument to multi-dimensional data, we obtain the following objective, which is equivalent to the Fisher divergence up to a constant (see here for the details of the proof):

$$ \mathbb{E}_{p_{\text{data}}}\bigg[\operatorname{tr}(\nabla_{\mathbf{x}}^2 \log p_{\theta}(\mathbf{x})) + \frac{1}{2} \Vert \nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x})\Vert_2^2 \bigg] + \text{const}, $$

where $\nabla_{\mathbf{x}}^{2}$ denotes the Hessian with respect to $\mathbf{x}$. This objective is known as the score matching objective. Since it only involves functions of $\nabla_{\mathbf{x}}\log p_{\theta}(\mathbf{x})$, it does not depend on the intractable partition function $Z_{\theta}$ and is therefore ideal for learning unnormalized probability models.
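A minimal sketch (assuming `torch` and the hypothetical `EnergyNet` from above) of the plain score matching objective. Note the explicit loop over dimensions: obtaining the Hessian trace costs $D$ extra backward passes, which is what makes vanilla score matching expensive:

```python
import torch

def score_matching_loss(f, x):
    """Vanilla score matching: E[ tr(Hessian of log p_theta) + 0.5 * ||grad_x log p_theta||^2 ].

    `f` is the hypothetical EnergyNet above, so grad_x log p_theta(x) = -grad_x f(x).
    """
    x = x.requires_grad_(True)
    score = -torch.autograd.grad(f(x).sum(), x, create_graph=True)[0]  # (batch, D)
    norm_term = 0.5 * (score ** 2).sum(dim=-1)                         # 0.5 * ||score||_2^2
    trace = torch.zeros(x.shape[0], device=x.device)
    for d in range(x.shape[1]):                                        # D extra backward passes
        grad_d, = torch.autograd.grad(score[:, d].sum(), x, create_graph=True)
        trace = trace + grad_d[:, d]                                   # d^2 log p_theta / dx_d^2
    return (trace + norm_term).mean()
```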

NorbertZheng commented 2 years ago

Sliced score matching

So far, we know that score matching can be used to learn unnormalized models. We are particularly interested in learning deep energy-based models, a special kind of unnormalized model in which $f_{\theta}(x)$ is parameterized by a deep neural network. In order to use score matching for learning deep energy-based models, we have to compute

$$ \Vert \nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x})\Vert_2^2 = \Vert \nabla_{\mathbf{x}} f_{\theta}(\mathbf{x})\Vert_2^2 $$

and

$$ \operatorname{tr}(\nabla_{\mathbf{x}}^2 \log p_{\theta}(\mathbf{x})) = -\operatorname{tr}(\nabla_{\mathbf{x}}^2 f_{\theta}(\mathbf{x})). $$

The former can be computed with one simple backpropagation through $f_{\theta}(x)$. The latter, however, requires many more backpropagations. As studied in previous work, computing $\operatorname{tr}(\nabla_{\mathbf{x}}^2 f_{\theta}(\mathbf{x}))$ requires a number of backpropagations proportional to the data dimension $D$. Therefore, score matching is not scalable for learning deep energy-based models on high-dimensional data.

NorbertZheng commented 2 years ago

We propose sliced score matching to greatly scale up the computation of score matching. The motivating idea is that one-dimensional distributions are much easier to handle for score matching. We therefore propose to project the scores onto random directions, so that the vector fields of scores become scalar fields along those directions.

We then compare these scalar fields to determine how far the model distribution is from the data distribution. It is easy to see that the two vector fields are equivalent if and only if their scalar fields corresponding to projections onto all directions are the same. In practice, we compare projections along many directions, e.g., by sampling random directions from some distribution.

Mathematically, we denote by $\mathbf{v}$ the random projection direction, and by $p_{\mathbf{v}}$ its distribution. The randomly projected version of the Fisher divergence is

$$ \frac{1}{2}\mathbb{E}_{p_{\mathbf{v}}}\mathbb{E}_{p_{\text{data}}}[(\mathbf{v}^\intercal \nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x}) - \mathbf{v}^\intercal \nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x}))^2], $$

which we call the sliced Fisher divergence. Unfortunately, the sliced Fisher divergence has the same problem as the Fisher divergence, due to the unknown data score function $\nabla_{\mathbf{x}}\log p_{\text{data}}(\mathbf{x})$. We therefore play the same trick of integration by parts, as done in score matching, to obtain the following tractable alternative form

$$ \mathbb{E}_{p_{\mathbf{v}}}\mathbb{E}_{p_{\text{data}}}\bigg[\mathbf{v}^\intercal \nabla_{\mathbf{x}}^2 \log p_{\theta}(\mathbf{x})\mathbf{v} + \frac{1}{2} (\mathbf{v}^\intercal\nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x}))^2 \bigg] + \text{const}, $$

which is our sliced score matching objective. Again, $\mathbf{v}^\intercal\nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x})$ can be computed with one backpropagation for deep energy-based models. The first term of the objective still involves the Hessian, but only through Hessian-vector products, which can be computed with $\mathcal{O}(1)$ backpropagations. Therefore, the computation of sliced score matching does not depend on the dimension of the data, and it is much more scalable for training deep energy-based models on high-dimensional datasets.
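A minimal sketch of the sliced objective under the same assumptions (the hypothetical `EnergyNet` above, Gaussian projection directions $\mathbf{v}\sim\mathcal{N}(0,I)$). The Hessian-vector product $\mathbf{v}^{\intercal}\nabla_{\mathbf{x}}^{2}\log p_{\theta}(\mathbf{x})\mathbf{v}$ comes from one extra backward pass through $\mathbf{v}^{\intercal}\nabla_{\mathbf{x}}\log p_{\theta}(\mathbf{x})$, independent of the data dimension $D$:

```python
import torch

def sliced_score_matching_loss(f, x, n_projections: int = 1):
    """Sliced score matching with Gaussian projection directions v ~ N(0, I)."""
    x = x.requires_grad_(True)
    score = -torch.autograd.grad(f(x).sum(), x, create_graph=True)[0]    # grad_x log p_theta(x), (batch, D)
    loss = 0.0
    for _ in range(n_projections):                                       # M directions per sample
        v = torch.randn_like(x)
        v_score = (v * score).sum(dim=-1)                                 # v^T grad_x log p_theta(x)
        hvp, = torch.autograd.grad(v_score.sum(), x, create_graph=True)   # Hessian(log p_theta) @ v
        loss = loss + ((v * hvp).sum(dim=-1) + 0.5 * v_score ** 2).mean() # v^T H v + 0.5 (v^T score)^2
    return loss / n_projections
```

For Gaussian directions, the paper also discusses a variance-reduced variant that replaces $(\mathbf{v}^{\intercal}\nabla_{\mathbf{x}}\log p_{\theta}(\mathbf{x}))^{2}$ with $\Vert\nabla_{\mathbf{x}}\log p_{\theta}(\mathbf{x})\Vert_{2}^{2}$; the sketch above implements the basic objective only.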

NorbertZheng commented 2 years ago

Theoretical guarantees of learning with sliced score matching

Suppose our data points { $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N$ } are i.i.d. samples from the data distribution $p_{\text{data}}$. For each data point $\mathbf{x}_i$, we randomly draw $M$ projection directions { $\mathbf{v}_{i1}, \mathbf{v}_{i2}, \cdots, \mathbf{v}_{iM}$ } $\sim p_{\mathbf{v}}$. The sliced score matching objective can be estimated with empirical averages, giving rise to the following finite-sample estimator:

$$ \frac{1}{NM} \sum_{i=1}^N \sum_{j=1}^M \bigg[\mathbf{v}_{ij}^\intercal \nabla_{\mathbf{x}}^2 \log p_{\theta}(\mathbf{x}_i)\mathbf{v}_{ij} + \frac{1}{2} (\mathbf{v}_{ij}^\intercal\nabla_{\mathbf{x}} \log p_{\theta}(\mathbf{x}_i))^2 \bigg] $$

We denote by $\hat{\theta}_{N,M}$ the minimizer of the above empirical estimator, and let $\theta^{*}$ denote the true parameter corresponding to the data distribution, such that $p_{\theta^{*}}=p_{\text{data}}$. We can prove that under some regularity conditions, $\hat{\theta}_{N,M}$ is consistent and asymptotically normal. More rigorously, for any $M \in \mathbb{N}^{+}$, when $N \rightarrow \infty$, we have

$$ \hat{\theta}_{N,M} \stackrel{p}{\rightarrow} \theta^* $$

and

$$ \sqrt{N}(\hat{\theta}_{N,M} - \theta^*) \stackrel{d}{\rightarrow} \mathcal{N}(0,\Sigma), $$

where $\Sigma$ is some covariance matrix.
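For illustration only, a minibatch training step based on the finite-sample estimator above could reuse the hypothetical `sliced_score_matching_loss` sketch, with the minibatch playing the role of { $\mathbf{x}_1, \cdots, \mathbf{x}_N$ } and `n_projections` playing the role of $M$:

```python
import torch

# Hypothetical end-to-end usage of the sketches above.
f = EnergyNet(dim=2)
optimizer = torch.optim.Adam(f.parameters(), lr=1e-3)

x_batch = torch.randn(128, 2)                                   # stand-in for samples from p_data
loss = sliced_score_matching_loss(f, x_batch, n_projections=1)  # finite-sample estimator
optimizer.zero_grad()
loss.backward()
optimizer.step()
```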

NorbertZheng commented 2 years ago

Experiments

Sliced score matching has numerous applications, for example, training deep energy-based models, learning VAEs with implicit encoders, and training Wasserstein Auto-Encoders (WAEs).