ermongroup / ncsn

Noise Conditional Score Networks (NeurIPS 2019, Oral)
GNU General Public License v3.0

A question on convergence of DSM #9

Open cheind opened 2 years ago

cheind commented 2 years ago

Hey,

I'm currently tracing out the story of diffusion generative models, and right now I'm studying the denoising score matching (DSM) objective. I've noticed that your multi-scale approach relies heavily on it (and the original paper is quite old), so I decided to ask my question here.

I've gone through the theory of DSM and have a good grip on how and why it works. However, in practice I observe slow convergence (much slower than with ISM) on toy examples. In particular, I believe this might be due to the type of noise distribution selected. While the choice is not restricted, everyone seems to go with a normal distribution since it provides a simple derivative, namely 1/sigma**2 * (orig - perturbed). In practice, I've observed that the scale term in front causes the derivative to take values on the order of 1e4 for sigma=1e-2, and the loss jumps around quite heavily. The smaller sigma, the slower the convergence. The loss never really decreases, but the resulting gradient field looks comparable to what ISM gives.
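For concreteness, this is roughly the single-scale DSM loss I mean (a minimal PyTorch sketch; `score_net`, `dsm_loss` and the shapes are just placeholders for my toy setup):

```python
import torch

def dsm_loss(score_net, x, sigma):
    # Perturb the data with Gaussian noise of scale sigma.
    noise = torch.randn_like(x)
    x_tilde = x + sigma * noise
    # Score of the perturbation kernel q_sigma(x_tilde | x):
    # grad_{x_tilde} log q = (x - x_tilde) / sigma**2.
    # The 1/sigma**2 factor is what makes the target (and its variance)
    # blow up for small sigma.
    target = (x - x_tilde) / sigma**2
    return 0.5 * ((score_net(x_tilde) - target) ** 2).sum(dim=-1).mean()
```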

Did you observe this in your experiments as well?

yang-song commented 2 years ago

Very good observation! The convergence of DSM is plagued by large variance and becomes very slow for small sigma. This is a known issue, but it can be alleviated with control variates (see https://arxiv.org/abs/2101.03288 for an example). In our experiments we do DSM across multiple noise scales and didn't observe slowed convergence, since there are many large sigmas among the noise scales.
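For reference, a sketch of how the multi-scale objective with the sigma**2 weighting looks (the noise-conditional `score_net(x, sigma)` interface and helper name are illustrative, not the repo code):

```python
import torch

def multiscale_dsm_loss(score_net, x, sigmas):
    # Sample one noise scale per example from the set of scales `sigmas`.
    idx = torch.randint(len(sigmas), (x.shape[0],), device=x.device)
    sigma = sigmas[idx].view(-1, *([1] * (x.dim() - 1)))
    noise = torch.randn_like(x)
    x_tilde = x + sigma * noise
    s = score_net(x_tilde, sigma)
    # Weighting each scale by lambda(sigma) = sigma**2 turns the per-scale loss
    # sigma**2 * || s + (x_tilde - x)/sigma**2 ||^2 into || sigma*s + noise ||^2,
    # so every scale contributes on a comparable order of magnitude.
    per_example = 0.5 * ((sigma * s + noise) ** 2).flatten(1).sum(dim=1)
    return per_example.mean()
```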

cheind commented 2 years ago

Ah ok, I was already planning for variance reduction methods :) For larger sigmas everything seems to be much smoother, which I observed as well. I wonder whether the runtime advantage of DSM over ISM isn't eaten up again by the slower convergence. After all, for ISM we only need the trace of the Jacobian, which should be faster to compute than the entire Jacobian (if frameworks like PyTorch supported such an operation). I already have a quite fast version (limited to specific NN architectures) here:

https://github.com/cheind/diffusion-models/blob/189fbf545f07be0f8f9c42bc803016b846602f3c/diffusion/jacobians.py#L5
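For comparison, the generic autograd route needs one backward pass per data dimension to get the exact trace, roughly like this (a sketch of the implicit/Hyvärinen score matching loss, not the linked implementation; names are illustrative):

```python
import torch

def ism_loss_exact_trace(score_net, x):
    # Implicit (Hyvarinen) score matching: E[ tr(J_s(x)) + 0.5 * ||s(x)||^2 ].
    # The exact trace needs one backward pass per dimension, which is what
    # makes this expensive for high-dimensional data.
    x = x.detach().requires_grad_(True)
    s = score_net(x)                                   # shape (batch, dim)
    trace = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        trace = trace + grad_i[:, i]                   # d s_i / d x_i
    return (trace + 0.5 * (s ** 2).sum(dim=1)).mean()
```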

yang-song commented 2 years ago

Trace of the Jacobian is still very expensive to compute. That said, there are methods like sliced score matching that do not add noise and are not affected by the variance issue. I have tried them for training score-based models before; they gave decent performance but didn't seem to outperform DSM.
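For context, sliced score matching replaces the exact trace with random projections, so only one extra backward pass per projection is needed. A minimal sketch (the basic objective, without the variance-reduced variant; `ssm_loss` is an illustrative name):

```python
import torch

def ssm_loss(score_net, x, n_projections=1):
    # Sliced score matching: E_v[ v^T J_s(x) v + 0.5 * (v^T s(x))^2 ],
    # with v drawn at random instead of computing the full Jacobian trace.
    x = x.detach().requires_grad_(True)
    loss = 0.0
    for _ in range(n_projections):
        v = torch.randn_like(x)
        s = score_net(x)
        sv = (s * v).sum()
        grad_sv = torch.autograd.grad(sv, x, create_graph=True)[0]   # J_s^T v
        term1 = (grad_sv * v).sum(dim=1)        # v^T J_s v
        term2 = 0.5 * (s * v).sum(dim=1) ** 2   # 0.5 * (v^T s)^2
        loss = loss + (term1 + term2).mean()
    return loss / n_projections
```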

cheind commented 2 years ago

Yes, very true once the data dimension becomes large. I was thinking about (low-rank) approximations to the Jacobian and came across this paper

Abdel-Khalik, Hany S., et al. "A low rank approach to automatic differentiation." Advances in Automatic Differentiation. Springer, Berlin, Heidelberg, 2008. 55-65.

which is also quite dated. But after skimming it, the idea seems connected to your sliced SM approach: it is as if sliced score matching computes a low-rank approximation of the Jacobian.

Ok, thanks for your valuable time and have a nice Saturday.

cheind commented 2 years ago

I've recreated your toy example to compare Langevin and annealed Langevin sampling. In particular, I did not use exact scores but trained a toy model to perform score prediction. The results are below. In the first figure, the right plot shows default Langevin sampling (model trained unconditionally) with the expected issues. The next figure (again the right plot) shows annealed Langevin sampling as proposed in your paper (model trained conditioned on the noise level). The results are as expected, but I had to change one particular thing to make it work:

I believe the difference is due to the inexactness of model prediction and, of course, due to potential hidden errors in the code. Would you agree?

Figures: default_langevin, annealed_langevin
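For reference, the annealed Langevin loop I used follows the paper's algorithm; a sketch (the function name, `eps`, the steps per noise level, and the noise-conditional `score_net(x, sigma)` interface are just my toy-example choices):

```python
import torch

@torch.no_grad()
def annealed_langevin(score_net, x, sigmas, eps=2e-5, steps_per_sigma=100):
    # sigmas: noise scales sorted from largest to smallest (sigmas[-1] = smallest).
    for sigma in sigmas:
        alpha = eps * (sigma / sigmas[-1]) ** 2        # step size per noise level
        for _ in range(steps_per_sigma):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_net(x, sigma) + (alpha ** 0.5) * z
    return x
```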