DPS2022 / diffusion-posterior-sampling

Official PyTorch repository for "Diffusion Posterior Sampling for General Noisy Inverse Problems"
https://dps2022.github.io/diffusion-posterior-sampling-page/

Paper & implementation differences #6

Open · man-sean opened this issue 1 year ago

man-sean commented 1 year ago

Hi, there are a few differences between the paper and this repository, and it would be wonderful if you could clarify the reasons behind them:

  1. The Gaussian-noise experiments reported in the paper use sigma_y=0.05, and indeed the config files set config['noise']['sigma']=0.05. But while the images are stretched from [0,1] to [-1,1], sigma is left unchanged – meaning that in practice the added noise has std sigma/2 relative to the original [0,1] scale, i.e. y_n is cleaner than the settings reported in the paper. This can easily be checked by computing torch.std(y - y_n) after y and y_n are created in sample_condition.py (a minimal check is sketched after this list).
  2. The paper defines the step size as a constant divided by the norm of the gradient (Appendix C.2), meaning the gradient is always normalized before being scaled. In the code, the constant is defined in config['conditioning']['params']['scale'] and used in PosteriorSampling.conditioning() to scale the gradient, but the gradient is never normalized in the first place (e.g. in PosteriorSampling.grad_and_value()). Adding the gradient normalization seems to break the method.
  3. For the Gaussian FFHQ-SRx4 case, Appendix D.1 defines the scale as 1.0, but configs/super_resolution_config.yaml uses 0.3.
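
As a minimal illustration of point 1 (a self-contained sketch with made-up tensors, not the repository code; only sigma = 0.05 and the [0,1] → [-1,1] stretch come from the description above):

```python
import torch

sigma = 0.05                        # config['noise']['sigma'] in the repo

# Hypothetical clean measurement in [0, 1], then stretched to [-1, 1]
# as done in the data pipeline.
y01 = torch.rand(3, 256, 256)
y = y01 * 2.0 - 1.0

# Noise is added with std = sigma on the [-1, 1] scale.
y_n = y + sigma * torch.randn_like(y)

print(torch.std(y_n - y))               # ~0.05 on the [-1, 1] scale ...

# ... but expressed back on the original [0, 1] scale, the same perturbation
# has std sigma / 2, i.e. y_n is cleaner than sigma_y = 0.05 suggests.
print(torch.std((y_n + 1) / 2 - y01))   # ~0.025
```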

Thank you for your time and effort!

berthyf96 commented 1 year ago

For (2), I think the authors apply the normalization factor before taking the gradient. If you look at ConditioningMethod.grad_and_value (here), they take the gradient of the norm, not the norm squared.
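
A quick numerical check of this point (a standalone sketch with a made-up operator `A` and random tensors, not the repository's `grad_and_value`): the gradient of the residual norm equals the gradient of the squared residual divided by twice the norm, so a 1/||·|| normalization is built into the gradient itself.

```python
import torch

torch.manual_seed(0)
A = lambda x: 0.5 * x                 # hypothetical linear forward operator
x = torch.randn(16, requires_grad=True)
y = torch.randn(16)

residual = y - A(x)
norm = torch.linalg.norm(residual)    # ||y - A(x)||, gradient of the norm
sq = (residual ** 2).sum()            # ||y - A(x)||^2, gradient of the norm squared

grad_norm = torch.autograd.grad(norm, x, retain_graph=True)[0]
grad_sq = torch.autograd.grad(sq, x)[0]

# d||r||/dx = (d||r||^2/dx) / (2 ||r||): the 1/||r|| factor is implicit.
print(torch.allclose(grad_norm, grad_sq / (2 * norm)))   # True
```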

I believe there's another difference between Alg. 1 of the paper and the code. In EpsilonXMeanProcessor.predict_xstart (here), the coefficient applied to the score-model output is different from the coefficient in line 4 of Alg. 1. In the paper, the coefficient is $(1-\bar{\alpha}_i)/\sqrt{\bar{\alpha}_i}$, but in the code, the coefficient applied to the noise-predictor output is $-\sqrt{1/\bar{\alpha}_i - 1}$.

claroche-r commented 1 year ago

@berthyf96, for your second point regarding "EpsilonXMeanProcessor.predict_xstart", I also did not understand the difference until I realized that the score function $\widehat{s}(x_t)$ associated with a noise predictor $\epsilon_\theta(x_t)$ is: $$\widehat{s}(x_t) = \nabla_{x_t} \log p_\theta(x_t) = - \frac{1}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t)$$ See Equation (11) here. Injecting this result into the expression of $\widehat{x}_0$ in Alg. 1 gives the implemented result.
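
For completeness, a small numerical sanity check of this algebra (made-up values for alpha_bar, x_t, and eps; not repository code): substituting the score expression above into line 4 of Alg. 1 reproduces the predict_xstart-style coefficients.

```python
import torch

torch.manual_seed(0)
alpha_bar = torch.tensor(0.7)         # made-up \bar{alpha}_t
x_t = torch.randn(8)
eps = torch.randn(8)                  # stand-in for the noise-predictor output

# Alg. 1, line 4, written with the score s_hat = -eps / sqrt(1 - alpha_bar):
score = -eps / torch.sqrt(1 - alpha_bar)
x0_paper = (x_t + (1 - alpha_bar) * score) / torch.sqrt(alpha_bar)

# predict_xstart-style formula, written with the noise predictor:
x0_code = torch.sqrt(1 / alpha_bar) * x_t - torch.sqrt(1 / alpha_bar - 1) * eps

print(torch.allclose(x0_paper, x0_code))   # True
```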

berthyf96 commented 1 year ago

@claroche-r thanks so much for clarifying that!

Mally-cj commented 6 months ago

thank you!