MinkaiXu / GeoDiff

Implementation of GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation (ICLR 2022).
MIT License

Questions about the rescale problem #18

Open Frankie123421 opened 2 years ago

Frankie123421 commented 2 years ago

Hi Xu, thanks for sharing the code. I noticed the discussion in issue 11 (https://github.com/MinkaiXu/GeoDiff/issues/11) and have read the code line by line. As you stated there, the "diffusion" process in the code is rescaled compared to the paper, i.e., $\mathcal{C}^t = \frac{1}{\sqrt{\alpha_t}}(\sqrt{\alpha_t}\mathcal{C}^0 + \sqrt{1-\alpha_t}\epsilon)$. According to the ScoreSDE paper (https://arxiv.org/abs/2011.13456), DDPM is a variance-preserving process and DSM is a variance-exploding one. I think there may be some typos in your answer to issue 11 that make it contradictory: "2) use the alpha to rescale the data to achieve variation preserving" and "the problem of variation preserving is: it will change the scale of coordinates". As I see it, after rescaling, $\mathcal{C}^t = \mathcal{C}^0 + \frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\epsilon$ is a DSM process whose variance increases with $t$. So I am confused about why this rescaling would preserve the scale of the coordinates, since it seems to me that it corrupts the scale (increases the variance) instead.
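To make the algebra above concrete, here is a minimal NumPy sketch that checks the two forms of the rescaled process agree and that the noise scale $\frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}$ grows as $\alpha_t$ decreases. The function names, the coordinates, and the schedule values are hypothetical and chosen only for illustration; this is not the repository's code.

```python
import numpy as np

def rescaled_sample(c0, alpha_t, eps):
    """Variance-preserving sample divided by sqrt(alpha_t), as in the formula above."""
    vp = np.sqrt(alpha_t) * c0 + np.sqrt(1.0 - alpha_t) * eps
    return vp / np.sqrt(alpha_t)

def mean_plus_noise_sample(c0, alpha_t, eps):
    """Equivalent form: clean data plus noise with std sqrt((1 - alpha_t) / alpha_t)."""
    sigma_t = np.sqrt((1.0 - alpha_t) / alpha_t)
    return c0 + sigma_t * eps

rng = np.random.default_rng(0)
c0 = rng.normal(scale=3.0, size=(5, 3))  # hypothetical coordinates on a scale > 1
eps = rng.normal(size=c0.shape)          # the same noise is fed to both forms
alphas = np.array([0.99, 0.9, 0.5, 0.1, 0.01])  # hypothetical decreasing schedule

# Noise std of the rescaled process at each step of the schedule.
sigmas = np.sqrt((1.0 - alphas) / alphas)

# Both forms should give the same sample for every alpha_t.
forms_agree = all(
    np.allclose(rescaled_sample(c0, a, eps), mean_plus_noise_sample(c0, a, eps))
    for a in alphas
)
```

With this schedule, `sigmas` increases monotonically, which is the "variance increasing along with $t$" behavior described above, while the mean stays at the clean coordinates.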

MinkaiXu commented 1 year ago

Hi, thanks for your interest! I fully understand and actually agree with your statement. By "change the scale of coordinates", I meant that in my experiments the data is usually on scales larger than 1. In that case, variance preserving will shrink the data, while variance exploding won't. By contrast, in the DSM and DDPM papers the images are usually rescaled to [0,1]. Overall, I would say the rescaling is just a trick I found during implementation. With my discussion at https://github.com/MinkaiXu/GeoDiff/issues/11, I just wanted to clarify that there is indeed an underlying variance-preserving process.

Frankie123421 commented 1 year ago

Thanks for your kind reply. Yep, I kind of got it later. The rescaled process keeps the mean $\mathcal{C}^0$, while the original DDPM shrinks the data by the factor $\sqrt{\alpha_t}$ (which makes the construction of the radius graph fail), though I still think the rescaled one is variance exploding, since $\frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}} \rightarrow \infty$ as $t \rightarrow \infty$.
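The point about the radius graph can be illustrated with a small sketch. It looks only at the mean of each process (no noise) to isolate the shrinking effect: scaling coordinates by $\sqrt{\alpha_t}$ scales every pairwise distance by the same factor, so a fixed cutoff connects more and more atom pairs. The toy geometry, the cutoff, and the helper name are assumptions made for illustration, not values from the repository.

```python
import numpy as np

def count_pairs_within_cutoff(coords, cutoff):
    """Count unordered atom pairs whose distance is below the cutoff."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    upper = np.triu_indices(len(coords), k=1)  # each pair once, no self-pairs
    return int(np.sum(dist[upper] < cutoff))

# Hypothetical 4-atom chain with spacing 1.5 along the x axis.
c0 = np.array([[0.0, 0.0, 0.0],
               [1.5, 0.0, 0.0],
               [3.0, 0.0, 0.0],
               [4.5, 0.0, 0.0]])
cutoff = 2.0    # hypothetical radius-graph cutoff
alpha_t = 0.04  # hypothetical late-step value, sqrt(alpha_t) = 0.2

pairs_clean = count_pairs_within_cutoff(c0, cutoff)                      # only bonded neighbors
pairs_shrunk = count_pairs_within_cutoff(np.sqrt(alpha_t) * c0, cutoff)  # DDPM-style mean
pairs_rescaled = count_pairs_within_cutoff(c0, cutoff)                   # rescaled mean is c0 itself
```

In this toy case the shrunk mean turns the chain into a fully connected graph, while the rescaled mean keeps the original connectivity.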
Another question: in the chain-rule method, $\nabla_{d^t} \log q(d^t|d^0)$ is approximated by $-\frac{\sqrt{\alpha_t}(d^t-d^0)}{1 - \alpha_t}$. As I see it, this approximation assumes that the pairwise distances follow the same perturbation process as the coordinates, i.e., $d^t = \frac{1}{\sqrt{\alpha_t}}(\sqrt{\alpha_t}d^0 + \sqrt{1-\alpha_t}\epsilon)$, and the "transformed" score of the positions, $\nabla_{\mathcal{C}^t} \log q(\mathcal{C}^t|\mathcal{C}^0)$, is then obtained by the chain rule. I wonder how this assumption was made and why it is reasonable. Although this transformation makes the transformed added noise (score) equivariant with respect to $\mathcal{C}^t$, it obscures the relation between the truly added score and the transformed one. In some cases, e.g. when I need to recenter the (truly) added noise at its CoM, the transformed noise does not seem to satisfy this requirement naturally. And if I further force the transformed noise and the model outputs to also be recentered at their CoMs during training, the test results are bad. I guess this is because the recentered transformed noise falls far from the true noise.
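For readers following the chain-rule step, here is a minimal NumPy sketch of how a per-edge score on distances can be mapped to a per-atom score on positions using $\frac{\partial d_{ij}}{\partial x_i} = \frac{x_i - x_j}{d_{ij}}$. The function name, the undirected edge list, and the random edge scores are assumptions for illustration only; the repository's actual transform may differ in details such as directed edges.

```python
import numpy as np

def distance_score_to_position_score(coords, edges, edge_score):
    """Map per-edge distance scores to per-atom position scores by the chain rule.

    coords: (N, 3) positions; edges: (E, 2) undirected index pairs;
    edge_score: (E,) scalar score with respect to each pairwise distance.
    """
    i, j = edges[:, 0], edges[:, 1]
    diff = coords[i] - coords[j]
    dist = np.linalg.norm(diff, axis=-1, keepdims=True)
    dd_dx = diff / dist  # gradient of d_ij w.r.t. x_i; w.r.t. x_j it is the negative
    pos_score = np.zeros_like(coords)
    # np.add.at accumulates correctly when an atom appears in several edges.
    np.add.at(pos_score, i, edge_score[:, None] * dd_dx)
    np.add.at(pos_score, j, -edge_score[:, None] * dd_dx)
    return pos_score

rng = np.random.default_rng(0)
coords = rng.normal(size=(4, 3))                        # hypothetical positions
edges = np.array([[0, 1], [1, 2], [2, 3], [0, 2]])      # hypothetical edge list
edge_score = rng.normal(size=len(edges))                # stand-in for the distance score

score = distance_score_to_position_score(coords, edges, edge_score)

# Equivariance check: rotating the input rotates the output in the same way.
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))     # random orthogonal matrix
score_rotated = distance_score_to_position_score(coords @ rotation.T, edges, edge_score)
```

In this particular sketch each edge contributes equal and opposite vectors to its two endpoints, so the per-atom vectors sum to zero. Whether that matches the recentering discussed above depends on details not given in the thread, so treat it as an observation about the sketch rather than about the repository.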