Closed: KevinWang676 closed this issue 1 year ago.
Print out the dimensions of each variable; you will then figure it out.
Thank you. Also, I wonder if there is any specific reason why you chose $\gamma_t$ to be 0.1 or 0.15, which is a relatively small number. Does the choice of $\gamma_t$ come from your experiments or from some theoretical results? Thanks!
@KevinWang676 From our experiments; it is written in our paper.
Thanks. In the paper you mentioned that to select a constant $\gamma$ you "search on a small range of values". Is it possible that you missed some "big" values of $\gamma$, say $\gamma > 1$, that may also lead to good results?
@KevinWang676 Too strong a regularization would break down the original noise prediction.
Thanks! I wonder if the code `new_noise = noise + gamma * th.randn_like(noise)` is essentially the same as `noise = torch.randn_like(latents) + 0.1 * torch.randn(latents.shape[0], latents.shape[1], 1, 1)` proposed in the Diffusion With Offset Noise blog post.
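The two constructions look similar but differ in the shape of the extra noise. A minimal sketch (not from either codebase; tensor sizes are illustrative) contrasting them:

```python
import torch as th

gamma = 0.1
latents = th.randn(4, 3, 32, 32)
noise = th.randn_like(latents)

# DDPM-IP input perturbation: i.i.d. Gaussian noise added element-wise,
# so every pixel gets its own independent perturbation.
new_noise = noise + gamma * th.randn_like(noise)

# Offset noise: one random scalar per (sample, channel), broadcast over
# the spatial dimensions H and W, shifting each whole channel together.
offset_noise = noise + gamma * th.randn(latents.shape[0], latents.shape[1], 1, 1)

# The offset-noise perturbation is constant within each channel, so its
# spatial standard deviation is zero; the DDPM-IP perturbation is not.
print((offset_noise - noise).flatten(2).std(dim=2).max())
print((new_noise - noise).flatten(2).std(dim=2).mean())
```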
refer to this
That makes sense, thank you. Also, in your paper you mentioned $\gamma = 0.1$ is the best value for the CIFAR-10 dataset, but in your repo you used $\gamma = 0.15$ for CIFAR-10. Is there any reason for doing so? Thanks.
For most datasets, we find gamma=0.1 is a good option using the ADM code. For CIFAR-10, gamma=0.15 actually works better than gamma=0.1 in my recent experiments. Overall, you can try gamma between 0.1 and 0.15 to find the optimal value for your own dataset.
Got it, thank you! Could you explain to me why $\mathbf{y}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,(\boldsymbol{\epsilon} + \gamma_t \boldsymbol{\xi})$ would lead to better results than $\mathbf{y}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\sqrt{1+\gamma^{2}}\,\boldsymbol{\epsilon}^{\prime}$? I'm a little confused about it since they actually have the same distribution. I think the reason is that $\boldsymbol{\epsilon}^{\prime}$ is multiplied by an extra factor $\sqrt{1+\gamma^{2}}$, which is greater than $1$, and this makes the prediction less accurate. Am I right? Thanks.
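The claim that the two forms have the same distribution can be checked empirically: $\boldsymbol{\epsilon} + \gamma\boldsymbol{\xi}$ is a sum of independent Gaussians, hence Gaussian with standard deviation $\sqrt{1+\gamma^2}$, exactly like $\sqrt{1+\gamma^2}\,\boldsymbol{\epsilon}'$. A quick sanity check (standalone, not from the repo):

```python
import torch as th

th.manual_seed(0)
gamma = 0.1
n = 1_000_000

eps = th.randn(n)
xi = th.randn(n)

# DDPM-IP noise term: sum of two independent standard Gaussians
perturbed = eps + gamma * xi

# Single rescaled Gaussian with the same variance 1 + gamma^2
scaled = (1 + gamma ** 2) ** 0.5 * th.randn(n)

# Both empirical standard deviations are close to sqrt(1.01) ≈ 1.005
print(perturbed.std().item(), scaled.std().item())
```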
DDPM-IP and DDPM-y share the same input y_t, but they have different training targets.
Thanks! But I wonder how to determine which term is the training target, because in the expression $\mathbf{y}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,(\boldsymbol{\epsilon} + \gamma_t \boldsymbol{\xi})$ the term $\boldsymbol{\epsilon}$ seems to have the same contribution as $\boldsymbol{\xi}$. How can we know $\boldsymbol{\epsilon}$ is actually the training target rather than $\boldsymbol{\xi}$? Thank you.
The training target is determined by your loss function.
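Concretely, the perturbation $\boldsymbol{\xi}$ enters only the network input, while the loss still regresses against $\boldsymbol{\epsilon}$. A hypothetical sketch of one training step (variable names like `ddpm_ip_loss` and `alphas_cumprod` are my own, not the authors' code):

```python
import torch as th

def ddpm_ip_loss(model, x0, t, alphas_cumprod, gamma=0.1):
    """Hypothetical DDPM-IP training loss sketch (not the repo's exact code)."""
    eps = th.randn_like(x0)
    xi = th.randn_like(x0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    # The network input is built from the perturbed noise eps + gamma * xi ...
    y_t = a.sqrt() * x0 + (1 - a).sqrt() * (eps + gamma * xi)
    # ... but the regression target is still eps: xi never appears in the loss,
    # which is why eps (and not xi) is the training target.
    return ((model(y_t, t) - eps) ** 2).mean()
```

For example, with a dummy model `model = lambda y, t: th.zeros_like(y)`, the loss reduces to the mean square of `eps`, confirming only `eps` is penalized.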
Got it, thank you.
Hi, I got an error, "ValueError: only one element tensors can be converted to Python scalars", when I tried to use a different noise schedule for $\boldsymbol{\xi}$. I want $\gamma_{0}, \cdots, \gamma_{T}$ to have different values as $t$ increases, but it seems that I can't just introduce a different parameter `new_noise` that depends on $t$. The error is shown below. Could you help me resolve this issue? Thank you!