Closed: KevinWang676 closed this issue 1 year ago.
Print out the dimensions of each variable; you will then figure it out.
Thank you. Also, I wonder if there is any specific reason why you chose $\gamma_t$ to be 0.1 or 0.15, which is a relatively small number. Does the choice of $\gamma_t$ come from your experiments or from some theoretical results? Thanks!
@KevinWang676 From our experiments; it is written in our paper.
Thanks. In the paper you mentioned that to select a constant $\gamma$ you "search on a small range of values". Is it possible that you missed some "big" values of $\gamma$, say $\gamma > 1$, that may also lead to good results?
@KevinWang676 Too strong a regularization would break down the original noise prediction.
Thanks! I wonder if the code `new_noise = noise + gamma * th.randn_like(noise)` is essentially the same as `noise = torch.randn_like(latents) + 0.1 * torch.randn(latents.shape[0], latents.shape[1], 1, 1)` proposed in the Diffusion With Offset Noise blog post.
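The two constructions look similar but differ in the shape of the extra noise. A minimal sketch (not from either codebase; tensor sizes are illustrative) contrasting them:

```python
import torch as th

gamma = 0.1
latents = th.randn(4, 3, 32, 32)
noise = th.randn_like(latents)

# DDPM-IP input perturbation: i.i.d. Gaussian noise added element-wise,
# so every pixel gets its own independent perturbation.
new_noise = noise + gamma * th.randn_like(noise)

# Offset noise: one random scalar per (sample, channel), broadcast over
# the spatial dimensions H and W, shifting each whole channel together.
offset_noise = noise + gamma * th.randn(latents.shape[0], latents.shape[1], 1, 1)

# The offset-noise perturbation is constant within each channel, so its
# spatial standard deviation is zero; the DDPM-IP perturbation is not.
print((offset_noise - noise).flatten(2).std(dim=2).max())
print((new_noise - noise).flatten(2).std(dim=2).mean())
```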
refer to this
That makes sense, thank you. Also, in your paper you mentioned $\gamma = 0.1$ is the best value for the CIFAR-10 dataset, but in your repo you used $\gamma = 0.15$ for CIFAR-10. Is there any reason for doing so? Thanks.
For most datasets, we find gamma=0.1 is a good option using the ADM code. For CIFAR-10, gamma=0.15 actually works better than gamma=0.1 in my recent experiments. Overall, you can try gamma between 0.1 and 0.15 to find the optimal value for your own dataset.
Got it, thank you! Could you explain to me why $\mathbf{y}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,(\boldsymbol{\epsilon} + \gamma_t \boldsymbol{\xi})$ would lead to better results than $\mathbf{y}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\sqrt{1+\gamma^{2}}\,\boldsymbol{\epsilon}^{\prime}$? I'm a little confused about it since they actually have the same distribution. I think the reason is that $\boldsymbol{\epsilon}^{\prime}$ is multiplied by an extra factor $\sqrt{1+\gamma^{2}}$, which is greater than $1$, and this makes the prediction less accurate. Am I right? Thanks.
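The claim that the two forms have the same distribution can be checked empirically: $\boldsymbol{\epsilon} + \gamma\boldsymbol{\xi}$ is a sum of independent Gaussians, hence Gaussian with standard deviation $\sqrt{1+\gamma^2}$, exactly like $\sqrt{1+\gamma^2}\,\boldsymbol{\epsilon}'$. A quick sanity check (standalone, not from the repo):

```python
import torch as th

th.manual_seed(0)
gamma = 0.1
n = 1_000_000

eps = th.randn(n)
xi = th.randn(n)

# DDPM-IP noise term: sum of two independent standard Gaussians
perturbed = eps + gamma * xi

# Single rescaled Gaussian with the same variance 1 + gamma^2
scaled = (1 + gamma ** 2) ** 0.5 * th.randn(n)

# Both empirical standard deviations are close to sqrt(1.01) ≈ 1.005
print(perturbed.std().item(), scaled.std().item())
```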
DDPM-IP and DDPM-y share the same input y_t, but they have different training targets.
Thanks! But I wonder how to determine which term is the training target, because in the expression $\mathbf{y}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,(\boldsymbol{\epsilon} + \gamma_t \boldsymbol{\xi})$ the term $\boldsymbol{\epsilon}$ seems to have the same contribution as $\boldsymbol{\xi}$. How can we know $\boldsymbol{\epsilon}$ is actually the training target rather than $\boldsymbol{\xi}$? Thank you.
The training target is determined by your loss function.
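Concretely, the perturbation $\boldsymbol{\xi}$ enters only the network input, while the loss still regresses against $\boldsymbol{\epsilon}$. A hypothetical sketch of one training step (variable names like `ddpm_ip_loss` and `alphas_cumprod` are my own, not the authors' code):

```python
import torch as th

def ddpm_ip_loss(model, x0, t, alphas_cumprod, gamma=0.1):
    """Hypothetical DDPM-IP training loss sketch (not the repo's exact code)."""
    eps = th.randn_like(x0)
    xi = th.randn_like(x0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    # The network input is built from the perturbed noise eps + gamma * xi ...
    y_t = a.sqrt() * x0 + (1 - a).sqrt() * (eps + gamma * xi)
    # ... but the regression target is still eps: xi never appears in the loss,
    # which is why eps (and not xi) is the training target.
    return ((model(y_t, t) - eps) ** 2).mean()
```

For example, with a dummy model `model = lambda y, t: th.zeros_like(y)`, the loss reduces to the mean square of `eps`, confirming only `eps` is penalized.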
Got it, thank you.
Hi, I got an error, "ValueError: only one element tensors can be converted to Python scalars", when I tried to use a different noise schedule for $\boldsymbol{\xi}$. I want $\gamma_{0}, \cdots, \gamma_{T}$ to have different values as $t$ increases, but it seems that I can't just introduce a different parameter `new_noise` that depends on $t$. The error is shown below. Could you help me resolve this issue? Thank you!