CompVis / latent-diffusion

High-Resolution Image Synthesis with Latent Diffusion Models

DDIM Timesteps Offset #136

BlueAmulet opened this issue 2 years ago

What is the significance of this line of code? https://github.com/CompVis/latent-diffusion/blob/main/ldm/modules/diffusionmodules/util.py#L57 It causes a subtle issue: if you want to run the model for the full 1000 timesteps, an array covering [0, 999] is created and then shifted to [1, 1000], after which sampling crashes here with "IndexError: index 1000 is out of bounds for dimension 0 with size 1000": https://github.com/CompVis/latent-diffusion/blob/main/ldm/modules/diffusionmodules/util.py#L65
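
A minimal sketch of the failure mode (the `+1` offset and the uniform discretization follow the linked lines in util.py; the rest is simplified):

```python
import numpy as np

num_ddpm_timesteps = 1000
num_ddim_timesteps = 1000  # requesting the full schedule

# uniform discretization as in make_ddim_timesteps: [0, 1, ..., 999]
c = num_ddpm_timesteps // num_ddim_timesteps
ddim_timesteps = np.asarray(list(range(0, num_ddpm_timesteps, c)))

# the offset in question (util.py#L57): shifts the schedule to [1, ..., 1000]
steps_out = ddim_timesteps + 1

# alphacums has exactly 1000 entries, so index 1000 is out of bounds
alphacums = np.linspace(0.9999, 0.01, num_ddpm_timesteps)  # dummy values
alphas = alphacums[steps_out]  # IndexError: index 1000 is out of bounds ...
```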

javyduck commented 2 years ago

I am also confused about this and ran into the same problem... I cannot sample if I set ddim_num_steps to 1000.

Ir1d commented 2 years ago

@rromb Could you please help us clarify this line? The DDIM sampling steps cannot be 1000 because of it. Similar issue in #44

malteprinzler commented 1 year ago

I faced the same issue. Has anyone found a solution / explanation to this?

It seems like this line's only purpose is to prevent timestep=0 from being used in the DDIM scheduler. This appears to be a problem because then, following https://github.com/CompVis/latent-diffusion/blob/a506df5756472e2ebaf9078affdde2c4f1502cd4/ldm/modules/diffusionmodules/util.py#L66, alpha_t and alpha_{t-1} are identical for t=1.

Using equation (12) of https://arxiv.org/pdf/2010.02502.pdf (the DDIM paper), this results in x0 = x1 for that case. Since x1 is normalized differently than x0, this might be the problem referred to in the comment "add one to get the final alpha values right (the ones from first scale to data during sampling)" at https://github.com/CompVis/latent-diffusion/blob/a506df5756472e2ebaf9078affdde2c4f1502cd4/ldm/modules/diffusionmodules/util.py#L56C1-L56C1

My suggested fix would be to remove https://github.com/CompVis/latent-diffusion/blob/main/ldm/modules/diffusionmodules/util.py#L57 and to change https://github.com/CompVis/latent-diffusion/blob/a506df5756472e2ebaf9078affdde2c4f1502cd4/ldm/modules/diffusionmodules/util.py#L66 to `alphas_prev = np.asarray([1.] + alphacums[ddim_timesteps[:-1]].tolist())`, as this follows the convention introduced in equation (12) of the DDIM paper.
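
A sketch of how the two changes would look (reduced to the relevant lines; the names and the sigma formula follow the repo's util.py, so treat this as an illustration rather than a drop-in patch):

```python
import numpy as np

def make_ddim_timesteps_fixed(num_ddim_timesteps, num_ddpm_timesteps):
    # uniform discretization, now *without* the +1 offset from util.py#L57
    c = num_ddpm_timesteps // num_ddim_timesteps
    return np.asarray(list(range(0, num_ddpm_timesteps, c)))

def make_ddim_sampling_parameters_fixed(alphacums, ddim_timesteps, eta):
    alphas = alphacums[ddim_timesteps]
    # alpha_0 = 1, following the convention stated just below eq. (12)
    # of the DDIM paper, instead of repeating alphacums[0]
    alphas_prev = np.asarray([1.] + alphacums[ddim_timesteps[:-1]].tolist())
    sigmas = eta * np.sqrt(
        (1 - alphas_prev) / (1 - alphas) * (1 - alphas / alphas_prev))
    return sigmas, alphas, alphas_prev
```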

Any thoughts on this?

malteprinzler commented 1 year ago

Just checked: there are no visible qualitative differences in image synthesis between the original implementation and my suggested fix, but now you can sample DDIM with 1000 steps.
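
For reference, a usage sketch of what the fix enables (assuming the repo's `DDIMSampler` API; `model` is a loaded LatentDiffusion checkpoint and `c` a conditioning tensor, both set up elsewhere):

```python
from ldm.models.diffusion.ddim import DDIMSampler

sampler = DDIMSampler(model)  # model: a loaded LatentDiffusion instance
samples, _ = sampler.sample(
    S=1000,                   # full 1000 DDIM steps, previously crashed
    batch_size=1,
    shape=(3, 64, 64),        # latent shape (C, H, W), model-dependent
    conditioning=c,
    eta=0.0,                  # deterministic DDIM
    verbose=False,
)
```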

TorAP commented 1 year ago

Hello @malteprinzler
Did you manage to train a model with custom data? I have a few questions which I cannot seem to find answers to in the README.

malteprinzler commented 1 year ago

@TorAP yes, I did. What questions do you have?

TorAP commented 1 year ago

Thank you! @malteprinzler

Please tell me if the questions above do not make sense. I apologize for my ignorance in this field; it is not within my expertise.

Best regards

malteprinzler commented 1 year ago

@TorAP sorry, I just noticed that you're talking about LDMs. I have only trained ControlNets so far, i.e. just LDM finetuning, so I haven't worked much with this repository.

TorAP commented 1 year ago

I see @malteprinzler, all good. The next step for me will be using a ControlNet, once I figure out the training steps of LDMs. :-)

Did you have any success with training your own LDM @BlueAmulet, and do you have any input on my questions? :)

haohang96 commented 11 months ago

@malteprinzler Hello, I agree with you that using equation (12) in https://arxiv.org/pdf/2010.02502.pdf (DDIM paper) results in x0 = x1 for this case. But I wonder what your statement "Since x1 is normalized differently than x0" means. Could you please explain what "normalized" stands for in this context?

malteprinzler commented 11 months ago

@haohang96 Maybe the term "normalized" was a bit misleading.

Referring to equation (4) from the DDIM paper (https://arxiv.org/pdf/2010.02502.pdf), $q_\sigma(x_t \mid x_0) = \mathcal{N}(\sqrt{\alpha_t}\, x_0, (1 - \alpha_t) I)$, we get

$x_1 = \sqrt{\alpha_1}\, x_0 + \sqrt{1 - \alpha_1}\, \epsilon,$

which obviously $\neq x_0$.

However, note that, as outlined above, the implementation treats the prediction of $x_1$ as equal to the prediction of $x_0$.

Please read my comment simply as "You shouldn't use the prediction for $x_1$ as the prediction for $x_0$."

haohang96 commented 11 months ago

Thanks for your reply :)

Do you mean that $x_1 = \sqrt{\alpha_1}\, x_0 + \sqrt{1 - \alpha_1}\, \epsilon$, and the implementation treats $x_0 = \sqrt{\alpha_0}\, x_0 + \sqrt{1 - \alpha_0}\, \epsilon$ (although $x_0$ can be sampled directly from real data, without adding noise)?

So if we set $\alpha_1 = \alpha_0$, then $x_0 = x_1$, which is incorrect?

malteprinzler commented 11 months ago

My point was more that, currently, the predictions for $x_1$ are effectively used as the final output of the denoising process, which is not correct.

haohang96 commented 11 months ago

> My point was more that, currently, the predictions for $x_1$ are effectively used as the final output of the denoising process, which is not correct.

Hello @malteprinzler, may I clarify the points you raised with the following derivation?

Referring to Eq. (12) in DDIM:

$$x_{t-1} = \sqrt{\alpha_{t-1}} \left( \frac{x_t - \sqrt{1 - \alpha_t}\, \epsilon_\theta(x_t)}{\sqrt{\alpha_t}} \right) + \sqrt{1 - \alpha_{t-1} - \sigma_t^2}\; \epsilon_\theta(x_t) + \sigma_t \epsilon_t$$

When the offset=1 is not utilized, we get $\alpha_0 = \alpha_1$. Considering $\sigma_1 = 0$, the equation for $t = 1$ simplifies as follows:

$$x_0 = \sqrt{\alpha_1} \left( \frac{x_1 - \sqrt{1 - \alpha_1}\, \epsilon_\theta(x_1)}{\sqrt{\alpha_1}} \right) + \sqrt{1 - \alpha_1}\; \epsilon_\theta(x_1) = x_1$$

In essence, this implies that the prediction for $x_1$ effectively serves as the final output of the denoising process, contradicting your earlier derivation that $x_0 \neq x_1$.

malteprinzler commented 11 months ago
  1. I think you are referring to equation 12 instead of 21

  2. I think you are deriving exactly what I tried to express, namely that, with the current implementation, the final prediction, hence x_0, equals the prediction for x_1. The problem I see is your assumption that alpha_0 and alpha_1 are identical. As I understand the paper, all alpha_t, t = 1...T, are copied from the original diffusion model so that DDIM can be used without retraining diffusion models. Additionally, note that alpha_0 = 1 is defined in the DDIM paper just below equation (12); in the current implementation, however, alpha_0 is set to a value different from 1 (see my original comment above).

     If you actually set alpha_0 to 1 in equation (12), you will see that the prediction of x_0 is identical to the term marked as "predicted x0". The current implementation, however, effectively sets alpha_0 to alpha_1 != 1, which, as you correctly derived, leads to the prediction of x0 = x1. So, practically, the last denoising step has no effect!

     Note, however, that x1 will not follow the distribution of GT images! It will follow (at least approximately) the distribution that x_1 followed during training of the diffusion model, hence a slightly noised version of the ground-truth distribution. In practice, this noise level is quite small, so you probably won't notice a big visual difference; it is just mathematically not quite correct.
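
A small numeric sketch to back up "this noise level is quite small" (assuming the standard DDPM linear beta schedule, beta from 1e-4 to 2e-2 over 1000 steps; the schedule in a given LDM config may differ):

```python
import numpy as np

# assumed: standard DDPM linear beta schedule over 1000 steps
betas = np.linspace(1e-4, 2e-2, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

# x_1 = sqrt(alpha_1) * x_0 + sqrt(1 - alpha_1) * eps  (eq. (4), DDIM paper)
alpha_1 = alphas_cumprod[0]
print(np.sqrt(alpha_1))        # ~0.99995 -> x_0 is barely rescaled
print(np.sqrt(1.0 - alpha_1))  # ~0.01    -> only about 1% noise in x_1
```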

Hope that helps.

haohang96 commented 11 months ago

Thanks very much for your response, I got it!