Closed dlutzzw closed 1 year ago
Thanks for your interest in our paper! In equation 3:
$$ \hat{z}_{t} = \sqrt{\alpha_{t}} \; \frac{\hat{z}_{t-1} - \sqrt{1-\alpha_{t-1}}\,\varepsilon_\theta}{\sqrt{\alpha_{t-1}}} + \sqrt{1-\alpha_{t}}\,\varepsilon_\theta. $$
each $\varepsilon_\theta$ stands for $\varepsilon_\theta(z_{t-1}, t-1, p_{src})$, where the input is the latent at timestep $t-1$. We omitted the input to keep the equation succinct. We will make this clearer in our next arXiv version.
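As a sketch, the inversion step in Equation 3 can be written as a single function. This is our own reading, not the released code; `ddim_inversion_step` and its argument names are illustrative, and the cumulative alphas are assumed to be precomputed scalars or tensors:

```python
import torch

def ddim_inversion_step(z_prev, eps, alpha_t, alpha_prev):
    """One deterministic DDIM inversion step (a hedged reading of Eq. 3).

    z_prev:     latent z_{t-1}
    eps:        eps_theta(z_{t-1}, t-1, p_src), evaluated at the previous step
    alpha_t:    cumulative alpha at timestep t
    alpha_prev: cumulative alpha at timestep t-1
    """
    # Predicted clean latent from the t-1 state (first factor of Eq. 3).
    z0 = (z_prev - (1 - alpha_prev).sqrt() * eps) / alpha_prev.sqrt()
    # Re-noise the prediction up to timestep t (second term of Eq. 3).
    return alpha_t.sqrt() * z0 + (1 - alpha_t).sqrt() * eps
```

With `eps = 0` the step reduces to a pure rescaling by `sqrt(alpha_t / alpha_prev)`, which is a quick way to sanity-check the implementation.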
Hi, @dlutzzw . Do you have other questions that I can help you with?
Excellent work! I am trying to implement DDIM inversion using the source code of DDIM, but I have not been successful. This is the code I rewrote. Can you please take a look and see if there are any problems?

```python
def generalized_steps(x, seq, model, b, **kwargs):
    with torch.no_grad():
        n = x.size(0)
        seq_next = [-1] + list(seq[:-1])
        x0_preds = []
        xs = [x]
        for i, j in zip(reversed(seq), reversed(seq_next)):
            t = (torch.ones(n) * i).to(x.device)
            next_t = (torch.ones(n) * j).to(x.device)
            at = compute_alpha(b, t.long())
            at_next = compute_alpha(b, next_t.long())
            xt = xs[-1].to('cuda')
            et = model(xt, t)
            x0_t = (xt - et * (1 - at).sqrt()) / at.sqrt()
            x0_preds.append(x0_t.to('cpu'))
            c1 = (
                kwargs.get("eta", 0) * ((1 - at / at_next) * (1 - at_next) / (1 - at)).sqrt()
            )
            c2 = ((1 - at_next) - c1 ** 2).sqrt()
            xt_next = at_next.sqrt() * x0_t + c1 * torch.randn_like(x) + c2 * et
            xs.append(xt_next.to('cpu'))
    return xs, x0_preds


def generalized_reverse_steps(x0, seq, model, b):
    with torch.no_grad():
        n = x0.size(0)
        seq_next = [-1] + list(seq[:-1])
        rx0_preds = []
        rxs = [x0]
        for i, j in zip(seq_next, seq):
            t = (torch.ones(n) * i).to(x0.device)
            last_t = (torch.ones(n) * j).to(x0.device)
            at = compute_alpha(b, t.long())
            at_last = compute_alpha(b, last_t.long())
            rxt = rxs[-1].to('cuda')
            et = model(rxt, t)
            rx0_t = (rxt - et * (1 - at).sqrt()) / at.sqrt()
            rx0_preds.append(rx0_t.to('cpu'))
            c2 = (1 - at_last).sqrt()
            xt_last = at_last.sqrt() * rx0_t + c2 * et
            rxs.append(xt_last.to('cpu'))
    return rxs
```
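One way to check the algebra independently of the model: with `eta = 0` and a *frozen* noise prediction, the DDIM denoising update and the inversion update are exact inverses of each other. The sketch below is a CPU-only toy with arbitrary placeholder alpha values, not a real noise schedule or UNet (in the real loop `eps` is re-evaluated at each step, so the round trip is only approximate):

```python
import torch

def ddim_denoise(xt, eps, at, at_next):
    # Deterministic DDIM step from alpha-level `at` to `at_next` (c1 = 0 when eta = 0).
    x0 = (xt - eps * (1 - at).sqrt()) / at.sqrt()
    return at_next.sqrt() * x0 + (1 - at_next).sqrt() * eps

def ddim_invert(xt, eps, at, at_last):
    # The inversion step: same formula, run in the opposite direction.
    x0 = (xt - eps * (1 - at).sqrt()) / at.sqrt()
    return at_last.sqrt() * x0 + (1 - at_last).sqrt() * eps

torch.manual_seed(0)
x = torch.randn(4)
eps = torch.randn(4)                       # frozen stand-in for the model output
at, at_next = torch.tensor(0.5), torch.tensor(0.9)

x_back = ddim_invert(ddim_denoise(x, eps, at, at_next), eps, at_next, at)
print(torch.allclose(x_back, x, atol=1e-5))  # True
```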
Thanks for the quick reply, I didn't reply yesterday because of the course.
Thank you for your explanation of eps_theta. However, I derived this part following Section 3.1 of the Zero-shot Image-to-Image Translation paper referenced in your paper, and tried to reproduce the derivation; the specific steps are shown below. My question is: if, as you answered, the eps_theta terms in Equation 3 are all evaluated at t-1, then there is a gap between the final formula and my derivation (the part written in green). My derivation always yields eps(t+1), which corresponds to eps_theta(t) in Equation 3. Can you help me understand why?
Hi, AshleyRm @AshleyRm . Sorry, but debugging the code only from raw text is tricky, especially for complex systems like DDIM. It is better to directly use our DDIM implementation here https://github.com/ChenyangQiQi/FateZero/blob/85321be485959d2d285044f75cf75357c90f1e14/video_diffusion/pipelines/p2pDDIMSpatioTemporalPipeline.py#L151
@dlutzzw Hi, dlutzzw. Your derivation is also correct. The gap is actually caused by the start index of the denoising step (0 for the above or 1 for the below). If you replace each $t$ in the above line with $t-1$, you will get the bottom line. These two equations are basically equivalent. Please check the equations in DDIM original paper again.
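A compact way to see the index shift, using the same symbols as Equation 3 (this is our restatement of the equivalence, under the assumption that only the starting index of the step counter differs):

```latex
% "Start at 0" indexing:
\hat{z}_{t+1} = \sqrt{\alpha_{t+1}}\,
  \frac{\hat{z}_{t} - \sqrt{1-\alpha_{t}}\,\varepsilon_\theta}{\sqrt{\alpha_{t}}}
  + \sqrt{1-\alpha_{t+1}}\,\varepsilon_\theta
% Substituting t -> t-1 recovers Eq. 3 ("start at 1" indexing):
\hat{z}_{t} = \sqrt{\alpha_{t}}\,
  \frac{\hat{z}_{t-1} - \sqrt{1-\alpha_{t-1}}\,\varepsilon_\theta}{\sqrt{\alpha_{t-1}}}
  + \sqrt{1-\alpha_{t}}\,\varepsilon_\theta
```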
Thank you!
@ChenyangQiQi Terribly sorry; maybe you misunderstood my question because of my laziness and inconsistent notation. My question is simple: as shown in the figure below, the two eps terms in Equation 1 are not the same, but the formula in the paper (Equation 2 in the figure) uses the same eps_theta for both, which confuses me.
@dlutzzw Thanks, I understand your question now. We will fix these two misleading equations in our next version.
Hi, @ChenyangQiQi , thank you for your patience. I found the details about this part in the DDIM paper, which I ignored when I read it before. Now I figured it out, thank you very much, I will close this issue.
@ChenyangQiQi Hi, I would like to ask your advice. You used attention fusion to compensate for the poor reconstruction quality of DDIM inversion. Have you tried the technique from the earlier Null-text Inversion paper? I don't know which one works better.
I think the core ideas are the same. Null-text Inversion optimizes through an MSE loss, which may be more refined.
Hi, thanks for your exciting work, but I'm confused by a problem while reading the paper and want to ask for your help. In the DDIM inversion process, eps_theta(x_t) in Equation 3 still depends on x_t, so how can I obtain x_t from x_{t-1}?
I re-read Zero-shot Image-to-Image Translation and DDIM, but I am still confused. Can you help me with this question?