bloc97 / CrossAttentionControl

Unofficial implementation of "Prompt-to-Prompt Image Editing with Cross Attention Control" with Stable Diffusion

About the finite difference gradient descent method #25

Open BrandonHanx opened 1 year ago

BrandonHanx commented 1 year ago

Hi @bloc97 ,

Thanks for your great work.

Do you know of any other papers or implementations that use finite-difference gradient descent to do inversion? I'd like more references for this approach.

Also, could you please give more hints about the magic number `tless`?

bloc97 commented 1 year ago

Hi, unfortunately there isn't any reference for the "ad hoc" method I used to compensate for CFG, but I can give a quick explanation; if you have more questions we can discuss this further...

Because the ODEs used in diffusion models are somewhat sensitive to initial conditions, using the CFG "vector" at t-1 to invert and find the latent at t does not give the correct answer (this is seen in the fact that it is not always possible to invert a generated image back to its latent when the CFG scale is high). The correct answer is found by determining which CFG vector at t maps onto the correct latent at t-1, but since we do not know the latent at t in the first place, how can we find that CFG vector? One solution is a gradient-descent approximation: first use the "wrong" CFG vector (the one at t-1) to get an approximation of the latent at t, then do a forward diffusion pass to re-obtain the latent at t-1, compute the difference against the known latent at t-1, and use that difference to do gradient descent on the CFG vector.
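
To make the quantity being minimized concrete, here is a minimal sketch (not this repo's code; it assumes a diffusers-style UNet call and the standard deterministic DDIM update with eta=0, and all names such as `cfg_eps`, `inversion_residual` and `acp` are hypothetical placeholders). The residual below is what the gradient descent on the CFG vector would try to drive to zero:

```python
import torch

def cfg_eps(unet, x, t, cond, uncond, w):
    # Classifier-free guidance "vector": eps_uncond + w * (eps_cond - eps_uncond)
    eps_u = unet(x, t, encoder_hidden_states=uncond).sample
    eps_c = unet(x, t, encoder_hidden_states=cond).sample
    return eps_u + w * (eps_c - eps_u)

@torch.no_grad()
def inversion_residual(unet, x_prev, x_t_guess, t, t_prev, acp, cond, uncond, w):
    """Mismatch between the known x_{t-1} and the x_{t-1} reached by a forward
    DDIM step (eta=0) from the current guess of x_t. acp = alphas_cumprod tensor."""
    a_t, a_prev = acp[t], acp[t_prev]
    eps = cfg_eps(unet, x_t_guess, t, cond, uncond, w)      # eps evaluated at the guess of x_t
    x0 = (x_t_guess - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # predicted clean latent
    x_prev_pred = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x_prev_pred - x_prev                             # we want this to be ~0
```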

In my simple implementation, I assume that the latent landscape near our point of interest (the latent at t) is convex and smooth (which is most likely wrong), so I do gradient descent directly on the latent at t using the difference between the ground-truth and predicted latents at t-1. (The numerically correct method would be to backprop through the model twice, but that would be too slow...) The solution provided here is literally an approximation of an approximation, but it works quite well for images generated by Stable Diffusion. In my tests, images produced with a CFG scale of up to 5.5 can be inverted reasonably well. For real images, the results are satisfactory in most cases up to a CFG scale of 4.5, but some images cannot be inverted at all.
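
A rough sketch of that shortcut, reusing the hypothetical helpers from the block above (again not this repo's code; the step size and iteration count are placeholders, not the values used here):

```python
@torch.no_grad()
def invert_step_approx(unet, x_prev, t, t_prev, acp, cond, uncond, w,
                       n_iter=10, step_size=0.5):
    """One inversion step x_{t-1} -> x_t, correcting the latent itself.

    The map from x_t to the re-diffused x_{t-1} is treated as roughly
    identity-like near the solution (the convex/smooth assumption above),
    so the residual is used directly in place of a true gradient.
    """
    a_t, a_prev = acp[t], acp[t_prev]
    # Naive inversion: reuse the "wrong" eps predicted at x_{t-1}.
    eps = cfg_eps(unet, x_prev, t_prev, cond, uncond, w)
    x0 = (x_prev - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps
    for _ in range(n_iter):
        r = inversion_residual(unet, x_prev, x_t, t, t_prev, acp, cond, uncond, w)
        x_t = x_t - step_size * r  # descend on the latent, not on the CFG vector
    return x_t
```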

As for the magic number, it was found empirically. If `tless` is not used, the result sometimes diverges when re-diffusing the inverted latent and you get a completely grey image.