When doing source image DDIM inversion, should the text prompt be empty?

g-jing commented 1 year ago

In the paper, when doing DDIM inversion, the text prompt is a description of the real image. But in your code, the text prompt is an empty string. Could you confirm that? Thanks a lot. By the way, great work!

ljzycmd commented 1 year ago

Hi @g-jing, thanks for your attention. As stated in the footnote on page 5 of our manuscript, when $I_s$ is a real image, we set the source text prompt $P_s$ to null and use deterministic DDIM inversion to invert the image into a noise map. Note that DDIM inversion with a text prompt and classifier-free guidance often fails to reconstruct the image [1,2], so we set the prompt to null to ensure the quality of the source image reconstruction. You can also use the intermediate features from the inversion process for more faithful editing results (shown in the demo playground_real.ipynb). By the way, you can also try other inversion methods, such as null-text inversion [1], in our editing framework. :smiley:

[1] Null-text inversion: https://arxiv.org/abs/2211.09794

[2] Plug-and-Play Diffusion Features: https://arxiv.org/abs/2211.12572
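For readers following along, here is a minimal NumPy sketch of deterministic DDIM inversion with a null condition, as described in the reply above. It is not the repository's code: `ddim_invert`, `ddim_sample`, `toy_eps`, and the noise schedule are hypothetical names and values chosen for illustration, and `toy_eps` is a constant stand-in for the U-Net noise predictor.

```python
import numpy as np

def ddim_invert(x0, eps_model, alphas_cumprod, cond=None):
    """Map a clean latent to a noise map with deterministic DDIM steps (eta = 0)."""
    x, alpha_t = x0, 1.0  # the clean latent corresponds to cumulative alpha = 1
    for t, alpha_next in enumerate(alphas_cumprod):
        eps = eps_model(x, t, cond)  # cond=None plays the role of the null prompt
        pred_x0 = (x - np.sqrt(1.0 - alpha_t) * eps) / np.sqrt(alpha_t)
        x = np.sqrt(alpha_next) * pred_x0 + np.sqrt(1.0 - alpha_next) * eps
        alpha_t = alpha_next
    return x

def ddim_sample(x_T, eps_model, alphas_cumprod, cond=None):
    """Run the same deterministic update in the denoising direction."""
    x = x_T
    for t in reversed(range(len(alphas_cumprod))):
        alpha_t = alphas_cumprod[t]
        alpha_prev = alphas_cumprod[t - 1] if t > 0 else 1.0
        eps = eps_model(x, t, cond)
        pred_x0 = (x - np.sqrt(1.0 - alpha_t) * eps) / np.sqrt(alpha_t)
        x = np.sqrt(alpha_prev) * pred_x0 + np.sqrt(1.0 - alpha_prev) * eps
    return x

# Toy setup (assumption): a constant noise predictor, so the round trip is exact.
# A real network depends on x and t, so reconstruction is only approximate.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8, 8))
toy_eps = lambda x, t, cond: np.full_like(x, 0.3)
alphas_cumprod = np.linspace(0.99, 0.05, 10)  # least noisy to most noisy

noise_map = ddim_invert(x0, toy_eps, alphas_cumprod, cond=None)
recon = ddim_sample(noise_map, toy_eps, alphas_cumprod, cond=None)
print("max reconstruction error:", np.abs(recon - x0).max())
```

The inversion and sampling loops use the same noise estimate at both ends of each step only in this toy case. With a real model the two estimates differ slightly, and adding a text prompt with classifier-free guidance amplifies that mismatch, which is consistent with the reconstruction failures mentioned above.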

g-jing commented 1 year ago

Your response is very detailed. I have some further questions:

1. If you do not use a source prompt for source image reconstruction, how do you get P_s, I_s, and M_s in Equation 6?
2. In Equation 5, why do you replace K and V but keep Q? The P2P paper found that the QK result can represent the object mask, so why not replace QK as well, instead of using only Q? Did you find that the Q*K result cannot represent object shape in the self-attention layers?
3. For Equation 6, if I understand correctly, M_s is applied to the result of Q*K, and the resulting attention map is then multiplied by V to get f. Please correct me if I am wrong. Also, is this mask step implemented in the code?

Thanks for your response!

ljzycmd commented 1 year ago

Hi @g-jing,

1. Actually, the cross-attention map with null text tokens can still be used to extract masks associated with the foreground object, so we can obtain $M_s$ for the source image. In Eq. 6, $I_s$ is the input real image and $P_s$ is the null text. Besides extraction from cross-attention maps, the mask can also be obtained from existing segmentation models.
2. In Eq. 5, we use the query Q of the target image to query contents from the source image, since the query features of the source and target images are very similar (shown in Fig. 4(b)). In P2P, the cross-attention map can represent the object shape, so layout-fixed editing can be performed by directly modifying the text prompt, yet it cannot perform content-consistent and non-rigid editing. In self-attention, we also find that the self-attention maps can maintain the image layout, which is similar to the observations in [1]. However, using QK cannot keep the source contents unchanged! In other words, the synthesized image is content-inconsistent. I will add some cases later.
3. Your understanding is correct. With the mask, attention can only query information within the restricted regions [object_source<-->object_target, background_source<-->background_target], which alleviates the confusion between foreground and background.

[1] Plug-and-Play Diffusion Features: https://arxiv.org/abs/2211.12572
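The attention replacement and mask step discussed above can be sketched as follows. This is one reading of Eq. 5 and Eq. 6 based on this thread, not the repository's implementation: the function names, token counts, and the hard blending with a target mask `mask_t` are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, key_mask=None):
    """Scaled dot-product attention; key_mask (bool, [n_keys]) hides masked-out keys.
    The mask must leave at least one key visible."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if key_mask is not None:
        scores = np.where(key_mask[None, :], scores, -np.inf)
    return softmax(scores) @ v

def masked_mutual_self_attention(q_t, k_s, v_s, mask_s, mask_t):
    """Target queries read source keys/values, restricted region by region."""
    f_obj = attention(q_t, k_s, v_s, key_mask=mask_s)   # source foreground only
    f_bg = attention(q_t, k_s, v_s, key_mask=~mask_s)   # source background only
    # Target foreground tokens take f_obj, target background tokens take f_bg.
    return np.where(mask_t[:, None], f_obj, f_bg)

# Toy check (assumption): source values are 1 on the foreground and 0 elsewhere,
# so the output shows directly which source region each target token read from.
rng = np.random.default_rng(0)
n_tokens, dim = 6, 4
q_t = rng.normal(size=(n_tokens, dim))   # queries from the target branch
k_s = rng.normal(size=(n_tokens, dim))   # keys from the source branch
mask_s = np.array([True, True, True, False, False, False])
mask_t = np.array([False, True, True, True, False, False])
v_s = np.repeat(mask_s[:, None].astype(float), dim, axis=1)

f = masked_mutual_self_attention(q_t, k_s, v_s, mask_s, mask_t)
print(f[:, 0])
```

Without the two masks, `attention(q_t, k_s, v_s)` corresponds to the plain replacement in Eq. 5, where a target foreground query is free to attend to source background tokens.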

g-jing commented 1 year ago

Hi @ljzycmd, thanks a lot! Besides replacing QK (mentioned above) and replacing KV, did you test other types of replacement, such as replacing V or QV? Also, has the mask step been implemented in the codebase yet?

ljzycmd commented 1 year ago

Hi @g-jing, we also tried other types of replacement, but they gave unsatisfactory results (I will add some cases here later). The mask extraction strategy from cross-attention maps is implemented in masactrl/masactrl.py, so you can refer to it for more details. Hope this helps. :smiley:
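As a rough illustration of extracting a mask from cross-attention maps, here is a small NumPy sketch. It is not a copy of masactrl/masactrl.py: the function name `extract_mask`, the tensor layout, the min-max normalization, and the threshold value are all assumptions, so please check the repository file for the actual procedure.

```python
import numpy as np

def extract_mask(cross_attn, token_idx, threshold=0.1):
    """Turn cross-attention maps into a binary spatial mask.

    cross_attn: array of shape [heads, H*W, n_tokens] (assumed layout).
    token_idx:  index of the text token whose map is used (hypothetical choice).
    threshold:  cut-off applied after min-max normalization (assumed value).
    """
    attn = cross_attn[..., token_idx].mean(axis=0)            # average heads -> [H*W]
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    side = int(round(np.sqrt(attn.shape[0])))                  # assumes a square map
    return attn.reshape(side, side) >= threshold

# Toy input (assumption): token 1 attends strongly to the centre of a 4x4 map.
heads, side, n_tokens = 2, 4, 3
cross_attn = np.full((heads, side * side, n_tokens), 0.01)
centre = [5, 6, 9, 10]                                         # flattened centre pixels
cross_attn[:, centre, 1] = 0.9

mask = extract_mask(cross_attn, token_idx=1)
print(mask.astype(int))
```

In practice the maps would usually be averaged over several layers and denoising steps as well as heads, and the resulting mask resized to the resolution of each self-attention layer before it is used.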

TencentARC / MasaCtrl, issue #6