g-jing opened this issue 1 year ago
Hi @g-jing, thanks for your attention. As noted in the footnote on page 5 of our manuscript, when $I_s$ is a real image, we set the source text prompt $P_s$ to null and use deterministic DDIM inversion to invert the image into a noise map. Note that DDIM inversion with a text prompt and classifier-free guidance often fails to reconstruct the image [1,2], so we set the prompt to null to ensure a high-quality reconstruction of the source image. You can also use the intermediate features from the inversion process for more faithful editing results (shown in the demo playground_real.ipynb). By the way, you can also try other inversion methods, such as null-text inversion [1], in our editing framework. :smiley:
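For intuition, the deterministic DDIM inversion described above can be sketched in a few lines of NumPy. This is only an illustration of the update rule, not the repository's implementation: `eps_model` stands in for the U-Net's null-prompt noise prediction, and the alpha schedule and all names here are placeholders.

```python
import numpy as np

def ddim_inversion(x0, eps_model, alphas_cumprod):
    """Deterministically map a clean image x0 to a noise map by
    running the DDIM update in reverse (noise level t -> t+1),
    with no text conditioning (null prompt)."""
    x = x0
    T = len(alphas_cumprod)  # alphas_cumprod decreases from ~1 toward 0
    for t in range(T - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_model(x, t)  # null-prompt noise prediction
        # Clean image implied by the current x_t.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # Deterministic DDIM step toward the higher noise level.
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x

def ddim_sampling(xT, eps_model, alphas_cumprod):
    """Reverse the inversion: map the noise map back toward x0."""
    x = xT
    T = len(alphas_cumprod)
    for t in range(T - 1, 0, -1):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        eps = eps_model(x, t)
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x
```

The inversion is only approximately invertible in practice, because the sampling pass evaluates the noise predictor at slightly different inputs than the inversion pass did; this mismatch is exactly why a prompt with strong classifier-free guidance can break the reconstruction.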
[1] Null-text inversion: https://arxiv.org/abs/2211.09794
[2] Plug-and-Play Diffusion Features: https://arxiv.org/abs/2211.12572
Your response is very detailed. I have some further questions:
Thanks for your response!
Hi @g-jing,
[1] Plug-and-Play Diffusion Features: https://arxiv.org/abs/2211.12572
Hi @ljzycmd, thanks a lot! Besides replacing Q and K (mentioned above) and replacing K and V, did you test other types of replacement, such as replacing only V, or Q and V? Also, has the mask step been implemented in the codebase yet?
Hi @g-jing, we also tried other types of replacement, but the results were unsatisfactory (I will add some examples here later). The mask extraction strategy from cross-attention maps is implemented in masactrl/masactrl.py, so you can refer to it for more details. Hope this helps. :smiley:
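For readers skimming the thread, the K/V replacement discussed above can be sketched as follows. This is a minimal NumPy illustration of the idea (the target branch keeps its own queries Q but attends to the source image's keys and values), not the code in masactrl/masactrl.py; the mask helper and all function names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    """Plain scaled dot-product attention over flattened spatial tokens."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def mutual_self_attention(q_tgt, k_src, v_src):
    """Mutual self-attention control: the target branch keeps its own
    queries but uses the *source* branch's keys and values, so the
    edited image queries its contents from the source image."""
    return self_attention(q_tgt, k_src, v_src)

def mask_from_cross_attention(attn_maps, threshold=0.5):
    """Toy mask extraction: average the cross-attention maps for the
    object token over heads/layers, normalize, and threshold."""
    avg = attn_maps.mean(axis=0)
    avg = avg / avg.max()
    return avg > threshold
```

Because the attention weights form a convex combination, each output token of `mutual_self_attention` stays within the range of the source values, which is one way to see why this replacement preserves the source image's content.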
In the paper, when performing DDIM inversion, the text prompt is a description of the real image, but in your code the text prompt is an empty string. Could you confirm which is intended? Thanks a lot. By the way, great work!