Yujun-Shi / DragDiffusion

[CVPR2024, Highlight] Official code for DragDiffusion
https://yujun-shi.github.io/projects/dragdiffusion.html
Apache License 2.0

Thank you for your wonderful work! #56

Open lyf1212 opened 6 months ago

lyf1212 commented 6 months ago

It's amazing to see such wonderful editing results from just a finger drag! Simple but effective work, with sharp observations about the UNet features. As a junior student majoring in CS, I have some naive questions about this work:

  1. Why did you choose LoRA to preserve identity against distortion during editing? Have you tried other common training tricks, for example finetuning the VAE encoder/decoder? (A rough sketch of the kind of LoRA setup I mean is at the end of this comment.)
  2. From my perspective, modifying the diffusion latents during sampling is hard to interpret at large timesteps t, because the SNR there is so low. So I am curious what the ablation in Fig. 6 of your arXiv paper would look like without LoRA finetuning and with t=50: would the result be degraded and fall outside the natural image distribution? (A small SNR calculation is included below to illustrate what I mean.)
  3. Your work suggests that spatial information and semantic features are correlated in the latent space of diffusion models. So would smoothly merging two images through the latent space be practical? Do you have any proposals in this direction, now or for the future? (A tiny latent-interpolation sketch is also attached below.)

Some of my questions may be naive, but if you could take the time to reply to one or two of them, I would really appreciate it~
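To make question 1 concrete, here is a minimal, plain-PyTorch sketch of the kind of LoRA adapter I am imagining: a frozen base linear layer plus a trainable low-rank update. It is not taken from the DragDiffusion code; the rank, alpha, and the idea of wrapping attention projections are just my assumptions for illustration.

```python
# Minimal sketch of a LoRA adapter on a linear layer.
# NOT the DragDiffusion implementation; rank/alpha values and the choice of
# which projections to wrap are illustrative assumptions only.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: W x + (alpha/r) * B(A(x))."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep the original weights frozen
            p.requires_grad_(False)
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)   # start as a no-op so training begins at the base model
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_up(self.lora_down(x))


# Toy "attention projection" to show the injection pattern; in a real UNet one
# would wrap e.g. the to_q / to_k / to_v / to_out projections the same way.
attn_proj = LoRALinear(nn.Linear(320, 320), rank=4, alpha=4.0)
x = torch.randn(2, 77, 320)
print(attn_proj(x).shape)  # torch.Size([2, 77, 320])
```

My understanding is that because only the small low-rank matrices are trained, the finetuned UNet stays close to the original model, which is why I wondered whether a similarly lightweight adaptation of the VAE was ever considered.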
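For question 2, this is the small calculation behind my intuition about low SNR at large t, using SNR(t) = alpha_bar_t / (1 - alpha_bar_t). It assumes a standard DDPM-style linear beta schedule with 1000 steps, which may differ from the exact schedule used in the repo; the point is only the trend.

```python
# Rough illustration of why late timesteps have low SNR. The linear beta
# schedule below is an assumption (the repo / Stable Diffusion may use a
# different schedule); only the trend of SNR(t) matters here.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # assumed DDPM-style schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # alpha_bar_t
snr = alphas_cumprod / (1.0 - alphas_cumprod)       # SNR(t) = alpha_bar_t / (1 - alpha_bar_t)

for t in (100, 500, 900):
    print(f"t={t:4d}  SNR={snr[t].item():.4f}")
```

If "t=50" in the ablation means inverting through all 50 DDIM steps, that corresponds to one of the largest diffusion timesteps, where the latent is nearly pure noise, which is why I wondered whether editing there without LoRA would break the image.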
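And for question 3, this is the kind of "merging through latent space" I had in mind: a generic spherical interpolation (slerp) between two latent tensors. The 1x4x64x64 latent shape is only an assumption, and obtaining the two latents (e.g. via DDIM inversion) and decoding the result back to an image are omitted.

```python
# Generic spherical interpolation (slerp) between two latent tensors, as a
# sketch of "merging two images through the latent space". Latent shapes are
# assumptions; inversion and decoding are omitted.
import torch


def slerp(z0: torch.Tensor, z1: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Interpolate along the great circle between two flattened latents."""
    a, b = z0.flatten(), z1.flatten()
    cos_theta = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + eps), -1.0, 1.0)
    theta = torch.acos(cos_theta)
    if theta.abs() < eps:                      # nearly parallel: fall back to a linear blend
        return (1 - t) * z0 + t * z1
    w0 = torch.sin((1 - t) * theta) / torch.sin(theta)
    w1 = torch.sin(t * theta) / torch.sin(theta)
    return (w0 * a + w1 * b).reshape(z0.shape)


z_a = torch.randn(1, 4, 64, 64)  # latent of image A (assumed shape)
z_b = torch.randn(1, 4, 64, 64)  # latent of image B
z_mid = slerp(z_a, z_b, t=0.5)
print(z_mid.shape)  # torch.Size([1, 4, 64, 64])
```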