Lingzhi-Pan / PILOT

Official implementation of the work "Coherent and Multi-modality Image Inpainting via Latent Space Optimization"

Implementation details #2

Closed: CharlesGong12 closed this issue 2 weeks ago

CharlesGong12 commented 1 month ago

Hi, what excellent work!

I have a few questions about the paper.

  1. You set the coherence scale $\gamma$ to 1, so the blending stage mentioned in Section 4.2 is not used at all?
  2. Most existing inpainting methods set the number of DDIM steps to 50, but you set it to 200. Do you use 200 steps only for PILOT, or for all the baselines as well? This affects the fairness of the comparison.
  3. For the MSCOCO dataset, what prompt do you use: the whole caption "a blue bike parked on a sidewalk" or just the object class "a blue bike"?
  4. Did you run the strongest baselines, PowerPaint and BrushNet? These two methods seem to perform extremely well.
  5. Eq. 15 sets the attention between the inpainting region and the existing region to negative infinity, which I would expect to damage image coherence. Yet you state that the goal of Eq. 15 is that "the inpainted region has to be coherent with the background region and have no significant semantic differences". Could you explain this equation?

Looking forward to your reply!

Lingzhi-Pan commented 3 weeks ago

Hi, here is my response. For questions 1 and 4: we will include the corresponding results in the final version of our paper.

  2. We use 200 steps for all baselines.
  3. Only the object class.
  5. Yes, it can damage the coherence of the image to some extent, but it ensures that the semantics do not overflow into the masked area. Therefore, we only apply this strategy during the initial stages of semantic formation, as mentioned in our paper. In the later denoising steps we drop it, allowing interaction between the masked area and the background, which is sufficient to achieve information consistency (see the sketch below). Our extensive experiments have demonstrated that this approach effectively maintains image coherence. Thank you for your attention to our work :)
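For readers who want to see what this Eq. 15-style masking might look like in practice, here is a minimal PyTorch sketch; the function and argument names (`masked_self_attention`, `region_mask`, `early_phase`) are illustrative assumptions, not PILOT's actual code.

```python
import torch

def masked_self_attention(q, k, v, region_mask, early_phase):
    """Self-attention with optional region isolation (sketch).

    q, k, v:      (batch, tokens, dim) query/key/value tensors
    region_mask:  (tokens,) bool tensor, True for tokens inside
                  the inpainting region
    early_phase:  if True, block attention between the inpainting
                  region and the background (Eq. 15-style masking)
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, tokens, tokens)

    if early_phase:
        # True wherever the query and key tokens lie in different regions
        cross_region = region_mask[:, None] ^ region_mask[None, :]
        scores = scores.masked_fill(cross_region, float("-inf"))

    return scores.softmax(dim=-1) @ v
```

Once `early_phase` is switched off in the later denoising steps, the cross-region logits are no longer masked and the two regions exchange information freely, which is what restores coherence.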
CharlesGong12 commented 3 weeks ago

Many thanks for your reply!

So is the blending stage indeed not used at all? I find that in the ablation study gamma is recommended to be 0.5, but in the experimental setup you set it to 1. The ablation study also discusses the effect of this stage, e.g., that the UNet can achieve coherence in the boundary area. So why do you discard the blending stage?

Also, would you mind sharing your email address? The paper does not include your contact information, and I would like to discuss further.

Lingzhi-Pan commented 2 weeks ago

Hi, the blending stage is designed to speed up the inpainting process, since optimizing the latent space throughout the entire denoising trajectory is time-consuming. We found that stable semantics can be formed by optimizing the latent space during the earlier denoising steps; in the subsequent steps, skipping the optimization and using latent blending instead still produces coherent images while saving computation time. In our paper we introduce a coherence scale, gamma, to balance image coherence against computation time: setting gamma to 0.5 generally yields stable, high-quality images, and the larger gamma is, the more coherent the images become.

My email is lz.pan@outlook.com; feel free to reach out and discuss.
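To make the gamma trade-off concrete, below is a minimal sketch of the two-phase loop described above, assuming a standard DDIM-style sampler; `optimize_latent`, `ddim_step`, and `add_noise` are hypothetical placeholders rather than PILOT's actual API.

```python
def inpaint(z, bg_latent, mask, num_steps, gamma,
            optimize_latent, ddim_step, add_noise):
    """Two-phase inpainting loop (sketch).

    z:          initial noise latent
    bg_latent:  clean latent of the original (background) image
    mask:       1 inside the inpainting region, 0 outside
    gamma:      coherence scale in [0, 1]; the fraction of steps
                that use latent optimization before switching to
                cheap latent blending (gamma = 1 means the blending
                stage is never entered)
    """
    switch = int(gamma * num_steps)
    for i, t in enumerate(reversed(range(num_steps))):
        if i < switch:
            # Phase 1: optimize the latent so the masked region
            # forms stable semantics (the expensive part)
            z = optimize_latent(z, t, mask)
            z = ddim_step(z, t)
        else:
            # Phase 2: plain denoising plus latent blending; the
            # background is replaced by a noised copy of the
            # original image's latent at the matching noise level
            z = ddim_step(z, t)
            z = mask * z + (1 - mask) * add_noise(bg_latent, t)
    return z
```

With gamma = 1, `switch == num_steps` and the blending branch never runs, which matches the experimental setup discussed above; smaller values of gamma trade some coherence for speed.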