ShihaoZhaoZSH / Uni-ControlNet

[NeurIPS 2023] Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

questions about the paper #6

Closed betterze closed 1 year ago

betterze commented 1 year ago

Dear Shihao Zhao,

Thank you for sharing this nice work. I really like it.

  1. Could you show more qualitative comparisons between Injection-S2 and the full method under multiple conditions? In Fig 7, I only see one case (elephant temple) in which Injection-S2 is clearly worse than the full method.
  2. In Fig 15, is the input to the feature extractor a fixed-shape tensor of local conditions? When you drop a specific condition, do you set the corresponding channels to zero? Did you mask out different parts of the local conditions, so that each condition controls a different spatial region?

Thank you for your help.

Best Wishes,

Alex

ShihaoZhaoZSH commented 1 year ago
  1. In Fig 7, for the "Gorilla wearing glasses" case, the gorilla's eyes and glasses are not properly integrated. Here are more examples: in the first row, the Stormtrooper's hands and the trunk are not well merged, and in the two cases below, the background elements (trees, stones) produced by Injection-S2 do not match the depth map well.

    [Image: additional qualitative comparisons between Injection-S2 and the full method]
  2. i. Yes, it is a fixed-shape tensor. ii. Yes, when dropout occurs, the corresponding channels are set to zero. iii. No, there is no mask, but masking out different parts for different local conditions could be one potential approach. We will provide more details in the training code.
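
For reference, here is a minimal PyTorch sketch of the setup described above: local condition maps concatenated along the channel dimension into a fixed-shape tensor, with the channels of a dropped condition set to zero. The function name, shapes, and dropout probabilities are illustrative assumptions, not the repository's actual code.

```python
import torch

def assemble_local_conditions(cond_maps, drop_probs):
    """Stack local condition maps (e.g. edge, depth, pose) into one
    fixed-shape tensor along the channel dimension, zeroing the channels
    of any condition that is dropped for a given sample.

    cond_maps:  list of tensors, each of shape (B, C_i, H, W)
    drop_probs: per-condition dropout probability
    """
    kept = []
    for cond, p in zip(cond_maps, drop_probs):
        # Per-sample dropout: zero the entire condition with probability p.
        keep = (torch.rand(cond.shape[0], 1, 1, 1, device=cond.device) > p).float()
        kept.append(cond * keep)
    # The concatenated tensor always has the same channel count, so the
    # feature extractor sees a fixed input shape no matter which
    # conditions are active.
    return torch.cat(kept, dim=1)

# Example usage with two hypothetical single-channel conditions.
edge = torch.randn(4, 1, 512, 512)
depth = torch.randn(4, 1, 512, 512)
x = assemble_local_conditions([edge, depth], drop_probs=[0.3, 0.3])
# x.shape == (4, 2, 512, 512)
```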

betterze commented 1 year ago

Thanks for your reply. I really appreciate it.