ShihaoZhaoZSH / Uni-ControlNet

[NeurIPS 2023] Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

questions about the paper #6

Closed betterze closed 1 year ago

betterze commented 1 year ago

Dear Shihao Zhao,

Thank you for sharing this nice work. I really like it.

  1. Could you show more qualitative comparisons between Injection-S2 and the full method under multiple conditions? In Fig 7, I only see one case (elephant temple) in which Injection-S2 is clearly worse than the full method.
  2. In Fig 15, is the input to the feature extractor a fixed-shape tensor of local conditions? When you drop a specific condition, do you set the corresponding channels to zero? Did you mask out different parts of the local conditions, so that each condition controls a different spatial region?

Thank you for your help.

Best Wishes,

Alex

ShihaoZhaoZSH commented 1 year ago
  1. In Fig 7, for the "Gorilla wearing glasses" case, the gorilla's eyes and glasses are not properly integrated. Here are more examples: in the first row, the Stormtrooper's hands and the trunk are not well merged, and in the two cases below, the background elements (trees, stones) produced by Injection-S2 do not match the depth map well.

    [Image: additional qualitative comparisons between Injection-S2 and the full method]
  2. i. Yes, it is a fixed-shape tensor. ii. Yes, when dropout occurs, the corresponding channels are set to zero. iii. No, there is no mask, but masking out different parts for different local conditions could be one potential approach. We will provide more details in the training code.
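
For reference, here is a minimal PyTorch sketch of the setup described above: local condition maps concatenated along the channel dimension into a fixed-shape tensor, with the channels of a dropped condition set to zero. The function name, shapes, and dropout probabilities are illustrative assumptions, not the repository's actual code.

```python
import torch

def assemble_local_conditions(cond_maps, drop_probs):
    """Stack local condition maps (e.g. edge, depth, pose) into one
    fixed-shape tensor along the channel dimension, zeroing the channels
    of any condition that is dropped for a given sample.

    cond_maps:  list of tensors, each of shape (B, C_i, H, W)
    drop_probs: per-condition dropout probability
    """
    kept = []
    for cond, p in zip(cond_maps, drop_probs):
        # Per-sample dropout: zero the entire condition with probability p.
        keep = (torch.rand(cond.shape[0], 1, 1, 1, device=cond.device) > p).float()
        kept.append(cond * keep)
    # The concatenated tensor always has the same channel count, so the
    # feature extractor sees a fixed input shape no matter which
    # conditions are active.
    return torch.cat(kept, dim=1)

# Example usage with two hypothetical single-channel conditions.
edge = torch.randn(4, 1, 512, 512)
depth = torch.randn(4, 1, 512, 512)
x = assemble_local_conditions([edge, depth], drop_probs=[0.3, 0.3])
# x.shape == (4, 2, 512, 512)
```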

betterze commented 1 year ago

Thanks for your reply. I really appreciate it.