aim-uofa / FreeCustom

[CVPR 2024] Official PyTorch implementation of FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition
https://aim-uofa.github.io/FreeCustom/
MIT License
62 stars, 0 forks

Limitation on concept types #5

Open baaaad opened 2 weeks ago

baaaad commented 2 weeks ago

Thanks for the great work. I noticed that the concepts used in your paper, such as 'cat' and 'phone', are clearly separated in semantic space. However, the model performs noticeably worse on 'close' concepts like 'a man' and 'a woman', particularly when generating images for more complex prompts such as 'a man and a woman are dancing'. Is this limitation inherent to the method itself? Does the method only work well for examples like those in the paper, where the concepts interact in context and are clearly separated in semantic space?

wzic commented 2 weeks ago

I guess that's because SD1.5 is not good at handling human-related generation.

baaaad commented 2 weeks ago

> I guess that's because SD1.5 is not good at handling human-related generation

The model also struggles to generate "a dog" and "a cat" together as "a dog and cat". Switching to SD2.1 did not solve the issue either, so I have concerns about the method's ability to combine 'close' concepts.

dingangui commented 1 week ago

> Thanks for the great work. I noticed that the concepts mentioned in your paper, such as 'cat' and 'phone,' exhibit clear differences in semantic spaces. However, the model's performance is notably inadequate when handling 'close' concepts like 'a man' and 'a woman,' particularly in generating images for complex sentences such as 'a man and a woman are dancing'. Is this limitation inherent to the method itself? Does it only apply well to examples mentioned in the paper which must contain context interaction and with clear differences in semantic spaces?

Yes, we think this limitation comes from the base model's limited ability to distinguish semantically similar concepts. Our method should also be given input concepts with contextual interaction to generate better results.