Closed by kaanakan 3 weeks ago
Yes, I've tried that, basically using SD as the slot decoder. But I didn't get very far, as fine-tuning it requires a huge amount of memory. If I freeze it, I cannot learn good object-centric representations.
But you can check out the Stable-LSD variant in this work; they have shown some promising results using SD.
Thank you for your response! If possible, could you kindly share the results you have for the COCO and VOC datasets? I’d greatly appreciate it.
Best regards,
I don't think I have them anymore. Well, even in Stable-LSD, the reconstruction results are not very good TBH. Also, IIRC, it's not capable of compositional generation at all. So I don't know if this can even work: the part-whole ambiguity is too complicated in real-world data, and unsupervised decomposition is just too hard.
There is another paper you might be interested in: https://arxiv.org/abs/2407.17929 (they use more pre-trained knowledge to generate pseudo masks to supervise their slot + SD model). They can get good segmentation results, but the generation is still quite bad, I think.
Hello,
Thank you for sharing your work!
I wanted to inquire whether you've explored integrating pretrained diffusion models like Stable Diffusion v1.5 or v2.1 into your project. If so, I’d love to hear more about the results and any insights you can share.
Thanks in advance for your time and assistance!