I read your model JSON file and found that you may be referring to Stable Diffusion Image Variants v2, which is actually fine-tuned from SD v1?
Hi, we adopt the UNet from Stable Diffusion V2 and the CLIP embeddings from Stable Diffusion Image Variants (fine-tuned from SD V1).
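For context, here is a minimal, hypothetical sketch of how such a hybrid could be assembled with diffusers/transformers. The Hugging Face repo IDs are assumptions for illustration, not this project's actual loading code; in particular, the 768-d CLIP image embeddings would likely need a learned projection to match the SD V2 UNet's 1024-d cross-attention features.

```python
# Hypothetical sketch, not the project's actual code: pair the Stable Diffusion V2
# UNet with the CLIP image encoder shipped with the Image Variants model.
# The repo IDs below are assumptions for illustration.
import torch
from diffusers import UNet2DConditionModel
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

device = "cuda" if torch.cuda.is_available() else "cpu"

# Denoising UNet from Stable Diffusion V2.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2", subfolder="unet"
).to(device)

# CLIP image encoder + preprocessor from the Image Variants model
# (itself fine-tuned from SD V1).
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "lambdalabs/sd-image-variations-diffusers", subfolder="image_encoder"
).to(device)
feature_extractor = CLIPImageProcessor.from_pretrained(
    "lambdalabs/sd-image-variations-diffusers", subfolder="feature_extractor"
)

# Note: image_encoder(...).image_embeds is 768-d, while the SD V2 UNet's
# cross-attention expects 1024-d features, so a (learned) projection layer
# would be needed before conditioning the UNet on these embeddings.
```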
Here is the story behind it: At first, we applied classifier-free guidance to Stable Diffusion V2 with extra CLIP embeddings. We empirically found that classifier-free guidance enhances geometric performance on some weird in-the-wild images, although it reduces accuracy on the test-set benchmarks and increases inference time. To increase geometric accuracy and improve stability, we did not use classifier-free guidance in the end. However, we found that retaining the CLIP embeddings still improves the visual geometry on in-the-wild images.
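For readers unfamiliar with the mechanism, below is a generic sketch of the two-pass classifier-free-guidance step described above, written against a diffusers-style UNet. It is not this repository's inference code; the zero null embedding and the names `clip_image_embeds` and `guidance_scale` are illustrative assumptions. Setting `guidance_scale = 1.0` collapses to a single conditional prediction, which corresponds to the released no-CFG setting.

```python
# Hypothetical CFG step for a diffusion-based depth/normal estimator conditioned
# on CLIP image embeddings. `unet` is a diffusers UNet2DConditionModel; all names
# here are illustrative, not the repository's actual API.
import torch

@torch.no_grad()
def cfg_noise_pred(unet, latents, t, clip_image_embeds, guidance_scale=3.0):
    # clip_image_embeds: (B, 1, D) CLIP image embeddings used as cross-attention tokens.
    # Null condition: a zero embedding stands in for "unconditional"
    # (some implementations learn this embedding instead).
    null_embeds = torch.zeros_like(clip_image_embeds)

    # One batched UNet call: [unconditional, conditional].
    latent_in = torch.cat([latents, latents], dim=0)
    embeds_in = torch.cat([null_embeds, clip_image_embeds], dim=0)
    noise_pred = unet(latent_in, t, encoder_hidden_states=embeds_in).sample

    noise_uncond, noise_cond = noise_pred.chunk(2, dim=0)
    # CFG: push the prediction away from the unconditional branch.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

The cost of this scheme is the doubled batch through the UNet at every denoising step, which is the extra inference time mentioned above.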
Here are our thoughts for the future: Nowadays, AI-generated images are prevalent in computer vision tasks. Since ground-truth attributes for them (especially the weird cases) are unavailable, we may only need the depth & normal maps that best describe their details and spatial structure, and it is OK to sacrifice some accuracy. To this end, CFG helps. However, since geometry estimation is a long-standing, important task, we still put accuracy first for now. But I think this paradigm will change in the near future with the emergence of more AIGC tasks.
Sorry for the typos in the paper; we will update it soon. BTW, we are also training this model on other base diffusion models, such as stable-diffusion-2-1-unclip. We will release updated checkpoints if we find stronger ones.
Thanks a lot for the reply!
Thank you! It would be great if you could share some results with classifier-free guidance in the future. I'm currently using depth estimation models for generative experiments (e.g., on AI-generated images), so I am curious about how it can improve performance! Even Marigold does not use classifier-free guidance at the moment.
Hi, here are examples of the comparison, where the model with CFG produces better visual results.
Hi, thanks for your great work.
You mentioned your model is fine-tuned from "pre-trained stable diffusion v2, which has been fine-tuned with image conditions." I was wondering whether you pre-trained it yourselves or used an open-sourced one, since SD v2 is originally a text-to-image model, if I understand correctly.
Looking forward to your reply.