Stability-AI / stablediffusion

High-Resolution Image Synthesis with Latent Diffusion Models
MIT License

Question about the paper: why does the text-prompt embedding use the penultimate layer of the CLIP ViT-H/14 text encoder? #336

Open LokiXun opened 11 months ago

LokiXun commented 11 months ago

Hi, I am wondering why the prompt embedding in Stable Diffusion is extracted from the penultimate layer of the CLIP ViT-H/14 text encoder. Why not use the final CLIP feature, the same way the image feature comes from the CLIP image encoder? The per-token text embeddings have a different shape from the pooled CLIP image feature, which seems to make it infeasible to simply replace the text embedding with a CLIP image embedding without modifying the model (is that true?). I am curious how to use an image as the condition for cross-attention without many changes to the model. Thanks
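To make the shape mismatch in the question concrete, here is a minimal sketch (toy encoder, hypothetical sizes, not the actual ViT-H/14 weights) of a CLIP-style text encoder that exposes every layer's hidden states. Stable Diffusion conditions cross-attention on the per-token hidden states of the penultimate layer, `[batch, 77, dim]`, whereas the pooled CLIP feature used for image-text similarity is a single vector per input, `[batch, dim]`:

```python
import torch
import torch.nn as nn

class ToyTextEncoder(nn.Module):
    """Toy stand-in for a CLIP-style text encoder that keeps every layer's
    hidden states, so the penultimate layer can be selected (hidden[-2])."""

    def __init__(self, vocab=1000, dim=64, layers=4, seq_len=77):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(seq_len, dim)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(layers)
        ])
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, ids):
        x = self.tok(ids) + self.pos(torch.arange(ids.shape[1]))
        hidden = [x]                      # hidden[0] = embedding output
        for blk in self.blocks:
            x = blk(x)
            hidden.append(x)              # one entry per transformer block
        pooled = self.final_norm(x)[:, -1]  # CLIP pools one token into a vector
        return hidden, pooled

enc = ToyTextEncoder()
ids = torch.randint(0, 1000, (1, 77))
hidden, pooled = enc(ids)
penultimate = hidden[-2]   # per-token features for cross-attention
print(penultimate.shape, pooled.shape)
# torch.Size([1, 77, 64]) torch.Size([1, 64])
```

The sequence of 77 per-token vectors is what the U-Net's cross-attention keys/values are built from, so a single pooled image vector cannot be dropped in without either broadcasting it to a sequence or adapting the conditioning path.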