Hi, I am wondering why the prompt embedding in Stable Diffusion is extracted from the penultimate layer of the CLIP ViT-H/14 text encoder.
Why not use the final-layer CLIP feature, the same way the image feature is taken from the CLIP image encoder?
The shapes seem to mismatch the CLIP image feature, which makes it infeasible to simply replace the text embedding with the CLIP image embedding without modifying the model (is that true?).
I am curious how to use an image as the condition for cross-attention without many changes to the model. Thanks
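To make the mismatch I mean concrete, here is a minimal sketch (using placeholder NumPy arrays, assuming SD 2.x with CLIP ViT-H/14, hidden size 1024): the text encoder yields per-token hidden states of shape (batch, 77, 1024), while the image encoder's pooled feature is (batch, 1024), so one naive workaround would be to treat the image feature as a length-1 "sequence":

```python
import numpy as np

# Hypothetical shapes, assuming SD 2.x with CLIP ViT-H/14 (hidden size 1024).
text_emb = np.zeros((1, 77, 1024))   # per-token hidden states from the text encoder
image_emb = np.zeros((1, 1024))      # pooled global feature from the image encoder

# The UNet's cross-attention expects a (batch, seq_len, dim) context,
# so the pooled image feature cannot be dropped in directly.
assert text_emb.ndim == 3 and image_emb.ndim == 2

# Naive workaround: treat the image feature as a length-1 "sequence"
# so it at least matches the (batch, seq_len, dim) layout.
image_context = image_emb[:, None, :]  # shape becomes (1, 1, 1024)
print(image_context.shape)
```

Is something like this reshaping what would be needed, or does the model still require retraining/projection on top of it?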