Stability-AI / stablediffusion

High-Resolution Image Synthesis with Latent Diffusion Models
MIT License

Question about the paper: why does the text-prompt embedding use the penultimate layer of the CLIP ViT-H/14 text encoder? #336

Open LokiXun opened 11 months ago

LokiXun commented 11 months ago

Hi, I am wondering why the prompt embedding in Stable Diffusion is extracted from the penultimate layer of the CLIP ViT-H/14 text encoder. Why not use the final CLIP feature, the same way the image feature comes from the CLIP image encoder? The per-token text embeddings have a different shape from the pooled CLIP image feature, which seems to make it infeasible to simply replace the text embedding with a CLIP image embedding without modifying the model (is that true?). I am curious how to use an image as the condition for cross-attention without many changes to the model. Thanks
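To make the shape mismatch in the question concrete, here is a minimal sketch (toy encoder, hypothetical sizes, not the actual ViT-H/14 weights) of a CLIP-style text encoder that exposes every layer's hidden states. Stable Diffusion conditions cross-attention on the per-token hidden states of the penultimate layer, `[batch, 77, dim]`, whereas the pooled CLIP feature used for image-text similarity is a single vector per input, `[batch, dim]`:

```python
import torch
import torch.nn as nn

class ToyTextEncoder(nn.Module):
    """Toy stand-in for a CLIP-style text encoder that keeps every layer's
    hidden states, so the penultimate layer can be selected (hidden[-2])."""

    def __init__(self, vocab=1000, dim=64, layers=4, seq_len=77):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(seq_len, dim)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(layers)
        ])
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, ids):
        x = self.tok(ids) + self.pos(torch.arange(ids.shape[1]))
        hidden = [x]                      # hidden[0] = embedding output
        for blk in self.blocks:
            x = blk(x)
            hidden.append(x)              # one entry per transformer block
        pooled = self.final_norm(x)[:, -1]  # CLIP pools one token into a vector
        return hidden, pooled

enc = ToyTextEncoder()
ids = torch.randint(0, 1000, (1, 77))
hidden, pooled = enc(ids)
penultimate = hidden[-2]   # per-token features for cross-attention
print(penultimate.shape, pooled.shape)
# torch.Size([1, 77, 64]) torch.Size([1, 64])
```

The sequence of 77 per-token vectors is what the U-Net's cross-attention keys/values are built from, so a single pooled image vector cannot be dropped in without either broadcasting it to a sequence or adapting the conditioning path.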