Stability-AI / stablediffusion

High-Resolution Image Synthesis with Latent Diffusion Models
MIT License
38.33k stars 4.95k forks

The clip vision model that matches the clip text model in sd-2.1-base #330

Open WUyinwei-hah opened 11 months ago

WUyinwei-hah commented 11 months ago

Hi, thanks for your great work!

I am having trouble finding the open-source CLIP model checkpoint that matches the text encoder used in stable-diffusion-2-1-base. You mentioned that you used OpenCLIP-ViT/H as the text encoder. I tried CLIP-ViT-H-14-laion2B-s32B-b79K, provided on HuggingFace and in open_clip, but found that its output embedding for a text prompt does not match the output of the text encoder used in stable-diffusion-2-1-base. Could you tell me where the checkpoint of the OpenCLIP-ViT/H you used can be found?
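One possible source of the mismatch (an assumption on my part, not confirmed in this thread): SD 2.x takes its conditioning from the *penultimate* transformer layer of the OpenCLIP text encoder, while a stock CLIP text model returns the final-layer output, so the two embeddings will differ even with identical weights. The sketch below illustrates the difference using a tiny, randomly initialized `CLIPTextModel` from `transformers` (so it runs without downloading any checkpoint); for a real comparison you would instead load the SD text encoder, e.g. `CLIPTextModel.from_pretrained("stabilityai/stable-diffusion-2-1-base", subfolder="text_encoder")`.

```python
import torch
from transformers import CLIPTextConfig, CLIPTextModel

# Tiny randomly initialized config, purely to illustrate the mechanism.
# The real SD 2.1 text encoder is much larger (OpenCLIP ViT-H).
config = CLIPTextConfig(
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=4,
    num_attention_heads=4,
    vocab_size=1000,
    max_position_embeddings=77,
)
model = CLIPTextModel(config).eval()

# Dummy token ids standing in for a tokenized prompt.
input_ids = torch.randint(0, config.vocab_size, (1, 77))

with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# Final-layer output (what a stock CLIP text model returns) ...
final = out.last_hidden_state
# ... versus the penultimate layer's hidden states, which is what
# SD 2.x is commonly described as conditioning on.
penultimate = out.hidden_states[-2]

print(final.shape)        # (1, 77, hidden_size)
print(penultimate.shape)  # (1, 77, hidden_size)
```

Note that `hidden_states` has `num_hidden_layers + 1` entries (the embedding output plus one per layer), and `last_hidden_state` additionally has the final layer norm applied, so comparing `final` against `penultimate` shows two genuinely different tensors from the same forward pass.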

The same question has been raised on HuggingFace as well, but it has no official answer yet.

Thank you!