Stability-AI / stablediffusion

High-Resolution Image Synthesis with Latent Diffusion Models
MIT License
38.33k stars 4.95k forks

The clip vision model that matches the clip text model in sd-2.1-base #330

Open WUyinwei-hah opened 11 months ago

WUyinwei-hah commented 11 months ago

Hi, thanks for your great work!

I am having trouble finding the open-source CLIP model checkpoint that matches the text encoder used in stable-diffusion-2-1-base. You mentioned that you used OpenCLIP-ViT/H as the text encoder. I tried CLIP-ViT-H-14-laion2B-s32B-b79K, provided on HuggingFace and in open_clip, but found that its output embedding for a text prompt does not match the output of the text encoder used in stable-diffusion-2-1-base. Could you tell me where the checkpoint of the OpenCLIP-ViT/H you used can be found?
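One possible source of the mismatch (an assumption on my part, not confirmed in this thread): SD 2.x takes its conditioning from the *penultimate* transformer layer of the OpenCLIP text encoder, while a stock CLIP text model returns the final-layer output, so the two embeddings will differ even with identical weights. The sketch below illustrates the difference using a tiny, randomly initialized `CLIPTextModel` from `transformers` (so it runs without downloading any checkpoint); for a real comparison you would instead load the SD text encoder, e.g. `CLIPTextModel.from_pretrained("stabilityai/stable-diffusion-2-1-base", subfolder="text_encoder")`.

```python
import torch
from transformers import CLIPTextConfig, CLIPTextModel

# Tiny randomly initialized config, purely to illustrate the mechanism.
# The real SD 2.1 text encoder is much larger (OpenCLIP ViT-H).
config = CLIPTextConfig(
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=4,
    num_attention_heads=4,
    vocab_size=1000,
    max_position_embeddings=77,
)
model = CLIPTextModel(config).eval()

# Dummy token ids standing in for a tokenized prompt.
input_ids = torch.randint(0, config.vocab_size, (1, 77))

with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# Final-layer output (what a stock CLIP text model returns) ...
final = out.last_hidden_state
# ... versus the penultimate layer's hidden states, which is what
# SD 2.x is commonly described as conditioning on.
penultimate = out.hidden_states[-2]

print(final.shape)        # (1, 77, hidden_size)
print(penultimate.shape)  # (1, 77, hidden_size)
```

Note that `hidden_states` has `num_hidden_layers + 1` entries (the embedding output plus one per layer), and `last_hidden_state` additionally has the final layer norm applied, so comparing `final` against `penultimate` shows two genuinely different tensors from the same forward pass.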

The same question has been raised on HuggingFace as well, but it has no official answer yet.

Thank you!