Version using CLIP from transformers

Thank you for your great work!

I recently switched from OpenCLIP to transformers.CLIPTextModel because I needed some functions that are only available in the transformers implementation of CLIP. Everything still runs, but I've noticed a slight drop in the quality of the results, even though I'm using the same ViT backbone. Have you trained any models with the transformers version of CLIP, and do you know what might cause this performance gap?
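To narrow this down, I could compare the per-token text embeddings that the two encoders produce for the same prompt. Below is a minimal sketch of such a check, not something taken from this repo: `cosine` and `per_token_cosine` are hypothetical helper names, and the inputs are assumed to be plain nested lists of floats (for example, the `last_hidden_state` of `CLIPTextModel` for one prompt converted with `.tolist()`, and the matching per-token output from the OpenCLIP text encoder). It only uses the standard library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Treat an all-zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_u * norm_v)


def per_token_cosine(emb_a, emb_b):
    """Compare two [num_tokens][dim] embedding matrices token by token.

    emb_a and emb_b are assumed to come from the two text encoders
    for the same tokenized prompt, so they must have the same length.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("token counts differ: %d vs %d" % (len(emb_a), len(emb_b)))
    return [cosine(a, b) for a, b in zip(emb_a, emb_b)]
```

If the similarities are close to 1 for every token, the encoders are probably equivalent and the gap lies elsewhere. Consistently lower values would suggest the two implementations are not returning the same representation, for example because a different hidden layer or a different normalization is used, though that is only a guess on my side.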

AILab-CVC/VideoCrafter, issue #93