FreddeFrallan / Multilingual-CLIP

OpenAI CLIP text encoders for multiple languages!

Training a model for ViT-L/14 image embeddings #10

Closed: rom1504 closed this issue 2 years ago

rom1504 commented 2 years ago

Hey, thanks for providing this awesome multilingual CLIP-aligned text encoder. We used it to filter the 3 billion (image, text) pairs of laion5B (https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/) and it worked well. I'm also using this model to provide multilingual search in https://rom1504.github.io/clip-retrieval/.

For laion400m we used OpenAI's ViT-B/32 model to produce the index, but for laion5B we went with ViT-L/14, which is much more powerful. To provide the same multilingual search feature, it would be really helpful to have a multilingual text encoder aligned with CLIP ViT-L/14.

Would you advise running https://github.com/FreddeFrallan/Multilingual-CLIP#training-a-new-model to align such a text encoder? (And now that I'm writing this, I guess I could use a subset of the multilingual part of laion5B as the training data.)
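For context, that training procedure boils down to teacher distillation: fine-tune a multilingual text encoder so that its embedding of a (translated) caption matches the frozen CLIP text encoder's embedding of the original English caption. Here is a minimal sketch of that setup, assuming PyTorch, Hugging Face transformers, and OpenAI's `clip` package; the student model, pooling, and projection head are illustrative choices, not the repo's exact code:

```python
# Sketch of teacher distillation for a multilingual ViT-L/14 text encoder.
import torch
import torch.nn as nn
import clip
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen teacher: the CLIP ViT-L/14 text encoder (768-dim text embeddings).
teacher, _ = clip.load("ViT-L/14", device=device)
teacher.eval()

# Student: any multilingual transformer (XLM-RoBERTa here as an assumption),
# plus a linear head projecting its pooled output into CLIP's embedding space.
student_name = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModel.from_pretrained(student_name).to(device)
proj = nn.Linear(student.config.hidden_size, 768).to(device)

optimizer = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-5)
mse = nn.MSELoss()

def train_step(english_texts, translated_texts):
    # Target: the frozen CLIP embedding of the original English caption.
    with torch.no_grad():
        target = teacher.encode_text(
            clip.tokenize(english_texts, truncate=True).to(device)
        ).float()

    # Student embeds the machine-translated caption (mean-pooled hidden states).
    batch = tokenizer(translated_texts, padding=True, truncation=True, return_tensors="pt").to(device)
    hidden = student(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)
    pred = proj(pooled)

    # Regress the student onto the teacher's embedding.
    loss = mse(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```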

FreddeFrallan commented 2 years ago

Hi there, I'm happy that you found a good use case for these models. A multilingual ViT-L/14 sounds very interesting to me, and I'm fond of the idea of making large-scale models available to people.

My main advice for creating a good multilingual encoder would be to increase the number of translated data points. For example, for the Swedish CLIP encoder there is a measurable difference between training on 500K and 2M translated samples. (A short M-CLIP paper has been accepted but not yet released; I could share it with you if you want more details.) So my advice would be to machine translate as many texts from your collected dataset as possible.
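As a rough sketch of what that bulk translation could look like, using a Hugging Face MarianMT pipeline with an English-to-Swedish model as a stand-in (the actual translation system, language mix, and batch size are up to you):

```python
from transformers import pipeline

# English->Swedish is just an example; swap in whatever language pairs you need.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-sv")

english_captions = [
    "a photo of a dog playing in the snow",
    "two people riding bicycles along the beach",
]

# Each (english, translated) pair becomes one training example
# for the text-encoder alignment step above.
outputs = translator(english_captions, batch_size=32)
pairs = [(en, out["translation_text"]) for en, out in zip(english_captions, outputs)]
print(pairs)
```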

The code and models in this GitHub repo were created during a single weekend, so you can expect better results with more data and compute.