JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

CLIP for image and text embeddings #14311

Open txhno opened 1 month ago

txhno commented 1 month ago

Link to the documentation pages (if available)

https://github.com/patrickjohncyh/fashion-clip https://huggingface.co/patrickjohncyh/fashion-clip

How could the documentation be improved?

It's a fine-tune of CLIP trained on a ~500K fashion dataset. I would like to use it through the Hugging Face API, or alternatively through their wrapper package. If it can be done, please let me know how. Thanks! :)

maziyarpanahi commented 1 month ago

Interesting, it seems to use the CLIPModel architecture. @DevinTDHa, can we do that currently, or should we put it on the roadmap?

DevinTDHa commented 4 weeks ago

Seems like it should work, if the underlying model doesn't have any architectural changes.

I'll try it out and report back!

DevinTDHa commented 4 weeks ago

The model works without problems in Spark NLP! Just follow this notebook to import the model properly:

https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_CLIP.ipynb

If you change the model name to patrickjohncyh/fashion-clip it should work. Let me know if you have any other questions.

txhno commented 4 weeks ago

@DevinTDHa @maziyarpanahi Thanks a lot! I will definitely try it out. I'll contact you if something pops up. :)

txhno commented 3 weeks ago

@DevinTDHa @maziyarpanahi I have a different question. Is it possible to use Spark NLP to compute CLIP embeddings directly, instead of just using the ZeroShotClassification? My use case is taking a folder of images, using Spark NLP to compute embeddings for all of the images in it, and storing them in a vector store for later retrieval or similarity-search tasks.

Could I do something like this?

```python
CLIP = (
    CLIPForZeroShotClassification.loadSavedModel(f"{EXPORT_PATH}", spark)
    .setInputCols("image_assembler")
    .setOutputCol("embedding")
)

image_assembler = (
    ImageAssembler()
    .setInputCol("image")
    .setOutputCol("image_assembler")
)
```

And could I do it without setting labels, i.e. without calling CLIP.setCandidateLabels?
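The similarity-search half of this use case is independent of Spark NLP: once per-image embeddings exist (however they are eventually produced), retrieval is a nearest-neighbour lookup by cosine similarity. A minimal pure-Python sketch, using made-up 4-dimensional vectors in place of real CLIP embeddings (all file names and numbers below are purely illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k=2):
    """Return the k most similar entry names, best match first."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# Toy vectors standing in for CLIP image embeddings of a fashion catalogue
index = {
    "red_dress.jpg":  [0.9, 0.1, 0.0, 0.1],
    "blue_jeans.jpg": [0.1, 0.9, 0.1, 0.0],
    "red_skirt.jpg":  [0.8, 0.2, 0.1, 0.1],
}

# Toy vector standing in for the CLIP text embedding of a query like "red dress"
query = [1.0, 0.0, 0.0, 0.1]

print(top_k(query, index))  # most similar images first
```

In practice the brute-force scan above would be replaced by a vector store (FAISS, Milvus, pgvector, etc.), but the ranking principle is the same.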

DevinTDHa commented 3 weeks ago

Hi @txhno,

Sadly, this is currently not possible; we would have to create it as a new feature. I don't think it would take much time, as most of it is already implemented. @maziyarpanahi, perhaps we could fit this into one of the next releases?

maziyarpanahi commented 3 weeks ago

Thanks @txhno and @DevinTDHa. In fact, the idea was always to continue with CLIP and add one annotator to convert images into embeddings and another to convert text into embeddings.

We will add these to our roadmap, and I will change this into a feature-request ticket.