UKPLab / sentence-transformers

State-of-the-Art Text Embeddings

https://www.sbert.net

Apache License 2.0

14.82k stars, 2.43k forks

Clip fine tuning #2866

Open capricixhk opened 1 month ago

capricixhk commented 1 month ago

I am trying to fine-tune the CLIP model (clip-ViT-B-32-multilingual-v1). Is there an example of training it with some layers frozen? Also, can I train only the text encoder without modifying the image encoder? Thanks!

ir2718 commented 1 month ago

Hi,

here's an example of freezing the image encoder:

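from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")
# Freeze every parameter of the CLIP image (vision) encoder
for p in model[0].model.vision_model.parameters():
    p.requires_grad = False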

With the image encoder frozen, training updates only the text encoder, which will probably yield lower test set scores than training both encoders.

Similarly, here's an example of freezing the first 4 layers of the text encoder:

for p in model[0].model.text_model.encoder.layers[0:4].parameters():
    p.requires_grad = False

Hope this helps.
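
To check that a freeze took effect before training starts, it can help to count trainable parameters per submodule. The helper below is a minimal sketch and not part of sentence-transformers: the name `trainable_by_submodule` is hypothetical, and it only assumes the object exposes `named_parameters()` the way a `torch.nn.Module` does.

```python
from collections import defaultdict


def trainable_by_submodule(module, depth=1):
    """Return {name prefix: (trainable, total)} parameter counts.

    Hypothetical helper: `module` is assumed to behave like a torch.nn.Module,
    i.e. named_parameters() yields (name, param) pairs whose params expose
    numel() and requires_grad.
    """
    counts = defaultdict(lambda: [0, 0])
    for name, param in module.named_parameters():
        # Group by the first `depth` components of the dotted parameter name,
        # e.g. "vision_model.encoder.layers.0..." -> "vision_model".
        prefix = ".".join(name.split(".")[:depth])
        n = param.numel()
        counts[prefix][1] += n
        if param.requires_grad:
            counts[prefix][0] += n
    return {prefix: tuple(pair) for prefix, pair in counts.items()}
```

Calling `trainable_by_submodule(model[0].model)` on the model from the snippets above should show a trainable count of 0 for anything that was frozen. Groups that sit outside the frozen submodule, such as projection layers, will still appear as trainable unless they are frozen separately.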

km5ar commented 3 days ago

Is anyone using CLIP to search a large batch of PDF documents (legal docs)? Is it a good fit for this use case?

ir2718 commented 3 days ago

@km5ar

Hi,

I'm not sure how many images of text or documents are in the datasets used to train CLIP, but I don't think there are many. My best bet would be to try something like Nougat/Donut + ColBERT or a sentence transformer, with paragraph chunking.
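
The paragraph chunking step mentioned above could look roughly like this. It is a minimal sketch under stated assumptions: the PDF text has already been extracted to a plain string (for example by an OCR model), paragraphs are separated by blank lines, and the function name `chunk_paragraphs` and the `max_words` limit are illustrative choices, not taken from any of the tools named.

```python
import re


def chunk_paragraphs(text, max_words=200):
    """Pack whole paragraphs into chunks of at most `max_words` words.

    Paragraphs are never split, so a single paragraph longer than
    `max_words` becomes its own oversized chunk.
    """
    # Split on blank lines and normalise whitespace inside each paragraph.
    paragraphs = [
        " ".join(p.split()) for p in re.split(r"\n\s*\n", text) if p.strip()
    ]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk can then be embedded with a text model, for instance through `SentenceTransformer.encode`, and searched by similarity. The right chunk size depends on the model's input limit and on how long the paragraphs in the documents tend to be.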