TIGER-AI-Lab / VLM2Vec

This repo contains the code and data for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks"
https://tiger-ai-lab.github.io/VLM2Vec/
Apache License 2.0

Which Layer's Output Is Used for Contrastive Training #4

Closed 2 weeks ago by VincentVanNF

VincentVanNF commented 2 weeks ago

Hello, as mentioned in README.md, "The basic idea is to add an [EOS] token at the end of the sequence", while the paper says it takes the last-layer vector representation of the last token. I'm confused about which layer's output is used: the last layer of the LLM's encoder, or the last layer of the whole LLM, i.e. the last decoder layer? In the figure, it looks like the [EOS] token vector is taken from the LLM's last decoder layer. Thanks for your answer~

wenhuchen commented 2 weeks ago

Sorry for the confusion. The README is outdated: it should be the representation of the last token rather than [EOS], taken from the last layer of the decoder. Phi-3 is a decoder-only model, so there is no encoder at all.
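For concreteness, here is a minimal sketch (not the repo's actual code) of pooling the last decoder layer's hidden state at the last token with Hugging Face Transformers. The text-only Phi-3 checkpoint is just an illustrative stand-in for the Phi-3-V backbone:

```python
# Hedged sketch: last-token, last-layer pooling from a decoder-only model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "microsoft/Phi-3-mini-4k-instruct"  # illustrative stand-in for Phi-3-V
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, output_hidden_states=True, trust_remote_code=True
)

# Right-pad so the last non-padding token is easy to locate.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

batch = tokenizer(["a photo of a dog", "a photo of a cat"],
                  return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**batch)

# hidden_states[-1] is the final decoder layer: (batch, seq_len, hidden_dim)
last_layer = outputs.hidden_states[-1]

# Index of the last non-padding token in each sequence
last_token_idx = batch["attention_mask"].sum(dim=1) - 1
embeddings = last_layer[torch.arange(last_layer.size(0)), last_token_idx]

# Typically L2-normalized before a contrastive (InfoNCE-style) loss
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
```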

VincentVanNF commented 2 weeks ago

> Sorry for the confusion. The README is outdated: it should be the representation of the last token rather than [EOS], taken from the last layer of the decoder. Phi-3 is a decoder-only model, so there is no encoder at all.

Thank you very much for your answer. I have another question: have you run comparative experiments on fine-tuning large language models directly with SFT versus using embeddings for different downstream tasks, especially classification? If so, what were the results?

wenhuchen commented 2 weeks ago

Hi there, an LLM cannot do retrieval tasks. Yes, classification tasks are doable, but the accuracy is relatively lower than with an embedding model.
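For clarity, a hedged sketch of the generative-classification setup being compared here: prompt the model with the candidate labels and parse the generated label text. The model name, prompt format, and labels are illustrative, not from the paper:

```python
# Hedged sketch: direct generative classification with a decoder-only model.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "microsoft/Phi-3-mini-4k-instruct"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

labels = ["cat", "dog", "bird"]
prompt = ("Classify the following caption into one of: "
          + ", ".join(labels)
          + ".\nCaption: a small terrier chasing a ball\nLabel:")

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=5, do_sample=False)

# Decode only the newly generated tokens and treat them as the predicted label.
prediction = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True).strip()
print(prediction)
```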

VincentVanNF commented 2 weeks ago

> the accuracy is relatively lower than with an embedding model.

Sorry for the misunderstanding. Assuming we focus only on classification tasks, the conclusion would be: "the accuracy of an LLM doing direct generative classification is relatively lower than that of the embedding model (VLM2Vec)." Has this conclusion been validated within the VLM2Vec framework? I am currently exploring the classification capabilities of LLM fine-tuning in a specific scenario; if the above conclusion has been validated, I think I should switch from the generative method to VLM2Vec.

wenhuchen commented 2 weeks ago

Our experience is that when the classification label space is huge (e.g. > 100 labels), VLM2Vec is much better than LLMs. We believe the dot product is still the way to go for classification.
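To illustrate the dot-product approach over a large label space, here is a hedged sketch: each class label is embedded once, and a query is assigned to the label with the highest similarity score. The `classify` helper and the random tensors are placeholders, not an actual API of this repo; inputs are assumed L2-normalized so the dot product equals cosine similarity.

```python
# Hedged sketch: embedding-based classification via dot products.
import torch

def classify(query_embeddings: torch.Tensor,
             label_embeddings: torch.Tensor) -> torch.Tensor:
    """Return, for each query, the index of the label with the highest dot product.

    query_embeddings: (batch, dim); label_embeddings: (num_labels, dim).
    """
    scores = query_embeddings @ label_embeddings.T  # (batch, num_labels)
    return scores.argmax(dim=-1)

# Example with random, normalized embeddings (512-dim, 1000 labels)
queries = torch.nn.functional.normalize(torch.randn(4, 512), dim=-1)
labels = torch.nn.functional.normalize(torch.randn(1000, 512), dim=-1)
print(classify(queries, labels))  # 4 predicted label indices
```

Because the label embeddings are computed once and reused, inference cost stays flat as the label set grows, which is one reason this scales better than generating label text with an LLM.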

VincentVanNF commented 2 weeks ago

Okay, I now have a general understanding of the situation and will verify this on my task. Thank you very much for your interpretation and responses; they have been very helpful.