Closed: XYxiyang closed this issue 4 weeks ago
In other words, can I simply get embeddings for images and texts separately that share high similarities?
same question here
Same question here. Or, how can I extract combined features?
According to the authors, versions later than InternVL 1 are trained solely with next-token prediction, so they no longer support embedding-based retrieval. Thanks a lot.
Thank you for your elegant work! I am wondering whether InternVL2 has the same capability as InternVL-C in the previous versions, which supported cross-modal feature retrieval, or how I can obtain aligned embeddings for image-text pairs. I have tried extracting features by calling `model.extract_feature()` for images and `model.language_model.get_input_embeddings()` for texts, but the resulting embeddings show very low similarity. Thanks again for your precious time!
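For context on why the similarities look low: contrastively trained retrieval models such as InternVL-C compare L2-normalized image and text embeddings by cosine similarity, whereas `model.extract_feature()` and `model.language_model.get_input_embeddings()` in InternVL2 were never trained to live in a shared space. The comparison itself is just a normalized dot product. Below is a minimal sketch of that computation using toy NumPy vectors standing in for real model outputs (the vectors and shapes here are illustrative, not actual InternVL features):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    # L2-normalize each embedding so the dot product equals cosine similarity
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Toy stand-ins for image and text embeddings (2-D for readability)
image_embeds = np.array([[1.0, 0.0],
                         [0.0, 1.0]])
text_embeds = np.array([[1.0, 0.0]])

sim = cosine_similarity(image_embeds, text_embeds)  # shape (2, 1)
```

With features from a contrastively aligned model (e.g. InternVL-C), matching pairs score near the top of each column of `sim`; with InternVL2's generation-oriented features, the scores carry no such meaning, which matches what you observed.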