Closed: XYxiyang closed this issue 4 weeks ago
In other words, can I simply get embeddings for images and texts separately that share high similarities?
same question here
Same question here. Or, how can I extract combined features?
According to the authors, versions later than InternVL 1 are trained solely with next-token prediction, so they no longer support embedding-based retrieval. Thanks a lot.
Thank you for your elegant work! I am wondering whether InternVL2 has the same capability as InternVL-C in the previous versions, which supported cross-modal feature retrieval, or how I can obtain aligned embeddings for image-text pairs. I have tried extracting features by calling `model.extract_feature()` for images and `model.language_model.get_input_embeddings()` for texts, but the resulting embeddings show very low similarity. Thanks again for your precious time!
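For context on why the similarities look low: contrastively trained retrieval models such as InternVL-C compare L2-normalized image and text embeddings by cosine similarity, whereas `model.extract_feature()` and `model.language_model.get_input_embeddings()` in InternVL2 were never trained to live in a shared space. The comparison itself is just a normalized dot product. Below is a minimal sketch of that computation using toy NumPy vectors standing in for real model outputs (the vectors and shapes here are illustrative, not actual InternVL features):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    # L2-normalize each embedding so the dot product equals cosine similarity
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Toy stand-ins for image and text embeddings (2-D for readability)
image_embeds = np.array([[1.0, 0.0],
                         [0.0, 1.0]])
text_embeds = np.array([[1.0, 0.0]])

sim = cosine_similarity(image_embeds, text_embeds)  # shape (2, 1)
```

With features from a contrastively aligned model (e.g. InternVL-C), matching pairs score near the top of each column of `sim`; with InternVL2's generation-oriented features, the scores carry no such meaning, which matches what you observed.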