Can this model extract image and text features, like CLIP, and perform image-to-text or text-to-image retrieval? If so, how can these features be extracted?
Thank you!
Yes, you can simply take the output of the visual abstractor and the LLM hidden features as the embeddings. However, we have not tried this ourselves, so use it at your own risk.
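For what it's worth, here is a rough sketch of how retrieval could work once you have those features. It does not call the model: the token features below are placeholder lists standing in for the visual abstractor output and the LLM hidden states. Mean pooling, cosine similarity, and all function names (`mean_pool`, `rank_candidates`, etc.) are my own assumptions, not something confirmed for this model, and it assumes both feature types have the same dimensionality.

```python
import math

def mean_pool(token_features):
    """Average per-token feature vectors into a single embedding (assumed pooling)."""
    n = len(token_features)
    dim = len(token_features[0])
    return [sum(tok[d] for tok in token_features) / n for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def rank_candidates(query_embedding, candidate_embeddings):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_embedding, c) for c in candidate_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Placeholder features (NOT real model output): one image, two captions.
image_tokens = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]    # stand-in for visual abstractor output
text_tokens_a = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]   # stand-in for LLM hidden states
text_tokens_b = [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]]

image_embedding = mean_pool(image_tokens)
text_embeddings = [mean_pool(text_tokens_a), mean_pool(text_tokens_b)]

# Image-to-text retrieval: rank captions by similarity to the image.
ranking = rank_candidates(image_embedding, text_embeddings)
print(ranking)
```

Text-to-image retrieval is the same thing with the roles swapped: pass a text embedding as the query and a list of image embeddings as the candidates. Since the model was not tested for this, whether the similarities are actually meaningful is something you would have to check on your own data.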