Can this model perform cross-modal retrieval tasks?

X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family

https://www.modelscope.cn/studios/damo/mPLUG-Owl

MIT License

2.25k stars, 171 forks

Can this model perform cross-modal retrieval tasks? #74

Closed by ljwdust 1 year ago

ljwdust commented 1 year ago

Can this model extract image and text features, as CLIP does, and perform image-to-text or text-to-image retrieval? If so, how can these features be extracted? Thank you!

MAGAer13 commented 1 year ago

Yes, you can extract the output of the visual abstractor and the hidden states of the LLM and use them as embeddings. However, we have not tried this ourselves, so use this approach at your own risk.
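For illustration only, here is a minimal sketch of the retrieval step that would follow feature extraction. It does not call any mPLUG-Owl API: the token-level features are toy lists standing in for the visual abstractor output and the LLM hidden states, and the helper names (`mean_pool`, `l2_normalize`, `cosine_similarity`, `rank_candidates`) are hypothetical. It also assumes that both feature sets share the same dimensionality and that cosine similarity between them is meaningful, neither of which is confirmed in this thread, since the model was not trained with a CLIP-style contrastive objective.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_features: Sequence[Sequence[float]]) -> Vector:
    """Average a (num_tokens, dim) feature matrix into one vector."""
    if not token_features:
        raise ValueError("token_features must contain at least one token")
    num_tokens = len(token_features)
    dim = len(token_features[0])
    return [sum(tok[i] for tok in token_features) / num_tokens for i in range(dim)]


def l2_normalize(vec: Sequence[float]) -> Vector:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_candidates(query: Sequence[float], candidates: Sequence[Sequence[float]]) -> List[int]:
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins (assumption): in practice these would be the visual abstractor
# output for one image and the LLM hidden states for each candidate caption.
image_tokens = [[1.0, 0.0], [0.8, 0.2]]
caption_tokens = [
    [[0.0, 1.0], [0.2, 0.8]],  # caption 0
    [[1.0, 0.1], [0.9, 0.0]],  # caption 1
]

image_embedding = mean_pool(image_tokens)
caption_embeddings = [mean_pool(tokens) for tokens in caption_tokens]

# Image-to-text retrieval: rank captions by similarity to the image.
print(rank_candidates(image_embedding, caption_embeddings))  # prints [1, 0]
```

Mean pooling is only one possible choice here; taking the last token's hidden state is another common option. Retrieval quality with such untuned embeddings would need to be checked empirically.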