Hello! I am trying to use Qwen-VL to extract unimodal features for a given input image and an accompanying text query. How can that be achieved? I am aware that models like BLIP-2 expose a direct API (extract_features) for this, but how can the same be done with Qwen-VL?
Exactly what I was about to ask. How do we get encoder embeddings from Qwen2-VL for text-only, image-only, or combined image/text input, i.e. feature extraction?
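Since Qwen-VL has no `extract_features()` API like BLIP-2, one workaround with the Hugging Face `transformers` port of Qwen2-VL is to run a forward pass with `output_hidden_states=True` and pool the last hidden states yourself. A minimal sketch, assuming that port; the mean-pooling choice and the checkpoint name mentioned below are illustrative assumptions, not an official recipe:

```python
# Sketch: extracting features from Qwen2-VL via hidden states, as a
# substitute for BLIP-2's extract_features(). Mean pooling over the
# sequence is an assumption; any pooling strategy could be used.
import torch


def pool_last_hidden(model, inputs):
    """Forward pass, then mean-pool the final layer's hidden states.

    `inputs` is the dict produced by the model's processor (input_ids,
    attention_mask, and pixel_values when an image is present).
    Returns a (batch, hidden_dim) tensor.
    """
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    last = out.hidden_states[-1]  # (batch, seq_len, hidden_dim)
    return last.mean(dim=1)       # (batch, hidden_dim)
```

Illustrative usage (downloads weights; the checkpoint name is an example): load `Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")` and its `AutoProcessor`, build `inputs = processor(text=[...], images=[...], return_tensors="pt")`, and call `pool_last_hidden(model, inputs)`. Passing text alone gives text-conditioned features; including an image makes `pixel_values` flow through the vision tower, so the pooled hidden states cover the fused image/text sequence. For a purely unimodal image embedding, one could instead call the model's vision tower directly on `pixel_values` and pool its output.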