In most VLMs, I find that the image feature representations are fixed, i.e. computed independently of the text. Although cross-attention is mentioned in the Qwen-VL paper, in practice it appears to be just an adapter. Why don't these models use the text question as the query ("q") in cross-attention when extracting image features? There must be some unacceptable drawbacks to this approach.
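To make the question concrete, here is a minimal sketch (PyTorch, with made-up module names and dimensions) of what I mean: the question tokens act as queries attending over the image patch features, so the extracted visual representation would depend on the question. This is only an illustration of the idea, not how any particular model actually implements it.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration
d_model = 512        # shared embedding width
n_img_tokens = 256   # e.g. a 16x16 patch grid from a ViT encoder
n_txt_tokens = 32    # tokenized question length


class TextQueryCrossAttention(nn.Module):
    """Text tokens are the queries; image patch features are keys/values."""

    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_tokens, image_features):
        # query = text, key/value = image: visual information is pulled into
        # the text positions, conditioned on what the question asks about
        fused, _ = self.attn(query=text_tokens,
                             key=image_features,
                             value=image_features)
        return fused


# Toy usage with random tensors standing in for real embeddings
text_tokens = torch.randn(1, n_txt_tokens, d_model)      # question embeddings
image_features = torch.randn(1, n_img_tokens, d_model)   # frozen ViT patch features
fused = TextQueryCrossAttention(d_model)(text_tokens, image_features)
print(fused.shape)  # torch.Size([1, 32, 512])
```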
Can anyone explain the reason behind this?