QwenLM / Qwen-VL

The official repo of Qwen-VL (通义千问-VL), the chat & pretrained large vision-language model proposed by Alibaba Cloud.

Discussion closed #411

Closed · MaHuanAAA closed this issue 4 weeks ago

MaHuanAAA commented 4 weeks ago

For most VLMs, I find that the image feature representations are fixed, i.e., extracted once and independently of the text prompt. Although cross-attention is mentioned in the Qwen-VL paper, in practice it appears to be just an adapter. Why don't these models use the text question as the query `q` to perform cross-attention for image feature extraction? I assume there must be some unacceptable drawbacks to this approach.

Can anyone explain the reason behind this?
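
To make the contrast concrete, here is a minimal sketch of the two options I mean, using standard PyTorch `nn.MultiheadAttention` with illustrative names and dimensions (not the actual Qwen-VL adapter code): an adapter with a fixed set of learned queries versus one that uses the text tokens themselves as the query.

```python
import torch
import torch.nn as nn

class LearnedQueryAdapter(nn.Module):
    """Variant (a): a fixed set of learned queries compresses the image
    features once, independently of the text prompt."""
    def __init__(self, dim=1024, num_queries=256, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats):              # (B, N_img, dim)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.attn(q, image_feats, image_feats)
        return out                               # (B, num_queries, dim)

class TextQueryAdapter(nn.Module):
    """Variant (b): the text tokens act as the query, so the extracted
    image representation changes with every question."""
    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_feats, image_feats):  # (B, N_txt, dim), (B, N_img, dim)
        out, _ = self.attn(text_feats, image_feats, image_feats)
        return out                               # (B, N_txt, dim)

# Variant (a) can be run once per image and reused across prompts;
# variant (b) must be recomputed for every new question.
img = torch.randn(2, 1024, 1024)   # 2 images, 1024 patch tokens
txt = torch.randn(2, 32, 1024)     # 2 prompts, 32 text tokens
print(LearnedQueryAdapter()(img).shape)        # torch.Size([2, 256, 1024])
print(TextQueryAdapter()(txt, img).shape)      # torch.Size([2, 32, 1024])
```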

KDD2018 commented 4 weeks ago

+1