PKU-YuanGroup / MoE-LLaVA

Mixture-of-Experts for Large Vision-Language Models
https://arxiv.org/abs/2401.15947
Apache License 2.0
1.9k stars, 121 forks

[Question] Image patch representation in this work #43

Closed: cydiachen closed this issue 6 months ago

cydiachen commented 6 months ago

Question

Hello. First, thank you for your help in debugging the Qwen1.5 problem. I have achieved remarkable performance with Qwen1.5. I am now working on integrating your codebase with LLaVA-Next (aiming to add high-resolution support), and I have a question about how image patches are represented in your code.

As shown in the official LLaVA repo, the image feature maps are flattened explicitly. In your implementation, however, I could not find any operation that flattens the image features. I am curious how the image features are organized in your work. (Two images were attached.)

LinB203 commented 6 months ago

The output of the CLIP image encoder is already a flat sequence of tokens, so no separate flatten step is needed.
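To make the "already flat" point concrete, here is a minimal dependency-free sketch of how a ViT-style encoder's patch grid maps to a 1D token sequence. It is not code from MoE-LLaVA or LLaVA: the helper names (`vit_token_count`, `flatten_patch_grid`, `unflatten_tokens`) are hypothetical, and it assumes square images, square patches, row-major patch order, and an optional class token. The sizes used in the usage note are illustrative only.

```python
def vit_token_count(image_size: int, patch_size: int, with_cls: bool = True) -> int:
    """Number of tokens a ViT-style encoder emits for a square image.

    Assumption: the image is split into non-overlapping square patches,
    plus one optional class token.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size  # patches per side
    return grid * grid + (1 if with_cls else 0)


def flatten_patch_grid(grid):
    """Row-major flatten of a 2D grid of patch features into a 1D token list."""
    return [feat for row in grid for feat in row]


def unflatten_tokens(tokens, grid_size: int):
    """Inverse of flatten_patch_grid: recover the grid_size x grid_size layout.

    Assumption: any class token has already been dropped from `tokens`.
    """
    if len(tokens) != grid_size * grid_size:
        raise ValueError("token count does not match grid_size squared")
    return [tokens[r * grid_size:(r + 1) * grid_size] for r in range(grid_size)]
```

For example, `vit_token_count(336, 14)` gives 24 * 24 + 1 = 577 tokens. If a high-resolution pipeline needs the 2D layout back, something like `unflatten_tokens` applied after removing the class token would recover it, provided the row-major assumption holds for the encoder in use.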