Closed: RussRobin closed this issue 4 months ago
Hi! Yes, we simply feed the 1536-dim features into the projector; we change the input dimension of the projector from 768 to 1536.
The output dimension of the projector is unchanged.
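For anyone reading along, here is a minimal sketch of what that change looks like, assuming a LLaVA-1.5-style two-layer MLP projector with 576 visual tokens and an LLM hidden size of 4096 (those two numbers, and all variable names, are illustrative assumptions, not taken from the repo):

```python
import torch
import torch.nn as nn

# Assumed dimensions (illustrative, not from the repo):
vision_dim_s2 = 1536   # channel dim doubled by S2 Scaling (was 768)
llm_hidden = 4096      # LLM hidden size, e.g. a 7B LLaMA

# Sketch of a LLaVA-style MLP projector; only the first Linear's
# input dimension changes (768 -> 1536), the output dim stays the same.
projector = nn.Sequential(
    nn.Linear(vision_dim_s2, llm_hidden),
    nn.GELU(),
    nn.Linear(llm_hidden, llm_hidden),
)

# 576 visual tokens, each with 1536 channels after S2 Scaling.
feats = torch.randn(1, 576, vision_dim_s2)
tokens = projector(feats)
print(tokens.shape)  # the token count (576) is unchanged
```

Because the two scales are concatenated along the channel axis rather than the token axis, the number of visual tokens fed to the LLM does not change.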
Thank you for your reply! My confusion has been fully addressed, and I'll close the issue.
Hi,
thank you for this great work!
In Table 1 of your paper, an accuracy improvement is reported when adding S2 Scaling to LLaVA. As shown in Figure 1, the channel dimension with S2 Scaling is double that of the original feature without S2 Scaling. Did you simply feed the 1536-channel feature into the projector of the MLLM (I'm not sure how you changed the projector; is the output channel number of your projector the same as in LLaVA?), or did you feed the two features (2*16*16*768) into the projector separately (I think that would make the number of tokens 2*576)?
Many thanks in advance.