bfshi / scaling_on_scales

When do we not need larger vision models?

Num of tokens in LLaVA #4

Closed · RussRobin closed 4 months ago

RussRobin commented 4 months ago

Hi,

thank you for this great work!

In Table 1 of your paper, accuracy improvements are reported when adding S2 scaling to LLaVA. As shown in Figure 1, the channel dimension with S2 scaling is double that of the original feature without S2 scaling. Did you simply feed the 1536-channel feature into the projector in the MLLM (I'm not sure how you changed the projector; is the output channel number of your projector the same as LLaVA's?), or did you feed the 2 features (2×16×16×768) into the projector separately (I think this would make the number of tokens 2×576)?

Many thanks in advance.
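
In tensor-shape terms, the two alternatives being asked about look roughly like the following (a minimal sketch; the 576-token count and 768-channel width follow the numbers in the question and are illustrative only):

```python
import torch

# Shapes follow the numbers in the question and are illustrative only.
f1 = torch.randn(1, 576, 768)  # scale-1 features: 576 tokens, 768 channels
f2 = torch.randn(1, 576, 768)  # scale-2 features, pooled to the same grid

# Alternative 1: concatenate along channels -> same 576 tokens, dim 1536
channel_concat = torch.cat([f1, f2], dim=-1)  # shape (1, 576, 1536)

# Alternative 2: feed both features as separate tokens -> 2 x 576 tokens, dim 768
token_concat = torch.cat([f1, f2], dim=1)     # shape (1, 1152, 768)
```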

bfshi commented 4 months ago

Hi! Yes, we simply feed the 1536-dim features into the projector. We change the input dimension of the projector from 768 to 1536.

bfshi commented 4 months ago

The output dimension of the projector is the same.
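
Concretely, the change described above amounts to something like this (a minimal sketch, not the authors' code; the LLaVA-style two-layer MLP projector and the 4096 LLM hidden size are assumptions, and only the 768 → 1536 input change is confirmed above):

```python
import torch
import torch.nn as nn

vision_dim_s2 = 1536  # two scales x 768 channels, concatenated (confirmed above)
llm_dim = 4096        # assumption: LLM hidden size (e.g. a 7B model); unchanged by S2

# Assumption: a LLaVA-1.5-style two-layer MLP projector; only the input
# dimension changes (768 -> 1536), the output dimension stays the same.
projector = nn.Sequential(
    nn.Linear(vision_dim_s2, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

# Token count is unchanged: one token per spatial position, with the two
# scales concatenated along the channel axis.
features = torch.randn(1, 576, vision_dim_s2)
tokens = projector(features)
print(tokens.shape)  # torch.Size([1, 576, 4096])
```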

RussRobin commented 4 months ago

Thank you for your reply! My confusion has been addressed, so I'll close the issue.