VITA-MLLM / VITA

✨✨VITA: Towards Open-Source Interactive Omni Multimodal LLM

Discrepancy in the Number of Tokens Output by InternViT-300M-448px #16

Closed · rotem154154 closed this issue 3 months ago

rotem154154 commented 3 months ago

Hello,

I've been working with the InternViT-300M-448px model as described in your VITA paper. The paper states that the visual encoder produces 256 tokens after the visual connector, a simple two-layer MLP. However, when I run InternViT-300M-448px, I get 1025 output tokens: the pooler token concatenated with 1024 patch tokens, i.e. a 32x32 grid of 14x14 patches (448/14 = 32).
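For reference, here is roughly how I am running the model, following the usage example on the HuggingFace model card (the blank image is just a placeholder input):

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Public HuggingFace checkpoint; trust_remote_code is needed because
# InternViT ships custom modeling code.
model = AutoModel.from_pretrained(
    "OpenGVLab/InternViT-300M-448px",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()
processor = CLIPImageProcessor.from_pretrained("OpenGVLab/InternViT-300M-448px")

image = Image.new("RGB", (448, 448))  # placeholder 448x448 input
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16)

with torch.no_grad():
    outputs = model(pixel_values)

# Prints a sequence length of 1025:
# 1 pooler/class token + (448 / 14)^2 = 1024 patch tokens.
print(outputs.last_hidden_state.shape)
```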

Could you please clarify how the number of tokens is reduced to 256 in your setup? Specifically, how is the visual connector configured to achieve this reduction? There seems to be a discrepancy between the token count I observe and the one reported in the paper.

Thank you for your assistance!

linhaojia13 commented 3 months ago

Following InternVL, we discard the class token and apply pixel shuffle, which merges every 2x2 block of spatially adjacent tokens into a single token along the channel dimension. This yields a final token count of (1025 - 1)/4 = 256.
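As a shape-level sketch of that reduction (the merge function and the connector dimensions below are illustrative, not the exact VITA code):

```python
import torch
import torch.nn as nn

def merge_2x2_tokens(x: torch.Tensor) -> torch.Tensor:
    """Pixel-shuffle-style reduction: fold each 2x2 block of spatially
    adjacent tokens into the channel dimension. [B, N, C] -> [B, N/4, 4C]."""
    b, n, c = x.shape
    h = w = int(n ** 0.5)                          # 32 for a 448px input
    x = x.reshape(b, h // 2, 2, w // 2, 2, c)      # split each axis into 2x2 blocks
    x = x.permute(0, 1, 3, 2, 4, 5)                # [B, H/2, W/2, 2, 2, C]
    return x.reshape(b, (h // 2) * (w // 2), 4 * c)

vit_tokens = torch.randn(1, 1025, 1024)            # [B, 1 + 32*32, C] as you observed
patch_tokens = vit_tokens[:, 1:, :]                # discard the class/pooler token
merged = merge_2x2_tokens(patch_tokens)            # [1, 256, 4096]

# Two-layer MLP connector; 4096 is a hypothetical LLM hidden size.
connector = nn.Sequential(nn.Linear(4 * 1024, 4096), nn.GELU(), nn.Linear(4096, 4096))
print(connector(merged).shape)                     # torch.Size([1, 256, 4096])
```

Halving each spatial axis while quadrupling the channel dimension preserves the information; the two-layer MLP then projects the merged 4C-dimensional tokens into the LLM embedding space.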