Closed rotem154154 closed 3 months ago
Hello,
I've been working with the InternViT-300M-448px model as described in your VITA paper. The paper states that the visual encoder produces 256 tokens after a visual connector, which is a simple two-layer MLP. However, when I run InternViT-300M-448px, I get 1025 tokens as output: the pooler token concatenated with 1024 tokens corresponding to a 32x32 grid of 14x14 patches.
Could you please clarify how the token count was reduced to 256 in your setup? Specifically, I'm curious how the visual connector is configured to achieve this reduction, since there seems to be a discrepancy between the number of tokens I'm observing and what the paper reports.
Thank you for your assistance!
Following InternVL, we discard the class token and use pixel shuffle to merge every 2x2 group of spatially adjacent tokens into a single token along the channel dimension. This yields a final token count of (1025 − 1) / 4 = 256.
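For anyone hitting the same question: here is a minimal NumPy sketch of that reduction. It is illustrative only (the actual InternVL/VITA code uses PyTorch, and the two-layer MLP projector is applied after this step); the function name is made up for this example.

```python
import numpy as np

def pixel_shuffle_tokens(tokens, h, w, factor=2):
    """Merge each factor x factor block of spatial tokens into one token
    along the channel dimension (InternVL-style pixel shuffle).
    tokens: (N, C) array of N = h * w spatial tokens."""
    n, c = tokens.shape
    assert n == h * w, "token count must match the spatial grid"
    # (h, w, c) -> (h/f, f, w/f, f, c): expose the f x f neighbourhoods
    x = tokens.reshape(h // factor, factor, w // factor, factor, c)
    # bring the two block axes next to the channel axis, then flatten them into it
    x = x.transpose(0, 2, 1, 3, 4).reshape(h // factor, w // factor, factor * factor * c)
    return x.reshape(-1, factor * factor * c)

# toy check with the numbers from this thread:
vit_out = np.random.randn(1025, 1024)   # pooler/class token + 32x32 patch tokens
patch_tokens = vit_out[1:]              # discard the class token -> 1024 tokens
merged = pixel_shuffle_tokens(patch_tokens, 32, 32)
print(merged.shape)                     # (256, 4096)
```

So the token count drops 1024 → 256, while the channel width grows 4x; the MLP connector then projects these 256 tokens into the LLM's embedding space.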