cambrian-mllm / cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
https://cambrian-mllm.github.io/
Apache License 2.0

[Question about paper] The number of visual tokens #8

Closed: terry-for-github closed this issue 6 days ago

terry-for-github commented 6 days ago

Thanks for the great work! I have a question about the paper. I noticed the paper says that LLaVA-NeXT needs 2880 visual tokens, while Cambrian only needs 576. From what I understand, the number of visual tokens depends only on the number of patches.

Take CLIP-L-336/14 as an example: a 336x336-pixel image is first divided into 14x14 patches, each with 24x24 pixels. If a pixel has 3 channels (RGB), each patch is a 24x24x3 tensor. Assuming the size of each token is 768, each patch is then converted to a one-dimensional tensor of size 768 by a linear projection.

So my understanding is: the number of visual tokens == the number of patches. I want to know where I misunderstood. I would appreciate an explanation. Thank you!
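
For concreteness, here is a minimal sketch of this patch-to-token view (not Cambrian's or CLIP's actual code; the class name, the 768-dimensional embedding, and the parameterization are just illustrative assumptions following the description above):

```python
import torch
import torch.nn as nn

def num_visual_tokens(image_size: int, patch_size: int) -> int:
    """Number of visual tokens = number of non-overlapping patches."""
    return (image_size // patch_size) ** 2

class PatchEmbed(nn.Module):
    """Split an image into patches and linearly project each one to a token."""
    def __init__(self, image_size: int, patch_size: int, embed_dim: int):
        super().__init__()
        self.num_tokens = num_visual_tokens(image_size, patch_size)
        # A Conv2d with kernel == stride == patch_size is equivalent to
        # flattening each patch and applying a shared linear projection.
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) -> (B, embed_dim, H/p, W/p) -> (B, num_tokens, embed_dim)
        return self.proj(x).flatten(2).transpose(1, 2)
```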

terry-for-github commented 6 days ago

Oh, I understand now! The "14" in "CLIP-L-336/14" means each patch is 14x14 pixels, not 24x24. So my idea was correct but the arithmetic was wrong: a 336x336 image gives (336/14) x (336/14) = 24x24 = 576 patches, which matches Cambrian's 576 visual tokens.
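
Plugging the corrected numbers into the sketch above (still using the 768-dimensional embedding assumed earlier purely for illustration):

```python
# 336x336 input, 14x14-pixel patches -> (336 / 14)^2 = 24 * 24 = 576 tokens
print(num_visual_tokens(image_size=336, patch_size=14))  # 576

tokens = PatchEmbed(image_size=336, patch_size=14, embed_dim=768)(torch.randn(1, 3, 336, 336))
print(tokens.shape)  # torch.Size([1, 576, 768])
```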