Thanks for the great work!
I have a question about the paper.
I noticed that the paper says that LLaVA-NeXT needs 2880 visual tokens, while Cambrian only needs 576.
From what I understand, the number of visual tokens depends only on the number of patches.
Take CLIP-L-336/14 as an example:
A 336x336 pixel image is first divided into a 24x24 grid of patches, each patch being 14x14 pixels.
Since a pixel has 3 channels (RGB), each patch is a 14x14x3 tensor.
Assuming a token embedding size of 1024 (CLIP ViT-L's hidden width), each patch is then mapped by a linear projection to a one-dimensional tensor of size 1024.
So what I understand is: the number of visual tokens == the number of patches.
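To make my reasoning concrete, here is a small sketch of the token-count arithmetic I have in mind (assuming a plain ViT patch embedding with CLIP ViT-L/14 numbers at 336px resolution; the variable names are my own):

```python
# Token-count arithmetic for a ViT-style patch embedding.
# Numbers assume CLIP ViT-L/14 at 336x336 input.
image_size = 336
patch_size = 14
channels = 3

patches_per_side = image_size // patch_size        # 336 / 14 = 24
num_patches = patches_per_side ** 2                # 24 * 24 = 576
values_per_patch = patch_size * patch_size * channels  # 14 * 14 * 3 = 588

# Each patch is flattened and linearly projected to one token,
# so under this reasoning: visual tokens == patches.
print(num_patches)  # 576
```

Under this calculation, any model using CLIP-L-336/14 on a single 336x336 image should produce 576 visual tokens, which is why the 2880 figure confuses me.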
I want to know where I misunderstood, and I would appreciate an explanation. Thank you!