Thanks for the great work!
I have a question about the paper.
I noticed that the paper says that LLaVA-NeXT needs 2880 visual tokens, while Cambrian only needs 576.
From what I understand, the number of visual tokens depends only on the number of patches.
Take CLIP-L-336/14 as an example:
A 336x336 pixel image is first divided into a 24x24 grid of patches, each patch being 14x14 pixels.
Since a pixel has 3 channels (RGB), each patch is a 14x14x3 tensor.
Assuming a token embedding size of 1024 (CLIP ViT-L's hidden width), each patch is then mapped by a linear projection to a one-dimensional tensor of size 1024.
So what I understand is: the number of visual tokens == the number of patches.
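To make my reasoning concrete, here is a small sketch of the token-count arithmetic I have in mind (assuming a plain ViT patch embedding with CLIP ViT-L/14 numbers at 336px resolution; the variable names are my own):

```python
# Token-count arithmetic for a ViT-style patch embedding.
# Numbers assume CLIP ViT-L/14 at 336x336 input.
image_size = 336
patch_size = 14
channels = 3

patches_per_side = image_size // patch_size        # 336 / 14 = 24
num_patches = patches_per_side ** 2                # 24 * 24 = 576
values_per_patch = patch_size * patch_size * channels  # 14 * 14 * 3 = 588

# Each patch is flattened and linearly projected to one token,
# so under this reasoning: visual tokens == patches.
print(num_patches)  # 576
```

Under this calculation, any model using CLIP-L-336/14 on a single 336x336 image should produce 576 visual tokens, which is why the 2880 figure confuses me.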
I want to know where I misunderstood, and I would appreciate an explanation. Thank you!