KangsanKim07 opened this issue 3 days ago:
Hi, thanks for sharing this great work! I noticed VILA uses 8 frames for video, but if each image takes 576 tokens, the token length for 8 frames would be 4608 (576 × 8), which exceeds model_max_length (4096). It seems all VILA models except the LLaMA3-8B one have a max length of 4096. Could you explain how these models can accept 8 frames of video?

First, we use SigLIP/InternViT, which produce 729/1024 tokens per image. Second, see Section 4.4 of our paper (https://arxiv.org/pdf/2312.07533): we reshape each 2x2 patch into 1 token to reduce the number of tokens by 4x. After this reduction, 8 frames take roughly 8 × 729/4 ≈ 1458 or 8 × 1024/4 = 2048 tokens, which fits within the 4096-token context. Code is here: https://github.com/NVlabs/VILA/blob/main/llava/model/multimodal_projector/base_projector.py#L33
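For illustration, here is a minimal PyTorch sketch of that 2x2 token-merging step. It is not a copy of base_projector.py; the function name, the square-grid assumption, and the padding for odd grids (e.g. SigLIP's 27x27) are assumptions made for this sketch:

```python
import torch
import torch.nn.functional as F

def downsample_2x2(x: torch.Tensor) -> torch.Tensor:
    """Merge each 2x2 patch of vision tokens into one token.

    x: (B, N, C) with N = H*W tokens on a square grid.
    Returns: (B, ~N/4, 4*C).
    """
    b, n, c = x.shape
    h = w = int(n ** 0.5)                 # assumes a square token grid
    x = x.view(b, h, w, c)
    if h % 2:                             # pad odd grids, e.g. SigLIP's 27x27 (assumption)
        x = F.pad(x, (0, 0, 0, 1, 0, 1))  # add one row and one column
        h += 1
        w += 1
    # group rows and columns in pairs: (B, H/2, 2, W/2, 2, C)
    x = x.view(b, h // 2, 2, w // 2, 2, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    # concatenate each 2x2 patch along the channel dim: (B, H*W/4, 4*C)
    return x.view(b, (h // 2) * (w // 2), 4 * c)

# 729 SigLIP tokens per frame -> 196 after padding + downsampling;
# 1024 InternViT tokens -> 256. Eight frames: 8*196 = 1568 or 8*256 = 2048,
# both under a 4096-token context.
tokens = torch.randn(1, 729, 1152)        # SigLIP-SO400M hidden size is 1152
print(downsample_2x2(tokens).shape)       # torch.Size([1, 196, 4608])
```

The 4x-wider merged tokens are then mapped back to the LLM's hidden size by the projector MLP, so the net effect is simply 4x fewer visual tokens per frame.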