NVlabs / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

How can VILA handle 8 frames from a video? #83

Open KangsanKim07 opened 3 days ago

KangsanKim07 commented 3 days ago

Hi, thanks for sharing this great work! I noticed VILA uses 8 frames for video, but if each image takes 576 tokens, then 8 frames amount to 4608 tokens (576 * 8), which exceeds 'model_max_length' (4096). It seems all VILA models except the Llama3-8B one have a max length of 4096. Could you explain how these models can accept 8 frames of video, please?

yaolug commented 3 days ago

First, we use SigLIP/InternViT, which gives 729/1024 tokens per image. Second, as described in Section 4.4 of our paper (https://arxiv.org/pdf/2312.07533), we reshape each 2x2 patch of visual tokens into 1 token, reducing the number of tokens by 4x. Code is here: https://github.com/NVlabs/VILA/blob/main/llava/model/multimodal_projector/base_projector.py#L33
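For illustration, here is a minimal sketch (not the repo's exact `base_projector.py` code) of that 2x2 spatial-to-channel downsampling: four neighboring visual tokens are concatenated along the channel dimension, cutting the token count by 4x before the projector MLP. The tensor shapes and padding strategy below are assumptions for the example.

```python
import math
import torch

def downsample_2x2(x: torch.Tensor) -> torch.Tensor:
    """Merge each 2x2 patch of visual tokens into one token with 4x channels.

    x: (batch, num_tokens, channels), where num_tokens = side * side.
    """
    b, n, c = x.shape
    side = int(math.isqrt(n))
    x = x.view(b, side, side, c)
    # Pad to an even grid if needed (e.g. a 27x27 = 729-token SigLIP grid).
    if side % 2 == 1:
        x = torch.cat([x, x[:, -1:, :, :]], dim=1)  # repeat last row
        x = torch.cat([x, x[:, :, -1:, :]], dim=2)  # repeat last column
        side += 1
    # Group each 2x2 neighborhood and fold it into the channel dimension.
    x = x.view(b, side // 2, 2, side // 2, 2, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // 2) ** 2, 4 * c)
    return x

# Rough token budget with this trick: ~729 tokens/frame -> ~196 tokens/frame,
# so 8 frames cost ~1568 tokens and fit comfortably in a 4096-token context.
tokens = torch.randn(1, 729, 1152)   # hypothetical SigLIP feature dims
print(downsample_2x2(tokens).shape)  # torch.Size([1, 196, 4608])
```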