YoungjaeDev opened this issue 3 months ago
Hi @YoungjaeDev - VILA is adaptable in how many images it can take and how you prompt it, although I've generally found a max of 8 images works well. You can find an example of multi-image input for video/action comprehension here:
https://github.com/dusty-nv/NanoLLM/blob/main/nano_llm/vision/video.py
https://www.jetson-ai-lab.com/tutorial_live-llava.html#video-vila
Basically, you just construct the chat history with however many images and prompts you want.
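For reference, here's a minimal sketch of that pattern, loosely based on the NanoLLM multimodal examples - the model name, keyword arguments, and `ChatHistory.append()` usage are my best guesses, so check them against the current API and video.py:

```python
from nano_llm import NanoLLM, ChatHistory

# load a VILA model (model name and quantization settings are examples, adjust to your setup)
model = NanoLLM.from_pretrained(
    "Efficient-Large-Model/VILA1.5-3b",
    api='mlc',
    quantization='q4f16_ft',
)

chat_history = ChatHistory(model)

# interleave several images with a prompt in the chat history
frames = ['frame_000.jpg', 'frame_001.jpg', 'frame_002.jpg']

for frame in frames:
    chat_history.append(role='user', image=frame)

chat_history.append(role='user', msg='What is happening across these frames?')

# embed the accumulated chat and generate a reply
embedding, _ = chat_history.embed_chat()

reply = model.generate(
    embedding,
    streaming=False,
    kv_cache=chat_history.kv_cache,
    max_new_tokens=64,
)

print(reply)

# keep the reply in the history if you want to continue the conversation
chat_history.append(role='bot', msg=reply)
```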
@dusty-nv
Thank you.
The image resolution is 336, and you didn't use token compression, right?
@YoungjaeDev it depends on the model, but IIRC VILA 1.5 uses a SigLIP 384x384 vision encoder, and the 3B model compresses that down to ~192 image tokens with its projection layers; the authors found spatial resolution to be more impactful than the number of image tokens.
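To make the token math concrete, here's a rough back-of-the-envelope sketch - the patch size and the 2x2 spatial downsampling in the projector are assumptions on my part, so treat the numbers as approximate:

```python
# rough token-count arithmetic (patch size and downsample factor are assumptions)
image_size = 384     # SigLIP input resolution
patch_size = 14      # assumed ViT patch size
downsample = 2       # assumed 2x2 spatial downsampling in the projector

patches_per_side = image_size // patch_size     # 384 // 14 = 27
encoder_tokens = patches_per_side ** 2          # 27 * 27 = 729 vision tokens

# a 2x2 downsampling projector merges neighboring patches,
# padding the 27x27 grid up to 28x28 before grouping
padded_side = patches_per_side + (patches_per_side % downsample)   # 28
image_tokens = (padded_side // downsample) ** 2                    # 14 * 14 = 196

print(encoder_tokens, image_tokens)   # 729 -> ~196 tokens per image (the ~192 ballpark)
```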
I've only read the original VILA paper, and I don't believe anything for version 1.5 has been released yet. Is there a technical report or blog post available for version 1.5?
@YoungjaeDev I believe the original VILA paper was updated, and there are also these technical blogs with lots of details:
Also, the team recently published the VILA^2 paper about their upcoming models - https://arxiv.org/abs/2407.17453
Hello, I watched a video about Live VILA and was impressed by the 3B model running on the edge. I'm curious how frame sequences are processed during video understanding.
I would appreciate a brief explanation or a link to a relevant technical blog post about this. Thank you for your help.