dusty-nv / NanoLLM

Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.
https://dusty-nv.github.io/NanoLLM/
MIT License

Question about Video Frame Processing in Live ViLA #25

Open YoungjaeDev opened 4 months ago

YoungjaeDev commented 4 months ago

Hello, I watched a video about Live ViLA and was impressed by the 3B model running on the edge. Regarding this, I'm curious about how frame sequences are processed during video understanding.

  1. How does Live ViLA process video frames?
  2. For example, assuming there are 16 frames (0-15), does it process frames 0-7 and then 8-15 sequentially? (Are there key-frame intervals or overlapping sequences? Since it runs in real time, I'm interested in the internal logic.)

I would appreciate a brief explanation or a link to a relevant technical blog post about this. Thank you for your help.

dusty-nv commented 4 months ago

Hi @YoungjaeDev - VILA is adaptable in how many images it can understand and how you prompt it, although generally I've found a max of 8 images works well. You can find an example of multi-image input for video/action comprehension here:

https://github.com/dusty-nv/NanoLLM/blob/main/nano_llm/vision/video.py
https://www.jetson-ai-lab.com/tutorial_live-llava.html#video-vila

Basically you just construct the chat history with however many images and prompts you want.
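
A rough sketch of that pattern, for illustration only - the model name, frame paths, and exact ChatHistory keyword arguments are assumptions based on the NanoLLM chat examples and may differ by version; the real video loop lives in nano_llm/vision/video.py:

```python
# Sketch: multi-image chat for video/action understanding with NanoLLM.
# Model name, frame paths, and ChatHistory keyword names are assumptions.
from PIL import Image
from nano_llm import NanoLLM, ChatHistory

model = NanoLLM.from_pretrained(
    "Efficient-Large-Model/VILA1.5-3b",  # assumed HF checkpoint name
    api='mlc',
    quantization='q4f16_ft',
)

chat = ChatHistory(model)

# append up to 8 frames (the practical max mentioned above), then the prompt
frame_paths = [f"frame_{i}.jpg" for i in range(8)]   # hypothetical frame files
for path in frame_paths:
    chat.append(role='user', image=Image.open(path))

chat.append(role='user', msg='Describe what happens across these frames.')

embedding, _ = chat.embed_chat()

reply = model.generate(
    embedding,
    kv_cache=chat.kv_cache,
    stop_tokens=chat.template.stop,
    max_new_tokens=128,
)

for token in reply:   # generate() streams tokens as they are produced
    print(token, end='', flush=True)
```

To slide a window over a longer video, you would drop the oldest frame entries from the chat history (or reset it) before appending the next batch of frames.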

YoungjaeDev commented 3 months ago

@dusty-nv

Thank you.

The image resolution is 336, and you didn't use token compression, right?

dusty-nv commented 3 months ago

@YoungjaeDev it depends on the model, but IIRC VILA 1.5 uses a SigLIP 384x384 vision encoder, and the 3B model compresses that down to ~192 image tokens in the projection layers; the authors found spatial resolution to be more impactful than the number of image tokens.
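
For a rough sense of where that token count comes from, here is a back-of-the-envelope sketch. It assumes the SigLIP-SO400M patch-14 encoder at 384x384 and a 2x2 spatial downsample in the projector - both assumptions on my part, not confirmed in this thread:

```python
# Back-of-the-envelope image-token count (assumed encoder/projector config)
image_size = 384                    # assumed SigLIP input resolution
patch_size = 14                     # assumed SigLIP-SO400M patch size

grid = image_size // patch_size     # 27 patches per side
vision_tokens = grid * grid         # 729 tokens out of the vision encoder

# assumed 2x2 spatial downsample in the multimodal projector (grid padded to even)
downsampled_grid = (grid + 1) // 2  # 14
image_tokens = downsampled_grid ** 2  # 196, in the same ballpark as "~192" above

print(vision_tokens, image_tokens)  # 729 196
```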

YoungjaeDev commented 3 months ago

> @YoungjaeDev it depends on the model, but IIRC VILA 1.5 uses a SigLIP 384x384 vision encoder, and the 3B model compresses that down to ~192 image tokens in the projection layers; the authors found spatial resolution to be more impactful than the number of image tokens.

I've only read the original VILA paper, and I believe a paper for version 1.5 hasn't been released yet. Is there a technical report or blog post available for version 1.5?

dusty-nv commented 3 months ago

@YoungjaeDev I believe the original VILA paper was updated, and there are also these technical blogs with lots of details:

Also the team recently published the VILA^2 paper about their future models - https://arxiv.org/abs/2407.17453