NVlabs / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)

Context size and examples for LongVILA #141

Open yulinzou opened 1 month ago

yulinzou commented 1 month ago

Hello,

I'm new to LLM serving and multi-modal LLMs. I'm looking for an example for the LongVILA model similar to the one below for the VILA1.5 models:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/VILA1.5-13b \
    --conv-mode vicuna_v1 \
    --query "<video>\n Describe what happened in the video." \
    --video-file "./example.mp4" \
    --num-video-frames 20

Specifically, I'd like to know which conv-mode I should use and the maximum number of frames for both the LongVILA and VILA1.5 models. I also noticed the paper mentions a downsampler that reduces the number of tokens per image; do you have an example of how to use it?
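
For reference, this is the invocation I've been guessing at for LongVILA; the model path, conv mode, and frame count below are only my guesses, so please correct any of them:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/Llama-3-LongVILA-8B-128Frames \
    --conv-mode llama_3 \
    --query "<video>\n Describe what happened in the video." \
    --video-file "./example.mp4" \
    --num-video-frames 128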

Thanks!

Lyken17 commented 3 days ago

@yukang2017 can you help confirm the context length? I think the conv mode should be llama3.
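
In the meantime, one way to check the context length of whichever checkpoint you downloaded is to read it from the top-level config.json; for the VILA-style checkpoints I've looked at, the LLM settings are nested under llm_cfg (the path below is a placeholder for your local checkpoint directory):

python -c "import json; cfg = json.load(open('path/to/LongVILA-checkpoint/config.json')); print(cfg.get('llm_cfg', cfg).get('max_position_embeddings'))"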