NVlabs / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)

Context size and examples for LongVILA #141

Open yulinzou opened 1 month ago

yulinzou commented 1 month ago

Hello,

I'm new to LLM serving and multi-modal LLMs. I'm looking for an example for the LongVILA model similar to the one below for the VILA1.5 models:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/VILA1.5-13b \
    --conv-mode vicuna_v1 \
    --query "<video>\n Describe what happened in the video." \
    --video-file "./example.mp4" \
    --num-video-frames 20

Specifically, I'd like to know which conv-mode I should use and the maximum number of frames for both the LongVILA and VILA1.5 models. I also noticed the paper mentions a downsampler that reduces the number of tokens per image; do you have an example of how to use it?
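
For reference, this is the invocation I've been guessing at for LongVILA; the model path, conv mode, and frame count below are only my guesses, so please correct any of them:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/Llama-3-LongVILA-8B-128Frames \
    --conv-mode llama_3 \
    --query "<video>\n Describe what happened in the video." \
    --video-file "./example.mp4" \
    --num-video-frames 128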

Thanks!

Lyken17 commented 3 days ago

@yukang2017 can you help confirm the context length? I think the conv mode should be llama3.
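
In the meantime, one way to check the context length of whichever checkpoint you downloaded is to read it from the top-level config.json; for the VILA-style checkpoints I've looked at, the LLM settings are nested under llm_cfg (the path below is a placeholder for your local checkpoint directory):

python -c "import json; cfg = json.load(open('path/to/LongVILA-checkpoint/config.json')); print(cfg.get('llm_cfg', cfg).get('max_position_embeddings'))"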