Hello,

I'm new to LLM serving and multi-modal LLMs. I'm looking for a usage example for the LongVILA model, similar to this one for the VILA1.5 models:

python -W ignore llava/eval/run_vila.py --model-path Efficient-Large-Model/VILA1.5-13b --conv-mode vicuna_v1 --query "<video>\n Describe what happened in the video." --video-file "./example.mp4" --num-video-frames 20
Specifically, I'd like to know which conv-mode I should use and the maximum number of video frames supported for both LongVILA and the VILA1.5 models. I also noticed that the paper mentions a downsampler that can reduce the number of tokens per image; do you have an example of how to use it?
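For context, here is my best guess at a LongVILA invocation, reusing the same run_vila.py entry point. The model path, conv-mode, and frame count below are only assumptions on my part, which is partly why I'm asking:

# Guessed invocation -- the model path, conv-mode, and --num-video-frames values are assumptions, not confirmed settings:
python -W ignore llava/eval/run_vila.py --model-path Efficient-Large-Model/Llama-3-LongVILA-8B-1024Frames --conv-mode llama_3 --query "<video>\n Describe what happened in the video." --video-file "./example.mp4" --num-video-frames 256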
Thanks!