Efficient-Large-Model / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

Multi image inference quality #79

Open oroojlooy opened 5 days ago

oroojlooy commented 5 days ago

When I pass two images for inference, the model appears to ignore one of them and describes only the other. Here is the command to reproduce the result:

python llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/Llama-3-VILA1.5-8b \
    --conv-mode llama_3 \
    --query "<image>\n Please describe the traffic condition.\n <image> how many kids are in the pic?" \
    --image-file "demo_images/av.png,demo_images/iStock-487419534.jpg"
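To rule out a simple mismatch between the prompt and the file list, here is a minimal sketch that counts the `<image>` placeholders in the query and compares that count with the number of comma-separated paths given to `--image-file`. It is not part of the VILA code base: the helper name `check_image_placeholders` is hypothetical, and it assumes only that each `<image>` placeholder is meant to correspond to one image file, in order.

```python
from typing import Tuple


def check_image_placeholders(
    query: str, image_file_arg: str, token: str = "<image>"
) -> Tuple[int, int]:
    """Hypothetical helper: return (placeholder count, image file count).

    Raises ValueError when the two counts differ.
    """
    # Split the comma-separated argument and drop empty entries.
    image_files = [p.strip() for p in image_file_arg.split(",") if p.strip()]
    n_tokens = query.count(token)
    if n_tokens != len(image_files):
        raise ValueError(
            f"{n_tokens} '{token}' placeholder(s) but {len(image_files)} image file(s)"
        )
    return n_tokens, len(image_files)


# Raw string: inside double quotes the shell passes "\n" as a literal
# backslash followed by "n", not as a newline character.
query = r"<image>\n Please describe the traffic condition.\n <image> how many kids are in the pic?"
image_file_arg = "demo_images/av.png,demo_images/iStock-487419534.jpg"

print(check_image_placeholders(query, image_file_arg))  # prints (2, 2)
```

For the command above, both counts are 2, so the input has one placeholder per image and the problem does not seem to be a placeholder mismatch.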

The kids image is from https://ecovillage.org/wp-content/uploads/2018/03/iStock-487419534.jpg

Any suggestions?