When I pass two images for inference, it looks like the model ignores one of the images and only talks about the other. Here is the script to replicate the result:
python llava/eval/run_vila.py --model-path Efficient-Large-Model/Llama-3-VILA1.5-8b --conv-mode llama_3 --query "<image>\n Please describe the traffic condition.\n <image> how many kids are in the pic?" --image-file "demo_images/av.png,demo_images/iStock-487419534.jpg"
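As a quick sanity check on my side (not part of the repo's documented usage), I verified that the comma-separated --image-file argument actually yields two paths; this is just a sketch assuming the script splits the list on commas:

```python
# Mirrors the --image-file argument from the command above
image_file = "demo_images/av.png,demo_images/iStock-487419534.jpg"

# Assumption: run_vila.py splits the comma-separated string into individual paths
paths = [p.strip() for p in image_file.split(",")]
print(paths)  # expect two entries, so two images should reach the model

# Each path could then be loaded, e.g. with PIL.Image.open(p),
# provided the files exist locally.
```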
I am getting the kids image from https://ecovillage.org/wp-content/uploads/2018/03/iStock-487419534.jpg
Any suggestions?