When I pass two images for inference, it looks like the model ignores one of the images and only talks about the other. Here is the script to replicate the result:
python llava/eval/run_vila.py --model-path Efficient-Large-Model/Llama-3-VILA1.5-8b --conv-mode llama_3 --query "<image>\n Please describe the traffic condition.\n <image> how many kids are in the pic?" --image-file "demo_images/av.png,demo_images/iStock-487419534.jpg"
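As a quick sanity check on my side (not part of the repo's documented usage), I verified that the comma-separated --image-file argument actually yields two paths; this is just a sketch assuming the script splits the list on commas:

```python
# Mirrors the --image-file argument from the command above
image_file = "demo_images/av.png,demo_images/iStock-487419534.jpg"

# Assumption: run_vila.py splits the comma-separated string into individual paths
paths = [p.strip() for p in image_file.split(",")]
print(paths)  # expect two entries, so two images should reach the model

# Each path could then be loaded, e.g. with PIL.Image.open(p),
# provided the files exist locally.
```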
I am getting the kids image from https://ecovillage.org/wp-content/uploads/2018/03/iStock-487419534.jpg
Any suggestions?