Incorrect model inference (what went wrong in my setup)

Hello,

1. I have set up Video-LLaMA from the DAMO-NLP-SG/Video-LLaMA repo and downloaded all the checkpoints needed for inference:

I am using VL_LLAMA_2_7B_Finetuned.pth and llama-2-7b-chat-hf from this Hugging Face repo.
The BLIP-2 Q-Former checkpoint is from https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth
The ViT-g vision encoder checkpoint is from https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth
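
One sanity check that might be worth running on a checkpoint list like the one above is whether any two components point at the same file. The sketch below is only an illustration: `find_shared_sources` and the component labels are hypothetical names that are not part of Video-LLaMA, and the URLs are copied verbatim from the list above.

```python
from collections import defaultdict

# Hypothetical component labels; the URLs are copied verbatim from the list above.
CHECKPOINT_SOURCES = {
    "blip2_q_former": "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth",
    "vit_g_vision_encoder": "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth",
}


def find_shared_sources(sources):
    """Return {source: [components]} for every source used by more than one component."""
    by_source = defaultdict(list)
    for component, source in sources.items():
        by_source[source].append(component)
    # Keep only the sources that are assigned to two or more components.
    return {src: sorted(names) for src, names in by_source.items() if len(names) > 1}


if __name__ == "__main__":
    for source, components in find_shared_sources(CHECKPOINT_SOURCES).items():
        print(f"{', '.join(components)} share the same source: {source}")
```

If two components do share a source, it may be worth double-checking that each file really contains the weights for the component it is assigned to. On its own this would not prove that it is the cause of the wrong answers.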

The Gradio demo demo_video.py starts without problems, and there are no errors when loading the vision encoder, BLIP-2, and LLaMA-2 weights.

2. However, inference gives completely wrong answers, and it seems that the model is not using the vision encoder. For example, asking the model about the following photo gives irrelevant answers.

3. I wonder what might be wrong in my setup.

Thank you very much!!

DAMO-NLP-SG/Video-LLaMA, issue #145