DAMO-NLP-SG / Video-LLaMA

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
BSD 3-Clause "New" or "Revised" License

Incorrect model inference (what went wrong in my setup) #145

Open jennyziyi-xu opened 9 months ago

jennyziyi-xu commented 9 months ago


Hello,

1, I have set up Video-LLaMA from this repo. I have downloaded all checkpoints for inference:

The Gradio demo demo_video.py runs without problems, and there are no errors when loading the vision encoder, BLIP-2, or LLaMA-2 weights.

2, However, inference gives completely wrong answers, and it seems that the model is not using the vision encoder. For example, asking the model about the following photo gives irrelevant answers.

3, I wonder what might be wrong in my setup.

Thank you very much!!
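
One check that may help narrow this down: in PyTorch, `load_state_dict(..., strict=False)` does not raise on mismatched parameter names and instead returns the missing and unexpected keys, so an error-free load does not by itself show that the vision encoder weights were applied. Whether this repo loads with `strict=False` in the code path used here is an assumption to verify. The sketch below compares two lists of parameter names and groups the mismatches by top-level module. The helper names and the example keys are hypothetical, not taken from the Video-LLaMA code.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, sorted for stable output.

    missing:    names the model expects but the checkpoint does not provide
    unexpected: names the checkpoint provides but the model does not have
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


def summarize_by_prefix(keys):
    """Count keys by their first dotted component, e.g. 'visual_encoder'."""
    counts = {}
    for key in keys:
        prefix = key.split(".", 1)[0]
        counts[prefix] = counts.get(prefix, 0) + 1
    return counts


# Hypothetical parameter names, for illustration only.
model_keys = [
    "visual_encoder.blocks.0.weight",
    "visual_encoder.blocks.1.weight",
    "llama_proj.weight",
]
checkpoint_keys = [
    "llama_proj.weight",
    "module.visual_encoder.blocks.0.weight",  # e.g. a stray wrapper prefix
]

missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
print("missing by module:", summarize_by_prefix(missing))
print("unexpected by module:", summarize_by_prefix(unexpected))
```

In a real setup the two lists would come from `model.state_dict().keys()` and from the keys of the dictionary returned by `torch.load(path, map_location="cpu")` (possibly nested under a sub-key, depending on how the checkpoint was saved). A large count of missing keys under the vision-related modules would be consistent with the symptom described above.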