Hello,
1. I have set up Video-LLaMA from this repo. I have downloaded all the checkpoints for inference:
The Gradio demo demo_video.py runs perfectly fine; there are no errors when loading the vision encoder, BLIP-2, and LLaMA-2 weights.
2. However, inference gives totally wrong answers, and it seems the model is not using the vision encoder. For example, asking the model about the following photo gives irrelevant answers.
3. I wonder what may have gone wrong in my setup; a rough config sanity check is sketched after this list.
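In case it helps with debugging, here is a minimal sketch of the kind of check I would run to confirm the eval config actually points at the downloaded weights before launching the demo. The config path and key names (llama_model, q_former_model, ckpt, ckpt_2) are my guesses and may not match the repo's exact schema.

```python
# Rough sanity check: print the checkpoint paths from the eval config and
# verify that the files exist on disk. The config filename and key names
# below are guesses on my side -- adjust them to whatever the repo uses.
import os
import yaml

CONFIG = "eval_configs/video_llama_eval.yaml"  # assumed config path

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

model_cfg = cfg.get("model", {})
for key in ("llama_model", "q_former_model", "ckpt", "ckpt_2"):  # assumed keys
    path = model_cfg.get(key)
    exists = path is not None and os.path.exists(str(path))
    print(f"{key}: {path} (exists: {exists})")
```

If any of these paths come back missing or wrong, that would at least rule a checkpoint-path problem in or out before digging into the model itself.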
Thank you very much!!