DAMO-NLP-SG / Video-LLaMA

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
BSD 3-Clause "New" or "Revised" License

Very poor audio understanding #134

Closed DumplingLife closed 10 months ago

DumplingLife commented 10 months ago

I tried the model's online demo with the skateboarding dog video. The audio in this video is a narrator talking, but when I prompt the model with "What do you hear in this video?" or "Ignore what you see in this video. What do you hear?", it gives outputs like "I can hear the sound of a dog walking on a black and white skateboard." I don't think it recognizes the audio at all.

Am I running the model wrong, or is this a limitation of the model itself? (Possibly because the AL branch was trained only on video data, with the hope that audio would still transfer since ImageBind is used as the encoder; see the sketch below.)
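For context, here is a minimal sketch (following the public ImageBind README; import paths match the current package layout and the file paths are placeholders) of how ImageBind maps audio and images into one shared embedding space, which is why I hoped the audio branch would transfer:

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Placeholder inputs: the clip's soundtrack and a frame extracted from it.
audio_paths = ["skateboard_dog.wav"]
image_paths = ["skateboard_dog_frame.jpg"]

# Load the pretrained ImageBind encoder.
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

inputs = {
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
}

with torch.no_grad():
    emb = model(inputs)

# Audio and vision embeddings live in the same space, so their similarity
# is directly comparable; this is the property the AL branch relies on.
sim = torch.cosine_similarity(emb[ModalityType.AUDIO], emb[ModalityType.VISION])
print(sim)
```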

I'm trying to use the audio capability on its own, without video; is there a fix or workaround for this? Thanks!

lixin4ever commented 10 months ago

The model behind the online demo does not activate the AL branch due to GPU memory limits, so you will need to download the checkpoints to your own server and launch the demo locally to try this feature. That said, since we did not use any audio-text data to train the AL branch, the audio understanding capability of Video-LLaMA is still not very satisfactory.
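For anyone attempting the local launch, a rough sketch is below. The script name, config path, and flags are assumptions based on the repository's README layout and may differ in your checkout; the eval config must also be edited to point at the downloaded checkpoints before running.

```python
# Hypothetical local launch of the audio-visual demo (AL branch enabled).
import subprocess

subprocess.run(
    [
        "python", "demo_audiovideo.py",                                # demo entry point that loads both the VL and AL branches
        "--cfg-path", "eval_configs/video_llama_eval_withaudio.yaml",  # eval config referencing the local checkpoints
        "--model_type", "llama_v2",                                    # base LLM variant matching the downloaded weights
        "--gpu-id", "0",                                               # GPU to place the model on
    ],
    check=True,  # raise if the demo process exits with an error
)
```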