DAMO-NLP-SG / Video-LLaMA

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
BSD 3-Clause "New" or "Revised" License

Very poor audio understanding #134

Closed DumplingLife closed 10 months ago

DumplingLife commented 10 months ago

I tried the model's online demo with the skateboarding dog video. The audio in this video is a narrator talking, but when I prompt the model with "What do you hear in this video?" or "Ignore what you see in this video. What do you hear?", it gives outputs like "I can hear the sound of a dog walking on a black and white skateboard." I don't think it recognizes the audio at all.

Am I running the model wrong, or is this a limitation of the model itself? (Possibly because the AL branch was trained only on video data, with the hope that audio would still transfer since ImageBind is used as the encoder; see the sketch below.)
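For context, here is a minimal sketch (following the public ImageBind README; import paths match the current package layout and the file paths are placeholders) of how ImageBind maps audio and images into one shared embedding space, which is why I hoped the audio branch would transfer:

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Placeholder inputs: the clip's soundtrack and a frame extracted from it.
audio_paths = ["skateboard_dog.wav"]
image_paths = ["skateboard_dog_frame.jpg"]

# Load the pretrained ImageBind encoder.
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

inputs = {
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
}

with torch.no_grad():
    emb = model(inputs)

# Audio and vision embeddings live in the same space, so their similarity
# is directly comparable; this is the property the AL branch relies on.
sim = torch.cosine_similarity(emb[ModalityType.AUDIO], emb[ModalityType.VISION])
print(sim)
```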

I'm trying to use the audio capability on its own, without video; is there a fix or workaround for this? Thanks!

lixin4ever commented 10 months ago

The model behind the online demo does not activate the AL branch due to GPU memory limits, so you will need to download the checkpoints to your own server and launch the demo locally to try this feature. That said, since we did not use any audio-text data to train the AL branch, the audio understanding capability of Video-LLaMA is still not very satisfactory.
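For anyone attempting the local launch, a rough sketch is below. The script name, config path, and flags are assumptions based on the repository's README layout and may differ in your checkout; the eval config must also be edited to point at the downloaded checkpoints before running.

```python
# Hypothetical local launch of the audio-visual demo (AL branch enabled).
import subprocess

subprocess.run(
    [
        "python", "demo_audiovideo.py",                                # demo entry point that loads both the VL and AL branches
        "--cfg-path", "eval_configs/video_llama_eval_withaudio.yaml",  # eval config referencing the local checkpoints
        "--model_type", "llama_v2",                                    # base LLM variant matching the downloaded weights
        "--gpu-id", "0",                                               # GPU to place the model on
    ],
    check=True,  # raise if the demo process exits with an error
)
```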