Closed DumplingLife closed 10 months ago
The model behind the online demo does not activate the AL branch due to GPU memory limits, so you would have to download the checkpoints to your own server (and launch the demo locally) to try this feature. That said, since we don't use any audio-text data to train the AL branch, the audio understanding capability of our Video-LLaMA is still not very satisfactory.
I tried the online demo with the skateboarding-dog video. The audio for this video is a narrator talking, but when I prompt the model with "What do you hear in this video?" or "Ignore what you see in this video. What do you hear?", it gives me outputs like "I can hear the sound of a dog walking on a black and white skateboard." I don't think it recognizes the audio at all.
Am I running the model wrong, or is this a limitation of the model itself? (Possibly because the AL branch was trained only on video data, with the hope that audio understanding would transfer via the shared ImageBind embedding space.)
I'm trying to use the audio capability on its own, without video, so I'm wondering if there's a fix for this. Thanks!