Model keeps outputting "there is no sound / I cannot hear anything" when there is actual sound

DAMO-NLP-SG / VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Apache License 2.0

849 stars, 58 forks

Model keeps outputting "there is no sound / I cannot hear anything" when there is actual sound #80

Closed: qixueweigitbub closed this issue 1 week ago

qixueweigitbub commented 2 months ago

Thanks for the great work!

I tested different videos with sound (speech, music, noise, etc.) using the online demo, but the model keeps ignoring the sound information no matter how I ask about it in the prompt, explicitly or implicitly. Is it because the audio track of the video is not loaded properly? Can you provide any hints here? Does anyone else have the same issue?

lixin4ever commented 2 months ago

Sorry for the confusion. The currently available models do not include an audio branch, so they do not take audio as input.

xinyifei99 commented 1 week ago

Thanks for your attention! You can clone the repository and switch to the audio_visual branch (https://github.com/DAMO-NLP-SG/VideoLLaMA2/tree/audio_visual) to run inference for audio-related tasks.
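
For anyone landing here later, a minimal sketch of fetching that branch from a script. It only wraps the standard `git clone --branch ... --single-branch` command. The `.git` clone URL is assumed from the branch link above, and the helper names and the `VideoLLaMA2` destination directory are hypothetical, not something the repository defines.

```python
import subprocess

# Assumption: clone URL derived from the branch link posted in this thread.
REPO_URL = "https://github.com/DAMO-NLP-SG/VideoLLaMA2.git"
BRANCH = "audio_visual"


def build_clone_command(repo_url=REPO_URL, branch=BRANCH, dest="VideoLLaMA2"):
    """Return the git command that clones only the given branch into dest."""
    return ["git", "clone", "--branch", branch, "--single-branch", repo_url, dest]


def clone_audio_visual(dest="VideoLLaMA2"):
    """Run the clone; raises subprocess.CalledProcessError if git fails."""
    subprocess.run(build_clone_command(dest=dest), check=True)


# Example (needs git and network access), left commented out on purpose:
# clone_audio_visual()
```

If you already have a clone of the default branch, running `git fetch origin audio_visual` followed by `git checkout audio_visual` inside it should get you to the same place. Setup and inference steps for the audio models are not covered in this thread, so check the README on that branch.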