Single modality input on Video-salmonn

qixueweigitbub commented 1 week ago

Dear authors, thanks for your great work and contribution to the research community.

In my use case, I need to use the video-salmonn model for reasoning on audio files only. I know I can use the original SALMONN model for audio reasoning, but my deployment cannot host two large models. Is it possible to input just an audio file to video-salmonn and get outputs?

And if so, will the performance on the audio modality be similar to SALMONN, which is trained specifically for audio?

BriansIDP commented 1 week ago

To use audio-only mode, please first change lines 7-10 of video_salmonn/config/test.yaml to:

all_decode_info: [
  ["audio", "audio_input", "Your example audio-only json file"]
]

Then, in your audio-only json file, please use the same format as example.json, but provide only a single audio path for "image_name". For example, one data item could be the following:

{
    "image_name": "./dummy/4405327307.wav",
    "conversation": [
        {
            "from": "human",
            "value": "Describe the audio in detail"
        },
        {
            "from": "gpt",
            "value": "None"
        }
    ]
}
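
If you have many audio files, a small script can generate the audio-only json file for you. The following is only a sketch: the helper names `build_audio_only_items` and `write_audio_only_json` are hypothetical and not part of the repository, and it assumes the json file is a top-level list of data items like the one above (please check example.json to confirm).

```python
import json
from pathlib import Path


def build_audio_only_items(audio_paths, prompt="Describe the audio in detail"):
    """Return one data item per audio path, mirroring the item shown above."""
    items = []
    for path in audio_paths:
        items.append({
            # The key is still "image_name" even though it holds an audio path.
            "image_name": str(path),
            "conversation": [
                {"from": "human", "value": prompt},
                # Placeholder reply, copied from the example item.
                {"from": "gpt", "value": "None"},
            ],
        })
    return items


def write_audio_only_json(audio_paths, out_file, prompt="Describe the audio in detail"):
    """Write the items as a JSON list (ASSUMED top-level layout of example.json)."""
    items = build_audio_only_items(audio_paths, prompt)
    Path(out_file).write_text(json.dumps(items, indent=4), encoding="utf-8")
    return items
```

For instance, `write_audio_only_json(["./dummy/4405327307.wav"], "audio_only.json")` would write a one-item file, whose path you would then put in the `all_decode_info` entry of test.yaml.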

Audio-only performance is worse and less robust to noise than SALMONN, because we use much less audio/speech training data than SALMONN does. Please compare the ASR/AAC numbers in both papers to understand the performance difference.

TCL606 commented 1 week ago

Feel free to reopen this issue if there are still problems.
