YuanGongND / ltu

Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".

How to process audio that exceeds 10 seconds in length #12

Open qisawO3 opened 5 months ago

qisawO3 commented 5 months ago

Hello, I would like to ask: how do you test audio in the LibriSpeech dataset that exceeds 10 seconds in duration? I'm encountering an issue while using LibriSpeech for speech recognition. Despite modifying the pad_or_trim method in the transcribe_audio function, I still get mismatched model dimensions. How can I resolve this?

YuanGongND commented 5 months ago

hi there,

It is non-trivial to adapt the model for a different length, and we didn't do it.

For ASR eval, in LTU-AS, we input the full transcription from Whisper and the first 10-second audio features (from Whisper Encoder) to the LLaMA model.
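For context, Whisper-style front ends force a fixed-length acoustic input by trimming long clips and zero-padding short ones. A minimal pure-Python sketch of that step at a 10-second window (16 kHz sampling assumed; this is an illustration, not the repository's actual pad_or_trim):

```python
SAMPLE_RATE = 16000                 # Whisper's expected sampling rate
TARGET_SAMPLES = SAMPLE_RATE * 10   # 10-second acoustic window

def pad_or_trim(samples: list, length: int = TARGET_SAMPLES) -> list:
    """Force a mono waveform to exactly `length` samples."""
    if len(samples) >= length:
        return samples[:length]     # truncate: everything past 10 s is dropped
    return samples + [0.0] * (length - len(samples))  # zero-pad short clips

long_clip = [0.5] * (SAMPLE_RATE * 25)   # 25-second clip
short_clip = [0.5] * (SAMPLE_RATE * 4)   # 4-second clip
print(len(pad_or_trim(long_clip)))       # 160000
print(len(pad_or_trim(short_clip)))      # 160000
```

Note that trimming only affects the encoder features; the Whisper transcription fed to the LLM is still produced from the full recording.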

-Yuan

qisawO3 commented 5 months ago

Thank you for your response. I would like to ask further, for audio that exceeds 10 seconds in length, if only the first 10 seconds of audio features are used for ASR, then when calculating the WER during testing, should the label content also be truncated to the first 10 seconds?

YuanGongND commented 5 months ago

What we did is input the first 10-second audio features (from the Whisper encoder) and the full Whisper transcription to the LLM. The transcription can be arbitrarily long (it is not limited to 10 s).
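The two inputs described above can be sketched as follows; all names here are illustrative placeholders, not the actual LTU-AS code, and the prompt format is an assumption:

```python
def build_llm_inputs(audio_features_10s, full_transcription: str, question: str):
    """Combine the fixed-length acoustic features (first 10 s only) with the
    full-length Whisper transcription into one set of LLM inputs."""
    prompt = (
        f"Spoken text: {full_transcription}\n"  # full transcription, any length
        f"Question: {question}"
    )
    return {"audio_features": audio_features_10s, "prompt": prompt}

inputs = build_llm_inputs(
    audio_features_10s=[[0.1, 0.2], [0.3, 0.4]],  # placeholder encoder frames
    full_transcription="hello world this sentence runs past ten seconds",
    question="What is this audio segment saying?",
)
```

The point of the split: the acoustic branch sees only 10 s, but the text branch carries the entire spoken content, so ASR-style answers are not truncated.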

-Yuan

qisawO3 commented 5 months ago

So, let me give an example. Suppose my question is: "What is this audio segment saying?" Does LTU-AS output the entire text, and not just the text contained within those 10 seconds of speech? In other words, is the ASR performance of LTU-AS entirely provided by Whisper?

YuanGongND commented 5 months ago

What is this audio segment saying? Does LTU-AS output the entire text, and not just the text contained within those 10 seconds of speech?

No, LTU-AS almost never cuts its output down to only the spoken text in the first 10 seconds. In fact, it does not know exactly what speech is in the first 10 seconds: its acoustic input is the Whisper encoder output, and without the decoder it cannot do ASR on its own.

In other words, is the ASR performance of LTU-AS entirely provided by Whisper?

While it copies the Whisper output in most cases, it occasionally does not follow the instruction or adds something that is not in the spoken text, which makes its WER actually higher than the raw Whisper output. Please check our paper for details.
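To make the WER point concrete, here is a minimal word-level edit-distance sketch (not the paper's evaluation code) showing how inserted words raise WER even when every reference word is recovered:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the cat sat on the mat"
exact = "the cat sat on the mat"
chatty = "sure the cat sat on the mat happily"  # extra words added by the LLM
print(wer(ref, exact))   # 0.0
print(wer(ref, chatty))  # 2 insertions / 6 reference words = 0.333...
```

Every reference word appears in the "chatty" hypothesis, yet its WER is nonzero because insertions count as errors; this is why an LLM that embellishes the transcript can score worse than the Whisper output it copies from.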

-Yuan