YuanGongND / ltu

Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".

About the experimental results of the paper LTU-AS #4

Open yangyuxiang1996 opened 9 months ago

yangyuxiang1996 commented 9 months ago

Hello, I've been reading the LTU-AS paper recently, and I'm a bit confused about the ablation experiments in the paper. It states that using only spoken text as input during inference resulted in a WER of 20.0 on LibriSpeech. I'm wondering why it's so high, because it seems like decoding with the original Whisper model shouldn't lead to such a significant performance drop. Thank you!

YuanGongND commented 9 months ago

Thanks for the question.

The LTU-AS model is trained with two types of data: [continuous audio token, spoken text], or [continuous audio token only] when the audio clip does not contain speech. It has never seen data like [spoken text only].

In the ablation study you mentioned, the input is spoken text only, without the continuous audio token. This is a mismatch with the training setting, and it causes the model to occasionally not follow the instruction for the ASR task, which leads to a high WER.
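To illustrate why occasional off-task outputs can dominate the score, here is a minimal sketch of corpus-level WER on toy data. The sentences and the resulting number are made up for illustration and are not from the paper, and `edit_distance` and `corpus_wer` are hypothetical helpers, not the evaluation script used for LTU-AS.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(refs, hyps):
    """Corpus WER in percent: total word errors / total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return 100.0 * errors / words


# Toy data (assumption): 10 utterances, 9 transcribed perfectly,
# 1 where the model describes the audio instead of transcribing it.
refs = ["the cat sat on mat"] * 10
off_task = ("this audio clip contains a person speaking in a calm voice "
            "with no background noise")
hyps = refs[:9] + [off_task]

print(corpus_wer(refs, refs))  # 0.0
print(corpus_wer(refs, hyps))  # 30.0
```

In this toy case a single off-task answer out of ten takes the corpus WER from 0 to 30, because a long non-transcript response counts as many insertions and substitutions against a short reference. So a high WER here does not necessarily mean the transcripts themselves are poor.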
