yangyuxiang1996 opened 1 year ago

Hello, I've been reading the LTU-AS paper recently, and I'm a bit confused about the ablation experiments in the paper. It states that using only spoken text as input during inference resulted in a WER of 20.0 on LibriSpeech. I'm wondering why it is so high, since decoding with the original Whisper model alone shouldn't lead to such a significant performance drop. Thank you!
Thanks for the question.
The LTU-AS model is trained with two types of input: [continuous audio tokens, spoken text], or [continuous audio tokens only] when the audio clip contains no speech. It has never seen inputs of the form [spoken text only].
In the ablation study you mentioned, the input is spoken text only, without continuous audio tokens. This mismatch with the training setting causes the model to occasionally ignore the instruction for the ASR task, which leads to a high WER.
-Yuan
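
For illustration, here is a minimal sketch of the three input configurations. The function and token names (`build_prompt`, `<audio_tok_*>`) are hypothetical placeholders, not the actual LTU-AS code; the point is only that the ablation feeds the model a prompt shape it never saw during training.

```python
# Hypothetical sketch of LTU-AS-style input construction (not the actual repo code).
from typing import List, Optional

def build_prompt(audio_tokens: Optional[List[str]], spoken_text: Optional[str],
                 instruction: str = "Transcribe the speech.") -> str:
    """Assemble a prompt from continuous-audio-token placeholders and spoken text."""
    parts = []
    if audio_tokens is not None:
        parts.append(" ".join(audio_tokens))          # continuous audio tokens
    if spoken_text is not None:
        parts.append(f"Spoken text: {spoken_text}")   # Whisper transcript
    parts.append(f"Instruction: {instruction}")
    return "\n".join(parts)

audio = ["<audio_tok_1>", "<audio_tok_2>", "<audio_tok_3>"]  # placeholder tokens

# Seen during training: audio tokens + spoken text.
print(build_prompt(audio, "hello world"))

# Also seen during training: audio tokens only (non-speech clips).
print(build_prompt(audio, None))

# The ablation setting: spoken text only -- never seen in training,
# so the model may ignore the ASR instruction, inflating WER.
print(build_prompt(None, "hello world"))
```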