YuanGongND / ltu

Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".

About the experimental results of the paper LTU-AS #4

Open · yangyuxiang1996 opened this issue 9 months ago

yangyuxiang1996 commented 9 months ago

Hello, I've been reading the LTU-AS paper recently, and I'm a bit confused about the ablation experiments. The paper states that using only spoken text as input during inference resulted in a WER of 20.0 on LibriSpeech. I'm wondering why it is so high: decoding with the original Whisper model alone shouldn't lead to such a significant performance drop. Thank you!
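(For reference, WER in such evaluations is the standard word error rate: word-level edit distance divided by the number of reference words. Below is a minimal sketch of how such a score is typically computed, using the common `jiwer` package; this is illustrative and not necessarily the paper's exact evaluation script.)

```python
# Minimal sketch of a standard WER computation; requires `pip install jiwer`.
# The sentences here are illustrative, not from the paper's test set.
import jiwer

references = [
    "mister quilter is the apostle of the middle classes",
    "and we are glad to welcome his gospel",
]
hypotheses = [
    "mister quilter is the apostle of the middle classes",
    "and we are glad to welcome his gospels",
]

# jiwer computes (substitutions + deletions + insertions) / reference words
wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer:.1%}")  # one substituted word out of seventeen -> ~5.9%
```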

YuanGongND commented 9 months ago

Thanks for the question.

The LTU-AS model is trained with two types of input: [continuous audio tokens, spoken text], or [continuous audio tokens only] when the audio clip does not contain speech. It has never seen inputs of the form [spoken text only].

In the ablation study you mentioned, the input is spoken text only, without the continuous audio tokens. This is a mismatch with the training setting, which causes the model to occasionally fail to follow the instruction for the ASR task, and that leads to a high WER.
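For illustration, here is a minimal sketch of the mismatch described above. The function name, prompt layout, and placeholder token are hypothetical, not the repo's actual API; the point is only that the [spoken text only] form never appears during training:

```python
# Hypothetical sketch of the train/inference input mismatch; names and
# prompt format are illustrative, not LTU-AS's actual preprocessing code.

def build_input(audio_tokens, spoken_text, instruction):
    """Assemble a model input in the forms LTU-AS sees during training."""
    parts = []
    if audio_tokens is not None:
        parts.append(audio_tokens)  # continuous audio tokens from the encoder
    if spoken_text is not None:
        parts.append(f"Spoken text: {spoken_text}")
    parts.append(f"Instruction: {instruction}")
    return parts

# Seen in training: [continuous audio tokens, spoken text]
train_style = build_input(audio_tokens="<audio_embeds>",
                          spoken_text="hello world",
                          instruction="Transcribe the speech.")

# The ablation's inference input: [spoken text only] -- a format the model
# was never trained on, so it may ignore the ASR instruction, inflating WER.
ablation_style = build_input(audio_tokens=None,
                             spoken_text="hello world",
                             instruction="Transcribe the speech.")
```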

-Yuan