YuanGongND / ltu

Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".

Question: Why are the prompts for training and inference for audio event classification different? #34

Open peggyxpxu opened 5 months ago

peggyxpxu commented 5 months ago

Hi, sir: I find that the prompts for training and for testing audio event classification are different in the code. In the training task "cla_label", one example question is "Identify the audio's noise? Produce solely audio identifiers." These questions are all directly related to the classification task. But at inference, all audio event classification questions are asked in the form of an audio caption, for example "Write an audio caption describing the sound?". May I ask why different questions are used during training and testing? Why not use the same type of prompt as during training? Won't this affect the test results? Thanks!

YuanGongND commented 5 months ago

hi there, thanks for the question.

We mentioned this in the paper, page 6, Section 5.1, under subsection "audio classification":

We tested two prompts, “classify the sound events in the audio clip” and “write an audio caption describing the sound”; while both led to good results in our subjective evaluation, the latter led to better text embeddings for the automatic evaluation framework and is used for benchmarking.

May I ask why different questions are used during training and testing? Why not use the same type of prompt as during training?

Training includes both classification and captioning prompts; we just benchmark with the captioning prompt because it performs better.
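To make this concrete, here is a hypothetical sketch of how the two task types can sit side by side in the same instruction-tuning mix (the field names and example strings are illustrative assumptions, not this repo's exact dataset schema):

```python
# Hypothetical illustration of the two task types in the training mix.
# Field names and example values are assumptions for illustration only;
# the actual dataset schema used by this repo may differ.
train_examples = [
    {
        "task": "cla_label",  # closed-ended classification prompt
        "instruction": "Identify the audio's noise? Produce solely audio identifiers.",
        "output": "Siren; Emergency vehicle",
    },
    {
        "task": "caption",  # open-ended captioning prompt
        "instruction": "Write an audio caption describing the sound?",
        "output": "An emergency vehicle siren wails as traffic passes by.",
    },
]
```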

Won't this affect the test results?

As we mentioned in the paper, the captioning prompt leads to better performance because it encourages the model to say more about the sound, whereas the classification prompt typically elicits only a concise class name. LTU is an open-ended model and often answers with a synonym of the class name; in that case, it is hard to benchmark it fairly.
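To make the synonym issue concrete, here is a minimal sketch of embedding-based scoring, assuming the sentence-transformers library with the "all-MiniLM-L6-v2" encoder; the paper's actual automatic evaluation framework may use a different text encoder and pipeline:

```python
# Minimal sketch: credit an open-ended answer to the closest class name
# by text-embedding similarity instead of exact string match.
# Assumption: sentence-transformers with "all-MiniLM-L6-v2" stands in for
# whatever encoder the paper's evaluation framework actually uses.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Ground-truth class names for a clip, and an open-ended LTU-style answer
# that uses a synonym ("emergency vehicle alarm" instead of "siren").
label_names = ["siren", "dog bark", "speech"]
model_answer = "An emergency vehicle alarm wails in the distance."

label_emb = encoder.encode(label_names, convert_to_tensor=True)
answer_emb = encoder.encode(model_answer, convert_to_tensor=True)

# Cosine similarity between the free-form answer and each class name;
# the highest-scoring class is taken as the prediction.
scores = util.cos_sim(answer_emb, label_emb)[0]
predicted = label_names[int(scores.argmax())]
print(predicted)  # expected: "siren", despite no exact string overlap
```

A longer caption gives the encoder more content words to match against the label set, which is one reason the captioning prompt yields better text embeddings for this kind of automatic evaluation.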

-Yuan

peggyxpxu commented 5 months ago

I understand, thanks!