YuanGongND / ltu

Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".

Question about the Realism of Simulated Acoustic Event Combinations in Data Generation #5

Open haoxiangsnr opened 7 months ago

haoxiangsnr commented 7 months ago

Hi, @YuanGongND, thanks for the excellent work. I have carefully read through your paper and I am intrigued by the methodology you employed in generating simulation data. The approach of mixing multiple acoustic events to create audio scenarios and constructing both closed and open-ended answers based on the attributes of these events is particularly interesting.

I have a question regarding the realism of the acoustic event combinations. Have you considered the rationality of these combinations? For instance, a random combination might result in a mix that includes wind, coffee shop noise, ambulance sirens, and insect sounds. Such a combination is highly unlikely to occur in a real-world scenario.

Despite the rarity or even the implausibility of these combinations, ChatGPT will generate a variety of questions and provide reasoning-based answers based on these sounds, regardless of their likelihood. Have you considered the potential impact of this aspect on the system's performance or the validity of the generated data?

I am curious to know if there were any steps taken to address the possibility of generating unrealistic or improbable acoustic scenarios and how the system might differentiate or handle these cases.

Thank you for your attention to this matter.

YuanGongND commented 7 months ago

Hi there,

Thanks so much for your interest.

> I have a question regarding the realism of the acoustic event combinations. Have you considered the rationality of these combinations? For instance, a random combination might result in a mix that includes wind, coffee shop noise, ambulance sirens, and insect sounds. Such a combination is highly unlikely to occur in a real-world scenario.

I think there is one misunderstanding here: we only use real datasets such as AudioSet (which is drawn from YouTube videos). So the combinations of audio events (and, in LTU-AS, their pairing with the spoken text) are real, not synthesized.

> Despite the rarity or even the implausibility of these combinations, ChatGPT will generate a variety of questions and provide reasoning-based answers based on these sounds, regardless of their likelihood. Have you considered the potential impact of this aspect on the system's performance or the validity of the generated data?

GPT will find a reasonable explanation for the combination; if it is really rare, I think it will say so. But again, all the data we use are real.

-Yuan

YuanGongND commented 7 months ago

By the way, the combination you mentioned is also referred to as "co-occurrence" in some literature; see https://ieeexplore.ieee.org/abstract/document/9178483. Some research explicitly models such combinations.
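
For readers who want to check how often two events actually appear together, here is a minimal sketch of counting label co-occurrence over multi-label clips. It is not taken from the LTU code or from the linked paper. The function names `cooccurrence_counts` and `pair_count` are hypothetical, and the label sets are toy data made up for illustration, not real AudioSet annotations.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(clip_labels):
    """Count how often each unordered pair of labels appears in the same clip."""
    counts = Counter()
    for labels in clip_labels:
        # sorted(set(...)) drops duplicate labels and gives each pair a fixed order
        for pair in combinations(sorted(set(labels)), 2):
            counts[pair] += 1
    return counts


def pair_count(counts, a, b):
    """Look up the count for a pair regardless of argument order."""
    return counts[tuple(sorted((a, b)))]


# Toy label sets (assumed for illustration only)
clips = [
    ["Wind", "Insect"],
    ["Siren", "Traffic noise"],
    ["Wind", "Insect", "Bird"],
    ["Speech", "Siren"],
]

counts = cooccurrence_counts(clips)
print(pair_count(counts, "Insect", "Wind"))   # 2
print(pair_count(counts, "Siren", "Insect"))  # 0
```

A pair with a count of zero (or a very low count relative to how often each label occurs on its own) would be a candidate for a rare or implausible combination in that dataset.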