YuanGongND / ltu

Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".

Questions about data construction #1

Open zengxijuan opened 11 months ago

zengxijuan commented 11 months ago

Hello, thank you for your excellent work. I have a few questions about data construction:

  1. How is the proportion of QA pairs allocated across datasets? For example, for AudioSet, how do you decide which audio segments are used to generate classification data and which are used to generate acoustic-feature data?
  2. For closed-ended data, since the questions are generated by GPT, will there be repeated questions? Do you generate a pool of questions and randomly select from it, or do you call the API for each segment?
  3. When processing the datasets, how do you handle audio that overlaps between different datasets?

Looking forward to your reply, thank you!

YuanGongND commented 11 months ago

Hi there,

How is the proportion of QA pairs allocated across datasets? For example, for AudioSet, how do you decide which audio segments are used to generate classification data and which are used to generate acoustic-feature data?

Usually, we generate all possible QA pairs for each sample. For AudioSet, for example, almost every sample has one question about classification and one about acoustic features.
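To make "all possible QA pairs for each sample" concrete, here is a minimal sketch rather than the actual LTU pipeline. The field names (`audio_id`, `labels`, `feature_description`), the builder functions, and the question wording are all hypothetical; the only idea taken from the reply is that every task whose annotation is available contributes one pair for the sample.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical task builders: each turns one annotated sample into a QA pair,
# or returns None when the sample lacks the annotation that task needs.
def classification_qa(sample: Dict) -> Optional[Dict]:
    labels = sample.get("labels")
    if not labels:
        return None
    return {
        "task": "classification",
        "question": "What sound events can be heard?",  # illustrative wording
        "answer": "; ".join(labels),
    }

def acoustic_feature_qa(sample: Dict) -> Optional[Dict]:
    description = sample.get("feature_description")
    if not description:
        return None
    return {
        "task": "acoustic_feature",
        "question": "Describe the acoustic features of the audio.",  # illustrative
        "answer": description,
    }

TASK_BUILDERS: List[Callable[[Dict], Optional[Dict]]] = [
    classification_qa,
    acoustic_feature_qa,
]

def build_all_qa(sample: Dict) -> List[Dict]:
    """Return one QA pair per task that the sample's annotations support."""
    pairs = []
    for build in TASK_BUILDERS:
        qa = build(sample)
        if qa is not None:
            qa["audio_id"] = sample["audio_id"]
            pairs.append(qa)
    return pairs
```

Under these assumptions there is no per-sample allocation step: a sample with both annotation types yields two pairs, and a sample missing one annotation simply yields fewer.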

For closed-ended data, since the questions are generated by GPT, will there be repeated questions? Do you generate a pool of questions and randomly select from it, or do you call the API for each segment?

For closed-ended questions, yes, there are (many) repeated questions. Our closed-ended data is on the order of millions of samples, so it is neither feasible nor necessary to have a different question for each closed-ended sample. In practice, we paraphrase each closed-ended question ten to a hundred times using GPT.
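One way to use such a paraphrase pool is sketched below. The reply does not say how a paraphrase is assigned to each sample, so uniform random sampling is an assumption here, as are the `PARAPHRASES` table, the function names, and the example questions. In practice the pool would be produced once by GPT and stored, not hard-coded.

```python
import random
from typing import Dict, List

# Hypothetical pool: a handful of paraphrases per closed-ended task.
# The real pool would hold ten to a hundred GPT-generated variants per question.
PARAPHRASES: Dict[str, List[str]] = {
    "classification": [
        "What sound events can be heard in this clip?",
        "Which sounds are present in the audio?",
        "Identify the sound events in this recording.",
    ],
}

def pick_question(task: str, rng: random.Random) -> str:
    """Draw one paraphrase for the task (assumed: uniform random choice)."""
    return rng.choice(PARAPHRASES[task])

def attach_questions(samples: List[Dict], task: str, seed: int = 0) -> List[Dict]:
    """Copy each sample and attach a question sampled from the task's pool."""
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    return [{**sample, "question": pick_question(task, rng)} for sample in samples]
```

With a pool this small and millions of samples, most questions repeat many times, which matches the repetition described above. No API call is needed per segment.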

When processing the datasets, how do you handle audio that overlaps between different datasets?

There is some overlapping audio (not much), but it is not a big issue. When the same audio appears in different datasets, it usually carries different types of annotation, so we simply treat the copies as independent audio samples.
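Treating overlapping clips as independent samples amounts to concatenating the datasets without a deduplication pass. A minimal sketch, assuming hypothetical dataset names and record fields:

```python
from typing import Dict, List

def merge_datasets(datasets: Dict[str, List[Dict]]) -> List[Dict]:
    """Concatenate datasets, tagging each record with its source.

    No deduplication is done: a clip that appears in two datasets is kept
    twice, once per source, each copy with its own annotation.
    """
    merged = []
    for name, samples in datasets.items():
        for sample in samples:
            merged.append({**sample, "source": name})
    return merged
```

Here a clip shared by two sources ends up as two training records, one per annotation type.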

-Yuan