Evaluation dataset split for ClothoAQA

NVIDIA / audio-flamingo

PyTorch implementation of Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities.

MIT License

Evaluation dataset split for ClothoAQA #6

Closed jasonppy closed 2 months ago

jasonppy commented 2 months ago

In Table 2, ClothoAQA is divided into 3 subsets, but it's not clear to me how the non-binary and numerical splits are constructed, since different annotators can give different answers to the same question. Say the same question got 3 different answers: do you merge them into one QA, or treat them as 3 different QAs?

Thanks

zhifengkongnv commented 2 months ago

Non-binary means the question isn't a yes-no question. You can get those questions by looking at whether all the ground truth answers are in ['yes', 'no'].

Numerical means the question is asking for a number. You can get those questions by looking at whether all the ground truth answers are numbers.
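The two answer-based checks described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function names are hypothetical, "non-binary" is read as "none of the answers is yes or no", and "is a number" is assumed to mean a digit string, so number words such as "three" are not handled.

```python
from typing import Sequence

BINARY_ANSWERS = ("yes", "no")


def normalize(answer: str) -> str:
    """Lowercase and trim an answer so that 'Yes ' matches 'yes'."""
    return answer.strip().lower()


def is_binary(answers: Sequence[str]) -> bool:
    """True when every ground truth answer is 'yes' or 'no'."""
    return bool(answers) and all(normalize(a) in BINARY_ANSWERS for a in answers)


def is_non_binary(answers: Sequence[str]) -> bool:
    """True when no ground truth answer is 'yes' or 'no' (assumed reading)."""
    return bool(answers) and all(normalize(a) not in BINARY_ANSWERS for a in answers)


def is_numerical(answers: Sequence[str]) -> bool:
    """True when every ground truth answer is a digit string (assumption)."""
    return bool(answers) and all(normalize(a).isdigit() for a in answers)
```

A question whose answers mix "yes" with a free-form word is neither binary nor non-binary under this reading.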

jasonppy commented 2 months ago

Thanks! In ClothoAQA, each question is answered by multiple workers, and they might give different answers. For example, for the question "Is the area dry?", there might be 3 answers: "yes", "yes", "no". How was this handled in the evaluation? Do we treat them as three different QAs?

zhifengkongnv commented 2 months ago

The unanimous subset is where all three answers are the same (yes,yes,yes, or no,no,no). For the numerical and non-binary subsets, we call an answer correct if it hits any of the three answers.
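The rule above can be sketched as a unanimity filter plus an "any of the three answers" scorer. The function names and the case-insensitive exact-match comparison are assumptions for illustration; the thread does not say how predictions are normalized before matching.

```python
from typing import Sequence


def is_unanimous(answers: Sequence[str]) -> bool:
    """True when all annotators gave the same answer and it is 'yes' or 'no'."""
    normalized = {a.strip().lower() for a in answers}
    return len(normalized) == 1 and normalized <= {"yes", "no"}


def is_correct_any(prediction: str, answers: Sequence[str]) -> bool:
    """Count a prediction as correct if it matches any ground truth answer."""
    pred = prediction.strip().lower()
    return any(pred == a.strip().lower() for a in answers)


def accuracy(predictions: Sequence[str], references: Sequence[Sequence[str]]) -> float:
    """Fraction of predictions that hit at least one of their reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(is_correct_any(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)
```

Under this sketch, a question with answers "yes", "yes", "no" is excluded from the unanimous subset rather than expanded into three separate QAs.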

jasonppy commented 2 months ago

Thanks! For the numerical subset, is it possible to share the metadata or the dataset split script? There are some non-standard answers that might introduce ambiguities in the split. For example, the answer "twentyfive" might not get classified as a number, and it's unclear whether "once" should be counted as a number.

Currently, if we only count the questions where all answers can be parsed by word2number, there are 138 examples in the test set.

And just to make sure our other metadata are aligned: for unanimous, there are 1312 examples in the test set; for non-binary, there are 946 examples in the test set.

Thanks for your time!

zhifengkongnv commented 2 months ago

The numerical subset consists of the questions that start with "how many". There are 195 such questions in the test split of Clotho-AQA. There are 684 unanimous-yes questions and 425 unanimous-no questions. There are 932 non-binary questions.
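A rough sketch of a split along these lines is shown below. It assumes each item is a hypothetical `(question, answers)` pair with one answer per annotator. The thread does not say whether the numerical and non-binary subsets overlap, so this sketch checks the "how many" prefix independently of the answer-based checks.

```python
from typing import Dict, List, Sequence, Tuple

Item = Tuple[str, Sequence[str]]  # hypothetical layout: (question, answers)


def split_subsets(items: Sequence[Item]) -> Dict[str, List[str]]:
    """Group questions into the subsets named in the thread."""
    names = ("numerical", "unanimous_yes", "unanimous_no", "non_binary")
    subsets: Dict[str, List[str]] = {name: [] for name in names}
    for question, answers in items:
        norm = [a.strip().lower() for a in answers]
        # Numerical: decided by the question prefix, not by the answers.
        if question.strip().lower().startswith("how many"):
            subsets["numerical"].append(question)
        if not norm:
            continue
        # The answer-based subsets are mutually exclusive with each other.
        if all(a == "yes" for a in norm):
            subsets["unanimous_yes"].append(question)
        elif all(a == "no" for a in norm):
            subsets["unanimous_no"].append(question)
        elif all(a not in ("yes", "no") for a in norm):
            subsets["non_binary"].append(question)
    return subsets
```

With the counts quoted above, the two unanimous subsets together hold 684 + 425 = 1109 questions.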

jasonppy commented 2 months ago

Thanks!

I was able to get 195 numerical QAs following your comments, and 1109 unanimous yes and no questions. However, I got 945 non-binary questions rather than 932. I have manually checked all the QAs in the non-binary test set and they are indeed non-binary. Are there any criteria that you used in addition to requiring that all answers from the different workers be neither 'yes' nor 'no'?

zhifengkongnv commented 2 months ago

Use ps.stem to deal with typos.
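The comment does not say what `ps` is or where the stemming is applied. One plausible reading, offered only as an assumption, is that `ps` is an NLTK `PorterStemmer` instance and that each answer is stemmed before the yes/no check. The helper names below are hypothetical, and the sketch falls back to plain lowercasing if NLTK is not installed.

```python
from typing import Callable, Optional, Sequence


def make_normalizer(stem: Optional[Callable[[str], str]] = None) -> Callable[[str], str]:
    """Build an answer normalizer that trims, lowercases, then stems."""
    if stem is None:
        try:
            # Assumption: `ps` in the thread refers to NLTK's PorterStemmer.
            from nltk.stem import PorterStemmer
            stem = PorterStemmer().stem
        except ImportError:
            # Fallback when NLTK is unavailable: no stemming at all.
            stem = lambda word: word

    def normalize(answer: str) -> str:
        return stem(answer.strip().lower())

    return normalize


def is_non_binary(answers: Sequence[str], normalize: Callable[[str], str]) -> bool:
    """True when no normalized answer equals normalized 'yes' or 'no'."""
    binary = {normalize("yes"), normalize("no")}
    return bool(answers) and all(normalize(a) not in binary for a in answers)
```

Because the reference words "yes" and "no" pass through the same normalizer as the answers, any variant that stems to the same form is treated as binary, which could account for a slightly smaller non-binary count.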