Thank you very much for your interest in AQA-related work.
We are aware of the difficulty of reproducing the single-word answer prediction results reported in the Clotho-AQA paper [1]. Given the limited size of the Clotho-AQA dataset and its 828 candidate answers, it is indeed challenging to reach the top-1 accuracy of 54.2% reported in [1].
Since the audio in Clotho-AQA is recorded in the real world, we aim to explore sound scene understanding based on natural sounds. We therefore examined the original Clotho-AQA annotation files and found that the officially released annotations were not cleaned, so different annotators often gave different answers to the same question. We applied a simple filtering rule: a question is considered to have a correct answer only if at least two of its annotators gave identical answers; all other cases are discarded, and "yes"/"no" questions are excluded. Using the resulting, more accurate annotation file, we re-implemented the AqualNet method from [1] and obtained a top-1 accuracy of 14.78%.
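For reference, here is a minimal sketch of that filtering rule. It assumes Clotho-AQA-style annotation CSVs with columns `file_name`, `QuestionText`, and `answer` (multiple annotator answers per audio-question pair); the column names and file paths here are assumptions, so please adapt them to the actual files in the repo.

```python
# Minimal sketch of the filtering rule described above.
# Assumed columns: file_name, QuestionText, answer (one row per annotator answer).
from collections import Counter

import pandas as pd

df = pd.read_csv("clotho_aqa_train.csv")  # hypothetical path, adjust as needed

# Exclude binary answers first; "yes"/"no" questions are handled separately.
df = df[~df["answer"].str.lower().isin(["yes", "no"])]

filtered_rows = []
for (fname, question), group in df.groupby(["file_name", "QuestionText"]):
    counts = Counter(group["answer"].str.lower())
    answer, freq = counts.most_common(1)[0]
    # Keep the question only if at least two annotators gave the same answer.
    if freq >= 2:
        filtered_rows.append(
            {"file_name": fname, "QuestionText": question, "answer": answer}
        )

pd.DataFrame(filtered_rows).to_csv("clotho_aqa_train_filtered.csv", index=False)
```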
Additionally, we have uploaded the filtered annotation file to the GitHub repo. If you have any questions, please feel free to contact us by email.
Your reply perfectly solved my problem, thank you very much!
Thank you very much for your outstanding contribution to the open-source community. However, I noticed that your evaluation metric differs from the paper that proposed the Clotho-AQA dataset, namely "Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering". They claim an accuracy above 0.6, and I am not sure whether this is because that paper mixes the binary "yes"/"no" labels with the other multi-class labels. I hope the authors can explain this further. Thanks again for your contribution!