This repository provides the PyTorch code accompanying our Interspeech 2023 paper:
Multi-Scale Attention for Audio Question Answering [arXiv]
Guangyao Li, Yixin Xu and Di Hu
Python 3.6+
PyTorch 1.6.0
tensorboardX
ffmpeg
Clone this repo
git clone https://github.com/GeWu-Lab/MWAFM.git
Download data
Data pre-processing
We follow exactly the same data format as MUSIC-AVQA.
Notice: We examined the original Clotho-AQA annotation files and found that the official open-source annotations were not cleansed, so different annotators sometimes gave different answers to the same question. We therefore applied a simple filtering step: a question is considered to have a correct answer only if at least two annotators gave identical answers. Based on this filtering, we obtained a new and more accurate annotation file. The files in the 'metadata' folder are described as follows:
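The majority-vote filtering described above can be sketched as follows. This is a minimal illustration, not the exact script used to build the 'metadata' files; the keys `file_name`, `question`, and `answer` are assumed field names for the Clotho-AQA annotation rows.

```python
from collections import Counter

def filter_majority(rows):
    """Keep (audio, question) pairs whose most common answer was given
    by at least two annotators; that answer becomes the single label.
    Each row is a dict with 'file_name', 'question', 'answer' keys
    (assumed field names, for illustration only)."""
    groups = {}
    for r in rows:
        key = (r["file_name"], r["question"])
        groups.setdefault(key, []).append(r["answer"])
    kept = []
    for (fname, question), answers in groups.items():
        answer, count = Counter(answers).most_common(1)[0]
        if count >= 2:  # at least two annotators agree
            kept.append({"file_name": fname,
                         "question": question,
                         "answer": answer})
    return kept
```

Questions where all annotators disagree are dropped entirely, which is why the cleaned annotation file is smaller than the official one.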
Train and evaluate
Training
python main_MWAFM.py --mode train
Testing
python main_MWAFM.py --mode test
If you find this work useful, please consider citing our paper:
@inproceedings{Li2023MultiScale,
  title     = {Multi-Scale Attention for Audio Question Answering},
  author    = {Li, Guangyao and Xu, Yixin and Hu, Di},
  booktitle = {Proc. INTERSPEECH},
  year      = {2023},
}
This research was supported by Public Computing Cloud, Renmin University of China.