GeWu-Lab / MWAFM

Multi-Scale Attention for Audio Question Answering

Audio Question Answering (AQA)

PyTorch code accompanying our Interspeech 2023 paper:

Multi-Scale Attention for Audio Question Answering [arXiv]

Guangyao Li, Yixin Xu and Di Hu


Requirements

Python 3.6+
PyTorch 1.6.0
tensorboardX
ffmpeg

Usage

  1. Clone this repo

    git clone https://github.com/GeWu-Lab/MWAFM.git
  2. Download data

    Clotho-AQA and AQA-MUSIC-AVQA

  3. Data pre-processing

    We follow exactly the same data format and settings as MUSIC-AVQA.

    Notice: We examined the original annotation files of Clotho-AQA and found that the official open-source annotations were not cleansed, so different annotators sometimes gave different answers to the same question. We therefore applied a simple filtering step: a question is considered to have a correct answer only if at least two annotators gave identical answers. This filtering yields a new, more accurate annotation file (a minimal sketch of the idea is given after the file list below). The files in the 'metadata' folder are described as follows:

    • 'singleword_[train/val/test].csv': does not contain samples with yes/no answers.
    • 'singleword_[train/val/test]_clean.csv': does not contain samples with yes/no answers (cleaned data).
    • 'clothoaqa_[train/val/test]_clean.csv': contains samples with yes/no answers (cleaned data).
    • 'binary_[train/val/test]_clean.csv': contains only samples with yes/no answers (cleaned data).
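
    The released '_clean' files above are the result of that filtering; the exact cleaning script is not included in this README. As a rough illustration only, assuming a pandas-readable annotation CSV with hypothetical columns 'file_name', 'QuestionText', and 'answer' (one row per annotator answer), the "keep an answer only if at least two annotators agree" rule could be sketched as:

    ```python
    import pandas as pd

    # Sketch only (not the official cleaning script). Column names are assumptions.
    df = pd.read_csv("clotho_aqa_train.csv")

    rows = []
    for (clip, question), group in df.groupby(["file_name", "QuestionText"]):
        counts = group["answer"].str.lower().value_counts()
        top_answer, top_count = counts.index[0], int(counts.iloc[0])
        if top_count >= 2:  # at least two identical annotator answers -> keep as the correct one
            rows.append({"file_name": clip, "QuestionText": question, "answer": top_answer})

    pd.DataFrame(rows).to_csv("clotho_aqa_train_clean.csv", index=False)
    ```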
  4. Train and evaluate

    Training

    python main_MWAFM.py --mode train

    Testing

    python main_MWAFM.py --mode test
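
    Both commands drive the same entry script through the --mode flag. As a minimal illustration (not the actual parser in main_MWAFM.py, which likely exposes more options), the train/test switch could be wired up like this:

    ```python
    import argparse

    # Illustrative sketch of a --mode train/test switch; the real main_MWAFM.py may differ.
    def main():
        parser = argparse.ArgumentParser(description="MWAFM for Audio Question Answering")
        parser.add_argument("--mode", choices=["train", "test"], default="train",
                            help="whether to train the model or evaluate on the test split")
        args = parser.parse_args()

        if args.mode == "train":
            print("train mode: build dataloaders, model, optimizer, then run the training loop")
        else:
            print("test mode: load a checkpoint and report answer accuracy")

    if __name__ == "__main__":
        main()
    ```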

Citation

If you find this work useful, please consider citing:


@inproceedings{Li2023MultiScale,
  title     = {Multi-Scale Attention for Audio Question Answering},
  author    = {Li, Guangyao and Xu, Yixin and Hu, Di},
  booktitle = {Proc. INTERSPEECH},
  year      = {2023},
}

Acknowledgement

This research was supported by Public Computing Cloud, Renmin University of China.