Motion-Appearance Synergistic Networks for VideoQA (MASN)

PyTorch implementation for the paper:

Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering
Ahjeong Seo, Gi-Cheon Kang, Joonhan Park, and Byoung-Tak Zhang
In ACL 2021

Requirements

Python 3.7, PyTorch 1.2.0
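
A quick way to check that your environment matches these versions (a minimal, optional sketch; not part of the repository):

    import sys
    import torch

    # The repository targets Python 3.7 and PyTorch 1.2.0.
    print("python", sys.version.split()[0], "| torch", torch.__version__)
    assert sys.version_info[:2] == (3, 7), "expected Python 3.7"
    assert torch.__version__.startswith("1.2"), "expected PyTorch 1.2.x"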

Dataset

Extract Features

  1. Appearance Features

    • For local features, we use a Faster R-CNN pre-trained on Visual Genome. Please see this Link.
    • After extracting object features with Faster R-CNN, you can convert them to an HDF5 file with a simple run: python adaptive_detection_features_converter.py
    • For global features, we use the ResNet152 provided by torchvision (a rough sketch follows this list). Please see this Link.
  2. Motion Features

    • For local features, we apply RoIAlign to the bounding boxes obtained from Faster R-CNN. Please see this Link.
    • For global features, we use I3D pre-trained on Kinetics. Please see this Link.
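
As a rough illustration of the two feature types above, the sketch below extracts per-frame global appearance features with torchvision's ResNet152 and pools region features from a feature map with torchvision.ops.roi_align. The frames, feature map, boxes, and sizes are placeholders; the actual extraction scripts used for this repository may differ.

    import torch
    import torchvision
    from torchvision.ops import roi_align

    # Global appearance features: ResNet152 with the classification head removed.
    resnet = torchvision.models.resnet152(pretrained=True)
    resnet = torch.nn.Sequential(*list(resnet.children())[:-1])  # keep everything up to the global avg-pool
    resnet.eval()

    frames = torch.randn(8, 3, 224, 224)        # placeholder: 8 sampled, preprocessed frames
    with torch.no_grad():
        global_app = resnet(frames).flatten(1)  # (8, 2048) per-frame appearance vectors

    # Local features: RoIAlign over a (placeholder) backbone feature map.
    # In the repository's pipeline, the feature map would come from the I3D backbone
    # and the boxes from the Faster R-CNN detector.
    feat_map = torch.randn(8, 1024, 14, 14)                                  # one feature map per frame
    boxes = [torch.tensor([[10.0, 20.0, 120.0, 200.0]]) for _ in range(8)]   # one box per frame, image coords
    local_feat = roi_align(feat_map, boxes, output_size=(7, 7), spatial_scale=14.0 / 224.0)
    print(global_app.shape, local_feat.shape)   # (8, 2048) and (8, 1024, 7, 7)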

We uploaded our extracted features:

  1) TGIF-QA
  2) MSRVTT-QA
  3) MSVD-QA
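
Once downloaded, the HDF5 feature files can be inspected as below. The file name and dataset keys here are hypothetical; the actual layout is determined by adaptive_detection_features_converter.py.

    import h5py

    # Hypothetical file name; replace it with the path of a downloaded feature file.
    with h5py.File("tgif_appearance_features.h5", "r") as f:
        # Print every group/dataset name to discover the actual layout.
        f.visit(print)
        # Typical access pattern once the key names are known (the key below is a placeholder):
        # feats = f["<video_id>"][()]   # numpy array of per-frame / per-object features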

Training

Simple run

CUDA_VISIBLE_DEVICES=0 python main.py --task Count --batch_size 32

For MSRVTT-QA, run

CUDA_VISIBLE_DEVICES=0 python main_msrvtt.py --task MS-QA --batch_size 32

For MSVD-QA, run

CUDA_VISIBLE_DEVICES=0 python main_msvd.py --task MS-QA --batch_size 32

Saving model checkpoints

By default, the model saves a checkpoint at every epoch. You can change the directory where checkpoints are saved with the --save_path option. By default, each checkpoint is named '[TASK]_[PERFORMANCE].pth'.
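
For reference, a checkpoint saved this way can usually be restored as sketched below. The file name is a hypothetical example following the '[TASK]_[PERFORMANCE].pth' pattern, and the exact contents of the checkpoint depend on how main.py saves it.

    import torch

    # Hypothetical name following the '[TASK]_[PERFORMANCE].pth' pattern.
    ckpt = torch.load("Count_3.75.pth", map_location="cpu")

    # Depending on how the training script saves the model, this is either a raw
    # state_dict or a dict that wraps one; adjust to the actual keys.
    state_dict = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt
    # model.load_state_dict(state_dict)   # 'model' is the MASN model built by the training code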

Evaluation & Results

CUDA_VISIBLE_DEVICES=0 python main.py --test --checkpoint [NAME] --task Count --batch_size 32

Performance on the TGIF-QA dataset (Count is mean squared error, lower is better; Action, Trans., and FrameQA are accuracy in %):

Model     Count   Action   Trans.   FrameQA
MASN      3.75    84.4     87.4     59.5

You can download our pre-trained models via these links: Count, Action, Trans., FrameQA

Performance on the MSRVTT-QA and MSVD-QA datasets (accuracy in %):

Model     MSRVTT-QA   MSVD-QA
MASN      35.2        38.0

Citation

If you find this repository helpful for your research, please cite the following paper:

@inproceedings{seo-etal-2021-attend,
    title = "Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering",
    author = "Seo, Ahjeong  and
      Kang, Gi-Cheon  and
      Park, Joonhan  and
      Zhang, Byoung-Tak",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.481",
    doi = "10.18653/v1/2021.acl-long.481",
    pages = "6167--6177",
    abstract = "Video Question Answering is a task which requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understand the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question{'}s intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes the action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes each output of the motion module and the appearance module as input, and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN.",
}

License

MIT License

Acknowledgements

This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (2015-0-00310-SW.StarLab/25%, 2017-0-01772-VTT/25%, 2018-0-00622-RMI/25%, 2019-0-01371-BabyMind/25%) grant funded by the Korean government.