KaihuaTang / VQA2.0-Recent-Approachs-2018.pytorch

A PyTorch reimplementation of "Bilinear Attention Network", "Intra- and Inter-modality Attention", "Learning Conditioned Graph Structures", "Learning to Count Objects", and "Bottom-up Top-down" for Visual Question Answering 2.0
GNU General Public License v3.0

Several Recent Approaches (2018) on VQA v2

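This project is based on Cyanogenoid/vqa-counting. Most current VQA2.0 projects are based on https://github.com/hengyuan-hu/bottom-up-attention-vqa, but I personally prefer Cyanogenoid's framework because it is very clean and clear. I therefore reimplemented several recent approaches on top of it: Bilinear Attention Network, Intra- and Inter-modality Attention, Learning Conditioned Graph Structures, Learning to Count Objects, and Bottom-up Top-down.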

One of the benefits of this framework is that you can easily add the counting module to your own model. The module has been shown to be effective at improving results on counting questions without harming the performance of the rest of your model.

If my open source projects have inspired you, a sponsorship would be a great help to my subsequent open source work. ❤️🙏

Dependencies

Prepare dataset (follow Cyanogenoid/vqa-counting)

How to Train

All the models are named XXX_model.py, and most of the parameters are in config.py. To change the model, simply change model_type in config.py. Then train your model with:

python train.py [optional-name]
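The XXX_model.py naming convention suggests a simple mapping from model_type to a module name. The snippet below is only a sketch of how such a lookup could work. The helper resolve_model_module is hypothetical, and it is an assumption that each model_type value corresponds directly to a file prefix; the actual train.py may select models differently.

```python
# Hypothetical helper, not part of the repository: maps a model_type string
# to a module name that follows the XXX_model.py naming convention.
# The model_type values are the ones listed under "Model Details".
KNOWN_MODEL_TYPES = ("counting", "baseline", "ban", "inter_intra", "graph")


def resolve_model_module(model_type: str) -> str:
    """Return the assumed module name for the given model_type."""
    if model_type not in KNOWN_MODEL_TYPES:
        raise ValueError(
            f"unknown model_type {model_type!r}; expected one of {KNOWN_MODEL_TYPES}"
        )
    # Assumption: the file prefix equals the model_type value.
    return f"{model_type}_model"


if __name__ == "__main__":
    print(resolve_model_module("ban"))
```

Validating model_type up front like this gives a clear error message for a typo in config.py instead of a failure later during model construction.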

Training on the whole trainval split is also supported. It generates a result.json file that you can upload to the VQA2.0 online evaluation server.

Model Details

Note that I did not implement the tf-idf embedding of the BAN model; only the GloVe embedding is provided. The current model nevertheless reaches competitive, almost identical, performance without tf-idf. As for Intra- and Inter-modality Attention, although I implemented all the details provided in the paper, the results still seem lower than those the paper reports, even after I discussed the implementation with the author and made some modifications.

To Train Counting Model

Set the following parameter in config.py:

model_type = 'counting'

To Train Bottom-up Top-down

model_type = 'baseline' 

To Train Bilinear attention network

model_type = 'ban' 

Note that BAN is very memory-consuming, so please make sure you have enough GPUs and run main.py with CUDA_VISIBLE_DEVICES=0,1,2,3

To Train Intra- and Inter-modality Attention

model_type = 'inter_intra' 

You may also need to adjust the learning rate schedule through gradual_warmup_steps and lr_decay_epochs in config.py.
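To make the effect of these two settings concrete, here is a minimal sketch of a warmup-then-decay schedule. It rests on assumptions that should be checked against config.py and train.py: that gradual_warmup_steps is a list of per-epoch multipliers of the base learning rate, and that lr_decay_epochs lists the epochs at which the rate is multiplied by a decay factor. The names learning_rate_for_epoch, base_lr, and lr_decay_rate are hypothetical and used only for illustration.

```python
def learning_rate_for_epoch(epoch, base_lr, gradual_warmup_steps,
                            lr_decay_epochs, lr_decay_rate):
    """Sketch of a warmup-then-step-decay schedule (assumed semantics).

    gradual_warmup_steps: multipliers of base_lr, one per warmup epoch.
    lr_decay_epochs: epochs at which the rate is multiplied by lr_decay_rate.
    """
    # During warmup, use the multiplier assigned to this epoch.
    if epoch < len(gradual_warmup_steps):
        return base_lr * gradual_warmup_steps[epoch]
    # After warmup, start from the last warmup value (or base_lr if no warmup).
    peak = base_lr * gradual_warmup_steps[-1] if gradual_warmup_steps else base_lr
    # Apply one decay step for every decay epoch already reached.
    num_decays = sum(1 for e in lr_decay_epochs if e <= epoch)
    return peak * (lr_decay_rate ** num_decays)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the repository.
    for epoch in range(14):
        lr = learning_rate_for_epoch(epoch, 1e-3, [0.5, 1.0, 1.5, 2.0],
                                     [10, 12], 0.25)
        print(epoch, lr)
```

Under these assumptions, lengthening gradual_warmup_steps delays the point at which the peak rate is reached, while moving entries of lr_decay_epochs earlier makes the rate drop sooner.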

To Train Learning Conditioned Graph Structures

model_type = 'graph' 

This method seems less competitive, though.

Looking for previous methods to compare in your experiments

Please refer to my CVPR 2019 oral paper:

@inproceedings{tang2018learning,
  title={Learning to Compose Dynamic Tree Structures for Visual Contexts},
  author={Tang, Kaihua and Zhang, Hanwang and Wu, Baoyuan and Luo, Wenhan and Liu, Wei},
  booktitle= "Conference on Computer Vision and Pattern Recognition",
  year={2019}
}