This repository is the implementation of our ACL 2021 paper "Check It Again: Progressive Visual Question Answering via Visual Entailment" (SAR). The code is modified from the SSL codebase for SAR+SSL and from the LMH codebase for SAR+LMH, many thanks!
Download and preprocess the data:
cd data
bash download.sh
python preprocess_image.py --data trainval
python create_dictionary.py --dataroot vqacp2/
python preprocess_text.py --dataroot vqacp2/ --version v2
cd ..
The VQA model used as the Candidate Answer Selector (CAS) is a free choice in our framework; in this paper we mainly use SSL as CAS. The training settings for CAS can be found in the SSL repository.
To build the dataset for the Answer Re-ranking module based on Visual Entailment, we modified SSL's VQAFeatureDataset() in dataset_vqacp.py and evaluate() in train.py. The modified code is available in CAS_scripts; just replace the corresponding class/function in SSL.
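For intuition, the modified evaluate() essentially dumps, for each question, the top-20 answers ranked by the CAS scores. Below is a minimal sketch of that idea; the batch unpacking and the output field names are assumptions for illustration, and the authoritative version is in CAS_scripts.

import json
import torch

@torch.no_grad()
def dump_topk_candidates(model, dataloader, label2ans, out_path, topk=20):
    """Save the top-k answers (and their CAS scores) for every question.
    Schematic only: adapt the batch unpacking to the actual SSL dataloader."""
    model.eval()
    results = []
    for batch in dataloader:
        *inputs, qids = batch                      # assume question ids come last
        logits = model(*[x.cuda() for x in inputs])
        scores, idx = torch.softmax(logits, dim=-1).topk(topk, dim=-1)
        for qid, s, i in zip(qids.tolist(), scores.tolist(), idx.tolist()):
            results.append({"question_id": qid,
                            "top20_answers": [label2ans[j] for j in i],
                            "top20_scores": s})
    with open(out_path, "w") as f:
        json.dump(results, f)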
After the Candidate Answer Selecting module, we obtain train_top20_candidates.json and test_top20_candidates.json, which serve as the training and test sets for the Answer Re-ranking module, respectively. Demos of the two output json files are provided in the data4VE folder: train_dataset4VE_demo.json and test_dataset4VE_demo.json.
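The demo files define the exact schema of these outputs; a quick way to inspect it is shown below. The commented record is purely illustrative (field names and values are guesses, not the actual schema).

import json

# Inspect the demo file to see the exact schema expected by the
# Answer Re-ranking module.
with open("data4VE/train_dataset4VE_demo.json") as f:
    demo = json.load(f)
print(demo[0] if isinstance(demo, list) else next(iter(demo.items())))

# Roughly, each record ties a question to its top-20 CAS candidates, e.g.:
# {"question_id": 262148000,
#  "question_text": "what color is the fire hydrant",
#  "top20_answers": ["yellow", "red", ...],
#  "top20_scores": [0.41, 0.22, ...]}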
If you don't want to train a CAS model (e.g. SSL) to build the datasets as described above, you can download the rebuilt top20-candidate-answers dataset (with different Question-Answer Combination strategies) from here (C-train, C-test, R-train, R-test).
Put the downloaded files into the data4VE folder; the code will then load and rebuild them into the entries fed to __getitem__() of the dataloader, directly skipping all data preprocessing steps of the Answer Re-ranking based on Visual Entailment. Each entry contains image_features, image_spatials, top20_score, question_id, QA_text_ids, top20_label, answer_type, question_text and LMH_bias, where QA_text_ids is the question-answer-combination (R/C) ids obtained/preprocessed with the LXMERT tokenizer (see the sketch below).
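For reference, QA_text_ids can be produced with the LXMERT tokenizer from HuggingFace Transformers. The sketch below assumes a plain question+answer concatenation in the spirit of strategy C; the actual R/C combination text is built in SAR_replace_dataset_vqacp.py / SAR_concatenate_dataset_vqacp.py, and the max_length value here is only an example.

from transformers import LxmertTokenizer

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")

def encode_qa_combination(question, answer, max_length=18):
    """Turn one question + candidate answer into token ids for the VE re-ranker.
    Illustrative stand-in for the combination logic in the SAR_*_dataset scripts."""
    text = f"{question} {answer}"              # strategy-C-style combination
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=max_length)
    return enc["input_ids"]

# One question is paired with each of its top candidate answers:
ids = [encode_qa_combination("what color is the fire hydrant", a)
       for a in ["yellow", "red", "white"]]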
Train the SAR model (--lp 0) with top-12 or top-20 candidate answers:
CUDA_VISIBLE_DEVICES=0,1 python SAR_main.py --output saved_models_cp2/ --lp 0 --train_condi_ans_num 12
CUDA_VISIBLE_DEVICES=0,1 python SAR_main.py --output saved_models_cp2/ --lp 0 --train_condi_ans_num 20
Train the SAR+SSL model (--lp 1):
CUDA_VISIBLE_DEVICES=0,1 python SAR_main.py --output saved_models_cp2/ --lp 1 --self_loss_weight 3 --train_condi_ans_num 12
CUDA_VISIBLE_DEVICES=0,1 python SAR_main.py --output saved_models_cp2/ --lp 1 --self_loss_weight 3 --train_condi_ans_num 20
Train the SAR+LMH model (--lp 2):
CUDA_VISIBLE_DEVICES=0,1 python SAR_main.py --output saved_models_cp2/ --lp 2 --train_condi_ans_num 12
CUDA_VISIBLE_DEVICES=0,1 python SAR_main.py --output saved_models_cp2/ --lp 2 --train_condi_ans_num 20
The function evaluate() in SAR_train.py is used to select the best model during training; it does not involve the QTD module yet. The trained QTD model is used in SAR_test.py, where we obtain the final test score.
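For context on the --QTD_N4yesno / --QTD_N4non_yesno flags below: the QTD module classifies each test question as yes/no or non-yes/no and decides how many of the top candidates are passed to the re-ranker. A minimal sketch of that selection step, assuming a qtd_model callable that returns the predicted question type (the real logic lives in SAR_test.py):

def candidates_to_rerank(question, top20_answers, qtd_model,
                         n4yesno=1, n4non_yesno=12):
    """Keep only the top-N candidate answers per question, where N depends on
    the predicted question type (mirrors --QTD_N4yesno / --QTD_N4non_yesno).
    qtd_model is assumed to return 'yes/no' or 'other' for a question."""
    n = n4yesno if qtd_model(question) == "yes/no" else n4non_yesno
    return top20_answers[:n]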
CUDA_VISIBLE_DEVICES=0 python SAR_test.py --checkpoint_path4test saved_models_cp2/SAR_top12_best_model.pth --output saved_models_cp2/result/ --lp 0 --QTD_N4yesno 1 --QTD_N4non_yesno 12
CUDA_VISIBLE_DEVICES=0 python SAR_test.py --checkpoint_path4test saved_models_cp2/SAR_SSL_top12_best_model.pth --output saved_models_cp2/result/ --lp 1 --QTD_N4yesno 1 --QTD_N4non_yesno 12
CUDA_VISIBLE_DEVICES=0 python SAR_test.py --checkpoint_path4test saved_models_cp2/SAR_LMH_top12_best_model.pth --output saved_models_cp2/result/ --lp 2 --QTD_N4yesno 2 --QTD_N4non_yesno 12
We adopt the R->C Question-Answer Combination strategy, which always achieves or rivals the best performance on SAR/SAR+SSL/SAR+LMH. Specifically, we first use strategy R (SAR_replace_dataset_vqacp.py) at training time, which prevents the model from excessively focusing on the co-occurrence relation between question category and answer, and then use strategy C (SAR_concatenate_dataset_vqacp.py) at test time to introduce more information for inference.
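A rough illustration of the two strategies is sketched below; the question-category prefix handling is simplified for intuition only, and the authoritative implementations are SAR_replace_dataset_vqacp.py and SAR_concatenate_dataset_vqacp.py.

def combine_qa(question, answer, strategy, category_prefix=None):
    """Build the synthetic statement fed to the visual-entailment re-ranker.
    R: replace the question-category prefix with the candidate answer, so the
       statement no longer exposes the question category.
    C: simply concatenate question and answer."""
    if strategy == "R" and category_prefix and question.startswith(category_prefix):
        return question.replace(category_prefix, answer, 1)
    return f"{question} {answer}"

# Training uses R, testing uses C (the R->C setting):
print(combine_qa("what color is the fire hydrant", "yellow",
                 "R", category_prefix="what color is"))   # -> "yellow the fire hydrant"
print(combine_qa("what color is the fire hydrant", "yellow", "C"))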
Compute the final score:
python comput_score.py --input saved_models_cp2/result/XX.json --dataroot data/vqacp2/cache
If you have any questions related to the code or the paper, feel free to email Qingyi (siqingyi@iie.ac.cn). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to describe the problem in detail so we can help you better and more quickly!
If you find this code useful, please cite the following paper:
@inproceedings{si-etal-2021-check,
title = "Check It Again: Progressive Visual Question Answering via Visual Entailment",
author = "Si, Qingyi and
Lin, Zheng and
Zheng, Mingyu and
Fu, Peng and
Wang, Weiping",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.317",
doi = "10.18653/v1/2021.acl-long.317",
pages = "4101--4110",
abstract = "While sophisticated neural-based models have achieved remarkable success in Visual Question Answering (VQA), these models tend to answer questions only according to superficial correlations between question and answer. Several recent approaches have been developed to address this language priors problem. However, most of them predict the correct answer according to one best output without checking the authenticity of answers. Besides, they only explore the interaction between image and question, ignoring the semantics of candidate answers. In this paper, we propose a select-and-rerank (SAR) progressive framework based on Visual Entailment. Specifically, we first select the candidate answers relevant to the question or the image, then we rerank the candidate answers by a visual entailment task, which verifies whether the image semantically entails the synthetic statement of the question and each candidate answer. Experimental results show the effectiveness of our proposed framework, which establishes a new state-of-the-art accuracy on VQA-CP v2 with a 7.55{\%} improvement.",
}