This paper obtains improved visual question answering by using gradient-based certainty attention regions.
The proposed method yields improved uncertainty estimates that are correspondingly more certain or uncertain, show consistent correlation with mis-classification and are focused quantitatively on better attention regions as compared to other states of the art methods.
The proposed architecture can be easily incorporated in various existing VQA methods.
It could be used as a general means for obtaining improved uncertainty and explanation regions for various vision and language tasks.
They obtain image feature and question feature using CNN and LSTM, respectively. Then they obtain attention mask using these features, and classification of the answer is done based on the attended feature.
paper
code coming soon
This paper obtains improved visual question answering by using gradient-based certainty attention regions.
The proposed method yields improved uncertainty estimates that are correspondingly more certain or uncertain, show consistent correlation with mis-classification and are focused quantitatively on better attention regions as compared to other states of the art methods.
The proposed architecture can be easily incorporated in various existing VQA methods.
It could be used as a general means for obtaining improved uncertainty and explanation regions for various vision and language tasks.