DeepRNN / visual_question_answering

TensorFlow implementation of "Dynamic Memory Networks for Visual and Textual Question Answering"
MIT License

Introduction

This neural system for visual question answering is roughly based on the paper "Dynamic Memory Networks for Visual and Textual Question Answering" by Xiong et al. (ICML 2016). The input is an image and a question about that image, and the output is a one-word answer. A convolutional neural network extracts visual features from the image, and a bidirectional GRU recurrent neural network fuses the features of neighboring image regions. The question is encoded either with a GRU recurrent neural network or with a positional encoding scheme. A dynamic memory network with an attention mechanism then combines the visual and textual information to generate the answer. The project is implemented with the TensorFlow library and supports end-to-end training of both the CNN and RNN parts.
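To make the data flow concrete, below is a minimal sketch of such a pipeline written with the TensorFlow Keras API. It is not the repository's actual code: the layer sizes, tensor names, and the `build_vqa_sketch` function are illustrative assumptions, and the paper's attention-based GRU inside the episodic memory is simplified here to plain soft attention followed by a dense memory update.

```python
# A minimal, hypothetical sketch of the DMN-style VQA data flow.
# Dimensions (e.g. 196 regions from a 14x14 CNN feature map) are assumptions.
import tensorflow as tf

def build_vqa_sketch(num_regions=196, feat_dim=512, vocab_size=10000,
                     embed_dim=300, hidden_dim=512, num_answers=1000,
                     max_question_len=30, num_memory_hops=3):
    # Image input: precomputed CNN region features (grid flattened to num_regions).
    img_feats = tf.keras.Input(shape=(num_regions, feat_dim), name="image_features")
    # Question input: integer word ids, padded to a fixed length.
    question = tf.keras.Input(shape=(max_question_len,), dtype="int32", name="question")

    # Input module: project region features, then fuse neighboring regions
    # with a bidirectional GRU.
    proj = tf.keras.layers.Dense(hidden_dim, activation="tanh")(img_feats)
    facts = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(hidden_dim, return_sequences=True),
        merge_mode="sum")(proj)                        # (batch, regions, hidden)

    # Question module: GRU encoder over word embeddings
    # (the project alternatively supports a positional encoding scheme).
    embed = tf.keras.layers.Embedding(vocab_size, embed_dim)(question)
    q = tf.keras.layers.GRU(hidden_dim)(embed)         # (batch, hidden)

    # Episodic memory: repeated soft attention over the facts, conditioned on
    # the question and the previous memory, followed by a memory update.
    memory = q
    q_exp = tf.expand_dims(q, 1)                        # (batch, 1, hidden)
    for _ in range(num_memory_hops):
        m_exp = tf.expand_dims(memory, 1)
        # Interaction features between each fact, the question, and the memory.
        z = tf.concat([facts * q_exp, facts * m_exp], axis=-1)
        scores = tf.keras.layers.Dense(1)(
            tf.keras.layers.Dense(hidden_dim, activation="tanh")(z))
        attn = tf.nn.softmax(scores, axis=1)            # (batch, regions, 1)
        context = tf.reduce_sum(attn * facts, axis=1)   # (batch, hidden)
        memory = tf.keras.layers.Dense(hidden_dim, activation="relu")(
            tf.concat([memory, context, q], axis=-1))

    # Answer module: classify over a fixed answer vocabulary (one-word answers).
    logits = tf.keras.layers.Dense(num_answers)(tf.concat([memory, q], axis=-1))
    return tf.keras.Model(inputs=[img_feats, question], outputs=logits)
```

A model built this way would be trained with a softmax cross-entropy loss over the answer vocabulary, which matches the one-word-answer formulation described above.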

Prerequisites

Usage

Results

A pretrained model with the default configuration can be downloaded here. This model was trained solely on the VQA v1 training data and achieves an accuracy of 60.35% on the VQA v1 validation data. Below are some successful examples.

References

Caiming Xiong, Stephen Merity, Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.