Closed AndreiBarsan closed 7 years ago
This is more or less in place; the current baseline is even easier: it uses pre-computed, untrainable 4096-dimensional representations of the actual images.
I'll close this, and readd the CNN part as a separate task.
A simple baseline is described here: https://arxiv.org/pdf/1512.02167.pdf (Simple Baseline for Visual Question Answering)
It uses a bag-of-words approach to model the questions, and a relatively simple CNN for processing the image. It then uses a softmax to pick an answer, i.e. does no actual language synthesis.