Problem statement
performance bottleneck in knowledge-based VQA comes from the two-stage architecture: knowledge retrieval from external sources followed by supervised training of the question answering task
knowledge from an external source may not be aligned with the embedding space of the reasoning model → retrieved features can be treated as noisy or irrelevant, even when the knowledge was properly retrieved
the re-embedded knowledge features may deviate from their original meaning in the source during reasoning
multiple knowledge resources, such as Wikipedia, ConceptNet, Google Images, and others, are necessary
learning a good joint knowledge-image-question representation requires sufficient training data
hard to transfer to new types of questions
Baseline
Frozen: feeds the image as feature embeddings in a few-shot setting, and updates only the weights of the vision encoder so as not to degrade the linguistic knowledge the PLM learned during pretraining
Data details
| name | abbr | type | format | source | size | description | remark | related tasks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VQAv2 | | image | (image, question, answer) | | | | few-shot eval | visual question answering |
| OK-VQA | | image | (image, question, answer) | COCO image corpus | | | few-shot eval | visual question answering |
Approach
A. Model Architecture
prompt GPT-3 with captions or tags converted from the image, in natural language, as conditions to generate proper answers
proves empirically that textual descriptions converted from the image context lead to a strong baseline for VQA
all inputs are translated into text, and GPT-3's reasoning capabilities are evaluated given n in-context examples (see the sketch below)
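A minimal sketch of how such a text-only prompt might be assembled, assuming a caption-plus-tags context per example; the field names and template wording here are illustrative, not the paper's exact format:

```python
# Minimal sketch of a PICa-style text-only prompt: the image is rendered as a
# caption plus tags, followed by n in-context QA pairs and the target question.
# Template wording is illustrative, not the paper's verbatim format.

def build_prompt(examples, target):
    """examples: list of dicts with 'caption', 'tags', 'question', 'answer';
    target: dict with 'caption', 'tags', 'question' (answer left for GPT-3)."""
    parts = ["Please answer the question according to the context."]
    for ex in examples:
        parts.append(
            f"Context: {ex['caption']} {', '.join(ex['tags'])}\n"
            f"Q: {ex['question']}\nA: {ex['answer']}"
        )
    # The target block ends at "A:" so GPT-3 completes the answer.
    parts.append(
        f"Context: {target['caption']} {', '.join(target['tags'])}\n"
        f"Q: {target['question']}\nA:"
    )
    return "\n\n".join(parts)
```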
B. Methodology
via prompt engineering, utilize GPT-3 as an implicit and unstructured knowledge base → addresses problem (1)
adapt to the VQA task with a few in-context examples at inference time, instead of supervised fine-tuning → addresses problem (2-a)
generate captions and tags from the image with SOTA captioning models (a stand-in sketch follows)
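PICa itself relies on a SOTA captioner (VinVL) plus a separate tagging model; as a stand-in, a sketch with an off-the-shelf Hugging Face image-to-text pipeline, where the checkpoint name is an assumption and any captioner would do:

```python
# Stand-in for the paper's captioning step: any image-to-text checkpoint works;
# the model name below is an assumption, not what PICa used (VinVL).
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

def describe(image_path):
    # Returns a single caption string; tags would come from a separate tagger.
    return captioner(image_path)[0]["generated_text"]
```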
In-context examples matter substantially; however, the number of in-context examples is constrained by the max sequence length of the PLM and by the number of available examples in the pool
how to better select the n in-context examples:
encode the question and image with the text encoder and image encoder of CLIP, respectively
compute text similarities and image similarities against the questions and images of all available examples (encode the examples' questions and images once, and cache them)
average each pair of text similarity and image similarity, and take the top n examples (see the sketch below)
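A sketch of this selection step, assuming L2-normalized CLIP features are already computed and cached: `q_feats` / `i_feats` are (N, d) arrays for the candidate questions / images, and `q` / `i` are (d,) vectors for the target (all names are illustrative):

```python
import numpy as np

def select_examples(q, i, q_feats, i_feats, n=16):
    text_sim = q_feats @ q            # cosine similarity (features pre-normalized)
    image_sim = i_feats @ i
    avg_sim = (text_sim + image_sim) / 2.0
    return np.argsort(-avg_sim)[:n]   # indices of the top-n most similar examples
```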
multi-query ensemble, in order to furthre unleash the power of GPT-3, indicates that iterate answer prediction k times on every selecting n examples, and use one with the highest sum of log-probability as the final answer.
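A sketch of the ensemble, assuming k prompts have been built (e.g., from similarity ranks 0..n-1, n..2n-1, ...); `complete` is a hypothetical wrapper around the GPT-3 completion API returning the answer text and the sum of its token log-probabilities:

```python
# Query GPT-3 once per prompt and keep the most confident answer, i.e. the one
# whose tokens have the highest summed log-probability.

def ensemble_answer(prompts, complete):
    scored = [complete(p) for p in prompts]          # [(answer, logprob_sum), ...]
    return max(scored, key=lambda pair: pair[1])[0]  # most confident answer wins
```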
C. References
train the visual encoder on an image captioning task with gradients back-propagated from the frozen PLM - why did the previous model 'Frozen' not work?
What is the difference between the two image feature embedding models, Oscar and KRISP (or MAVEx)? One performed better than PICa on VQAv2, and the other lagged behind on OK-VQA despite fine-tuning.
The PICa family surpasses previous SOTA models on OK-VQA without fine-tuning.
'Feature Emb.' indicates predicting answers given vectorized image features and the question, while 'Caption' or 'Caption+Tags' means predicting on text generated from the image.
PICa-Full, which adopts the example selection and multi-query ensemble strategies, empirically outperforms PICa-Base; this can be explained by carefully selected examples being more informative to GPT-3 than randomly selected ones.
Evaluation

| model | dataset | remarks | fine-tuned | acc (%) |
| --- | --- | --- | --- | --- |
| KRISP (Marino 2021) | OK-VQA | Feature Emb. | O | 38.9 |
| MAVEx (Wu 2021) | OK-VQA | Feature Emb. | O | 39.4 |
| Frozen (Tsimpoukelli 2021) | OK-VQA | Feature Emb. | X | 12.6 |
| PICa-Base | OK-VQA | Caption | X | 42.0 |
| PICa-Base | OK-VQA | Caption+Tags | X | 43.3 |
| PICa-Full | OK-VQA | Caption | X | 46.9 |
| PICa-Full | OK-VQA | Caption+Tags | X | 48.0 |
| Oscar (Li 2020) | VQAv2 | Feature Emb. | O | 73.8 |
| Frozen (Tsimpoukelli 2021) | VQAv2 | Feature Emb. | X | 38.2 |
| PICa-Base | VQAv2 | Caption | X | 53.2 |
| PICa-Base | VQAv2 | Caption+Tags | X | 54.3 |
| PICa-Full | VQAv2 | Caption | X | 55.9 |
| PICa-Full | VQAv2 | Caption+Tags | X | 56.1 |
| PICa-Full | VQAv2 | GT-Caption-5 | X | 59.7 |
Limitations
Performance is positively correlated with the number of examples, and likewise with the quality of examples.
The quality of examples covers the similarity between in-context examples and the target input, and the degree to which the captions and tags contain details.
When computing the similarity between example and target, text (question) similarity was observed to contribute more than image similarity.
Feeding bad examples (i.e., dissimilar examples) led to worse performance.
GPT-3 performs the role of an implicit knowledge base without other external sources, since PICa's performance depends solely on GPT-3's innate knowledge and reasoning capability.
GPT-3 also generates reasonable rationales for different types of questions when prompted with 'This is because' in the ablation.
PICa sometimes answers questions that require focusing on a part of the image well, since GPT-3 has broad reasoning coverage.
The limitations are observed when
the question requires precise observation to detect the target object, or answers localized to the specific image that are far from commonsense reasoning
the major features of the target object get distracted by other elements even if the object is dominant in the entire image (e.g., giraffes and trees)
Follow-up Actions
How can the information loss incurred while converting an image to captions be overcome?
Although feature embedding-based approaches underperform in this paper, how about feeding image features to a text generator trained in a CLIP-like manner?
The way of selecting in-context examples in this paper is a horizontal expansion of the target, i.e., questions at a similar level of hierarchy.
How about adopting vertical expansion by selecting hierarchical in-context examples that stand in hypernym and/or hyponym relations?