Problem statement
performance bottleneck in knowledge-based VQA comes from the two-stage architecture: knowledge retrieval from external sources followed by supervised training of the question answering task
knowledge from an external source may not be aligned with the embedding space of the reasoning model → retrieved features can be treated as noisy or irrelevant, even when the knowledge was properly retrieved
the re-embedded knowledge features may deviate from their original meaning in the source during reasoning
multiple knowledge resources, such as Wikipedia, ConceptNet, Google Images, and others, are necessary
learning a good joint knowledge-image-question representation requires sufficient training data
hard to transfer to new types of questions
Baseline
Frozen: feeds the image as feature embeddings in a few-shot setting, and updates only the weights of the vision encoder so as not to degrade the linguistic knowledge the PLM learned during pretraining
Data details
| name | abbr | type | format | source | size | description | remark | related tasks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VQAv2 | | image | (image, question, answer) | | | | few-shot eval | visual question answering |
| OK-VQA | | image | (image, question, answer) | COCO image corpus | | | few-shot eval | visual question answering |
Approach
A. Model Architecture
prompt GPT-3 with captions or tags converted from the image, in natural language, as conditions to generate proper answers
proves empirically that textual descriptions converted from the image context lead to a strong baseline for VQA
all inputs are translated into text, and GPT-3's reasoning capabilities are evaluated given n in-context examples (see the sketch below)
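A minimal sketch of how such a text-only prompt might be assembled, assuming a caption-plus-tags context per example; the field names and template wording here are illustrative, not the paper's exact format:

```python
# Minimal sketch of a PICa-style text-only prompt: the image is rendered as a
# caption plus tags, followed by n in-context QA pairs and the target question.
# Template wording is illustrative, not the paper's verbatim format.

def build_prompt(examples, target):
    """examples: list of dicts with 'caption', 'tags', 'question', 'answer';
    target: dict with 'caption', 'tags', 'question' (answer left for GPT-3)."""
    parts = ["Please answer the question according to the context."]
    for ex in examples:
        parts.append(
            f"Context: {ex['caption']} {', '.join(ex['tags'])}\n"
            f"Q: {ex['question']}\nA: {ex['answer']}"
        )
    # The target block ends at "A:" so GPT-3 completes the answer.
    parts.append(
        f"Context: {target['caption']} {', '.join(target['tags'])}\n"
        f"Q: {target['question']}\nA:"
    )
    return "\n\n".join(parts)
```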
B. Methodology
via prompt engineering, utilize GPT-3 as an implicit and unstructured knowledge base → addresses problem (1)
adapt to the VQA task with a few in-context examples at inference time, instead of supervised fine-tuning → addresses problem (2-a)
generate captions and tags from the image with SOTA captioning models (a stand-in sketch follows)
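PICa itself relies on a SOTA captioner (VinVL) plus a separate tagging model; as a stand-in, a sketch with an off-the-shelf Hugging Face image-to-text pipeline, where the checkpoint name is an assumption and any captioner would do:

```python
# Stand-in for the paper's captioning step: any image-to-text checkpoint works;
# the model name below is an assumption, not what PICa used (VinVL).
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

def describe(image_path):
    # Returns a single caption string; tags would come from a separate tagger.
    return captioner(image_path)[0]["generated_text"]
```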
In-context examples matter substantially; however, the number of in-context examples is constrained by the max sequence length of the PLM and by the number of available examples in the pool
how to better select the n in-context examples:
encode the question and image with the text encoder and image encoder of CLIP, respectively
compute text similarities and image similarities against the questions and images of all available examples (encode the examples' questions and images once, and cache them)
average each pair of text similarity and image similarity, and take the top n examples (see the sketch below)
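A sketch of this selection step, assuming L2-normalized CLIP features are already computed and cached: `q_feats` / `i_feats` are (N, d) arrays for the candidate questions / images, and `q` / `i` are (d,) vectors for the target (all names are illustrative):

```python
import numpy as np

def select_examples(q, i, q_feats, i_feats, n=16):
    text_sim = q_feats @ q            # cosine similarity (features pre-normalized)
    image_sim = i_feats @ i
    avg_sim = (text_sim + image_sim) / 2.0
    return np.argsort(-avg_sim)[:n]   # indices of the top-n most similar examples
```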
multi-query ensemble, in order to furthre unleash the power of GPT-3, indicates that iterate answer prediction k times on every selecting n examples, and use one with the highest sum of log-probability as the final answer.
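A sketch of the ensemble, assuming k prompts have been built (e.g., from similarity ranks 0..n-1, n..2n-1, ...); `complete` is a hypothetical wrapper around the GPT-3 completion API returning the answer text and the sum of its token log-probabilities:

```python
# Query GPT-3 once per prompt and keep the most confident answer, i.e. the one
# whose tokens have the highest summed log-probability.

def ensemble_answer(prompts, complete):
    scored = [complete(p) for p in prompts]          # [(answer, logprob_sum), ...]
    return max(scored, key=lambda pair: pair[1])[0]  # most confident answer wins
```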
C. References
train the visual encoder on an image captioning task with gradients back-propagated from the frozen PLM - why did the previous model 'Frozen' not work?
What is the difference between the two image feature embedding models, Oscar and KRISP (or MAVEx)? One performed better than PICa on VQAv2, and the other lagged behind on OK-VQA despite fine-tuning.
The PICa family surpasses previous SOTA models on OK-VQA without fine-tuning.
'Feature Emb.' indicates predicting answers given vectorized image features and the question, while 'Caption' or 'Caption+Tags' means predicting on text generated from the image.
PICa-Full, which adopts the example selection and multi-query ensemble strategies, empirically outperforms PICa-Base; this can be explained by carefully selected examples being more informative to GPT-3 than randomly selected ones.
Evaluation

| model | dataset | remarks | fine-tuned | acc (%) |
| --- | --- | --- | --- | --- |
| KRISP (Marino 2021) | OK-VQA | Feature Emb. | O | 38.9 |
| MAVEx (Wu 2021) | OK-VQA | Feature Emb. | O | 39.4 |
| Frozen (Tsimpoukelli 2021) | OK-VQA | Feature Emb. | X | 12.6 |
| PICa-Base | OK-VQA | Caption | X | 42.0 |
| PICa-Base | OK-VQA | Caption+Tags | X | 43.3 |
| PICa-Full | OK-VQA | Caption | X | 46.9 |
| PICa-Full | OK-VQA | Caption+Tags | X | 48.0 |
| Oscar (Li 2020) | VQAv2 | Feature Emb. | O | 73.8 |
| Frozen (Tsimpoukelli 2021) | VQAv2 | Feature Emb. | X | 38.2 |
| PICa-Base | VQAv2 | Caption | X | 53.2 |
| PICa-Base | VQAv2 | Caption+Tags | X | 54.3 |
| PICa-Full | VQAv2 | Caption | X | 55.9 |
| PICa-Full | VQAv2 | Caption+Tags | X | 56.1 |
| PICa-Full | VQAv2 | GT-Caption-5 | X | 59.7 |
Limitations
Performance is positively correlated with the number of examples, and likewise with the quality of examples.
The quality of examples covers the similarity between in-context examples and the target input, and the degree to which the captions and tags contain details.
When computing the similarity between example and target, text (question) similarity was observed to contribute more than image similarity.
Feeding bad examples (i.e., dissimilar examples) led to worse performance.
GPT-3 performs the role of an implicit knowledge base without other external sources, since PICa's performance depends solely on GPT-3's innate knowledge and reasoning capability.
GPT-3 also generates reasonable rationales for different types of questions when prompted with 'This is because' in the ablation.
PICa sometimes answers questions that require focusing on a part of the image well, since GPT-3 has broad reasoning coverage.
The limitations are observed when
the question requires precise observation to detect the target object, or answers localized to the specific image that are far from commonsense reasoning
the major features of the target object get distracted by other elements even if the object is dominant in the entire image (e.g., giraffes and trees)
Follow-up Actions
How can the information loss incurred while converting an image to captions be overcome?
Although feature embedding-based approaches underperform in this paper, how about feeding image features to a text generator trained in a CLIP-like manner?
The way of selecting in-context examples in this paper is a horizontal expansion of the target, i.e., questions at a similar level of hierarchy.
How about adopting vertical expansion by selecting hierarchical in-context examples that stand in hypernym and/or hyponym relations?