bigshanedogg / survey


An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA #23

Open bigshanedogg opened 2 years ago

bigshanedogg commented 2 years ago

Problem statement

  1. performance bottleneck in knowledge-based VQA due to the two-phase architecture, consisting of knowledge retrieval from external sources followed by supervised training of the question answering task
    1. knowledge from an external source may not be aligned with the embedding space of the reasoning model → a retrieved feature can be treated as noisy or irrelevant, even when the knowledge itself was properly retrieved
    2. the re-embedded knowledge features might deviate from their original meaning in the source during reasoning
    3. multiple knowledge resources, such as Wikipedia, ConceptNet, Google Images, and others, are necessary
  2. learning a good joint knowledge-image-question representation requires sufficient training data
    1. hard to transfer to new types of questions

Baseline

Data details

| name | abbr | type | format | source | size | description | remark | related tasks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VQAv2 | | image | (image, question, answer) | | | | few-shot eval | visual question answering |
| OKVQA | | image | (image, question, answer) | COCO image corpus | | | few-shot eval | visual question answering |

Approach

A. Model Architecture

B. Methodology
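PICa sidesteps the retrieval pipeline by converting the image into text (a caption, optionally plus object tags) and prompting GPT-3 with a handful of in-context (context, question, answer) examples. A minimal sketch of that prompt assembly, assuming captions and tags have already been extracted by an off-the-shelf captioner/tagger; the prompt-head wording and `===` separator here are simplified approximations of the paper's format:

```python
def build_pica_prompt(examples, caption, tags, question):
    """Assemble a PICa-style few-shot prompt for GPT-3.

    examples: list of (caption, tags, question, answer) tuples used as
              in-context demonstrations.
    caption, tags, question: the test image's textual context and query.
    Returns a single prompt string ending in "A:" for GPT-3 to complete.
    """
    head = "Please answer the question according to the above context.\n"
    blocks = []
    # Each in-context example is rendered as a Context / Q / A block.
    for cap, tg, q, a in examples:
        context = f"{cap} {', '.join(tg)}".strip()
        blocks.append(f"Context: {context}\nQ: {q}\nA: {a}")
    # The test question comes last, with the answer left blank.
    context = f"{caption} {', '.join(tags)}".strip()
    blocks.append(f"Context: {context}\nQ: {question}\nA:")
    return head + "\n===\n".join(blocks)
```

With no parameter updates, accuracy then hinges on which in-context examples are chosen (the paper selects them by question/image similarity) and on caption quality, which is why Caption+Tags and GT-Caption rows score higher in the evaluation below.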

C. References

Evaluation

| model | dataset | image representation | fine-tuned | acc |
| --- | --- | --- | --- | --- |
| KRISP (Marino 2021) | OK-VQA | Feature Emb. | O | 38.9 |
| MAVEx (Wu 2021) | OK-VQA | Feature Emb. | O | 39.4 |
| Frozen (Tsimpoukelli 2021) | OK-VQA | Feature Emb. | X | 12.6 |
| PICa-Base | OK-VQA | Caption | X | 42.0 |
| PICa-Base | OK-VQA | Caption+Tags | X | 43.3 |
| PICa-Full | OK-VQA | Caption | X | 46.9 |
| PICa-Full | OK-VQA | Caption+Tags | X | 48.0 |
| Oscar (Li 2020) | VQAv2 | Feature Emb. | O | 73.8 |
| Frozen (Tsimpoukelli 2021) | VQAv2 | Feature Emb. | X | 38.2 |
| PICa-Base | VQAv2 | Caption | X | 53.2 |
| PICa-Base | VQAv2 | Caption+Tags | X | 54.3 |
| PICa-Full | VQAv2 | Caption | X | 55.9 |
| PICa-Full | VQAv2 | Caption+Tags | X | 56.1 |
| PICa-Full | VQAv2 | GT-Caption-5 | X | 59.7 |

Limitations

Follow-up Actions

bigshanedogg commented 2 years ago

2109.05014.pdf