
[FROZEN] Multimodal Few-Shot Learning with Frozen Language Models #21

bigshanedogg opened 1 year ago

bigshanedogg commented 1 year ago

Problem statement

  1. Despite the impressive capabilities of large-scale language models, their potential in modalities other than text has not been fully demonstrated.
  2. Aligning the parameters of a visual encoder and a text encoder by fine-tuning both hurts the generalization of the text encoder (see the sketch below).
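
Frozen's remedy for (2) is to keep the pretrained language model's weights fixed and update only the vision encoder. A minimal sketch of that training setup, assuming PyTorch; both modules are toy stand-ins, not the paper's code:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the paper's components.
language_model = nn.TransformerEncoderLayer(d_model=512, nhead=8)  # pretrained LM stand-in
vision_encoder = nn.Linear(2048, 512)                              # vision encoder stand-in

# Freeze every LM parameter: gradients still flow *through* the LM
# back to the vision encoder, but the LM itself never changes.
for p in language_model.parameters():
    p.requires_grad = False

# Only the vision encoder's parameters are optimized.
optimizer = torch.optim.Adam(vision_encoder.parameters(), lr=3e-4)
```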

Glossary

Baseline

Data details

| name | abbr | type | format | source | size | description | remark | related tasks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Conceptual Captions | CC3M | image | (image, caption) | | 3M | | | image-text pretraining |
| VQAv2 | | image | (image, question, answer) | | | | | visual question answering |
| OKVQA | | image | | | | | | visual question answering |

Approach

A. Model Architecture
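
A minimal sketch of the Frozen forward pass, assuming PyTorch: the trainable vision encoder maps an image to a few "visual prefix" vectors in the LM's token-embedding space, the prefix is prepended to the text-token embeddings, and the combined sequence is processed by the frozen LM. The toy CNN backbone and all sizes here are assumptions; the paper itself uses an NF-ResNet-50 backbone, 2 prefix tokens, and a 7B-parameter autoregressive LM.

```python
import torch
import torch.nn as nn

D_MODEL, N_PREFIX, VOCAB = 512, 2, 32000  # assumed sizes; the paper uses 2 prefix tokens

class VisionPrefixEncoder(nn.Module):
    """Maps an image to N_PREFIX vectors in the LM's embedding space.
    The toy CNN stands in for the paper's NF-ResNet-50 backbone."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, N_PREFIX * D_MODEL)

    def forward(self, images):                        # (B, 3, H, W)
        feats = self.backbone(images)                 # (B, 64)
        return self.proj(feats).view(-1, N_PREFIX, D_MODEL)

# Stand-ins for a pretrained autoregressive LM: token embeddings + transformer body.
token_emb = nn.Embedding(VOCAB, D_MODEL)
lm_body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
    num_layers=2,
)
for module in (token_emb, lm_body):
    for p in module.parameters():
        p.requires_grad = False                       # the LM stays frozen

vision_encoder = VisionPrefixEncoder()                # the only trainable part

images = torch.randn(4, 3, 224, 224)
caption_ids = torch.randint(0, VOCAB, (4, 10))

prefix = vision_encoder(images)                       # (4, N_PREFIX, D_MODEL)
text = token_emb(caption_ids)                         # (4, 10, D_MODEL)
inputs = torch.cat([prefix, text], dim=1)             # visual prefix + text sequence
hidden = lm_body(inputs)                              # gradients reach only vision_encoder
print(hidden.shape)                                   # torch.Size([4, 12, 512])
```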

Evaluation

X/O marks whether the model was fine-tuned on the evaluation dataset; acc (n=0) is accuracy with zero in-context examples.

| model | dataset | fine-tuned | remarks | acc (n=0) |
| --- | --- | --- | --- | --- |
| $Frozen_{base}$ | OKVQA | X | 7B parameters | 5.9 |
| $Frozen_{small}$ | OKVQA | X | 400M parameters | 4.0 |
| $Frozen_{finetuned}$ | OKVQA | X | joint training | 4.2 |
| $Frozen_{train-blind}$ | OKVQA | X | fine-tuned vision encoder w/o images | 3.3 |
| $Frozen_{VQA}$ | OKVQA | X | fine-tuned on VQAv2 | 19.6 |
| MAVEx | OKVQA | O | uses image features; jointly trained while fine-tuning | 39.4 |
| $Frozen_{base}$ | VQAv2 | X | | 29.5 |
| $Frozen_{scratch}$ | VQAv2 | X | trained from scratch; neither the vision encoder nor the PLM is pretrained | 0 |
| $Frozen_{finetuned}$ | VQAv2 | X | | 24.0 |
| $Frozen_{train-blind}$ | VQAv2 | X | | 26.2 |
| $Frozen_{VQA}$ | VQAv2 | O | | 48.4 |
| Oscar | VQAv2 | O | | 73.8 |

Limitations

Follow-up Actions

bigshanedogg commented 1 year ago

2106.13884.pdf