[FROZEN] Multimodal Few-Shot Learning with Frozen Language Models
#21
Open
bigshanedogg opened 2 years ago
bigshanedogg commented 2 years ago
Problem statement
Despite the impressive capabilities of large-scale language models, their potential for modalities other than text has not been fully demonstrated.
Aligning the parameters of a visual encoder and a text encoder by fine-tuning both hurts the generalization of the text encoder.
Glossary
Collected definitions of terms & concepts
Baseline
Oscar: previous VQA SOTA model fine-tuned on VQAv2
MAVEx: previous VQA SOTA model fine-tuned on OKVQA
Data details
| name | abbr | type | format | source | size | description | remark | related tasks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Conceptual Captions | CC3M | image | (image, caption) | | 3M | | | image-text pretraining |
| VQAv2 | | image | (image, question, answer) | | | | | visual question answering |
| OKVQA | | image | (image, question, answer) | | | | | visual question answering |
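For concreteness, a minimal sketch of the record formats in the table above, using hypothetical dataclass names (not any dataset's official schema):

```python
# Illustrative record types mirroring the "format" column above.
# Class and field names are hypothetical, not an official dataset schema.
from dataclasses import dataclass
from typing import Any

@dataclass
class CaptionExample:   # Conceptual Captions (CC3M): (image, caption)
    image: Any          # e.g. a PIL.Image or an image tensor
    caption: str

@dataclass
class VQAExample:       # VQAv2 / OKVQA: (image, question, answer)
    image: Any
    question: str
    answer: str
```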
Approach
A. Model Architecture
Frozen: a large-scale language model kept fixed (no gradient updates) that generates captions (i.e., predicts answers) given conditioning inputs
Aligns a vision encoder to the embedding space of Frozen so that images are represented in a form the transformer already understands
Model inputs consist of image features and in-context examples (see the sketch below)
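A minimal PyTorch-style sketch of this setup, assuming a hypothetical vision backbone and a HuggingFace-style causal LM that exposes `get_input_embeddings` and accepts `inputs_embeds`; names and dimensions are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Maps an image to `n_prefix` vectors in the frozen LM's embedding space."""
    def __init__(self, backbone: nn.Module, feat_dim: int, lm_dim: int, n_prefix: int = 2):
        super().__init__()
        self.backbone = backbone                      # trainable vision encoder
        self.proj = nn.Linear(feat_dim, lm_dim * n_prefix)
        self.n_prefix, self.lm_dim = n_prefix, lm_dim

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)                 # (B, feat_dim) pooled image features
        prefix = self.proj(feats)                     # (B, n_prefix * lm_dim)
        return prefix.view(-1, self.n_prefix, self.lm_dim)

class Frozen(nn.Module):
    """Frozen PLM conditioned on a dynamic visual prefix; only the vision side trains."""
    def __init__(self, lm: nn.Module, visual_prefix: VisualPrefix):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():                # no gradient updates to the PLM
            p.requires_grad = False
        self.visual_prefix = visual_prefix

    def forward(self, images, input_ids, attention_mask):
        img_emb = self.visual_prefix(images)                       # (B, n_prefix, d)
        txt_emb = self.lm.get_input_embeddings()(input_ids)        # (B, T, d)
        inputs = torch.cat([img_emb, txt_emb], dim=1)              # prepend image "tokens"
        prefix_mask = attention_mask.new_ones(img_emb.shape[:2])
        mask = torch.cat([prefix_mask, attention_mask], dim=1)
        return self.lm(inputs_embeds=inputs, attention_mask=mask)  # caption logits
```

During training only `VisualPrefix` receives gradients, so the PLM's pre-trained weights (and the implicit knowledge stored in them) remain intact.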
B. Methodology
Dynamic, image-conditional prefix tuning: modifies previous prefix tuning by feeding a dynamic image-derived prefix at each inference instead of a task-specific static bias term
Leverages the encyclopedic knowledge of the language model for visual tasks, using the PLM as an implicit knowledge base
Varies the number of inner-shots, shots, and repeats (see the few-shot prompt sketch below)
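A sketch of how an in-context few-shot input could be assembled under this scheme, reusing the hypothetical `Frozen` model from the architecture sketch above and a HuggingFace-style tokenizer; the helper name and shot layout are assumptions for illustration:

```python
import torch

def build_fewshot_embeds(model, tokenizer, support, query_image, question):
    """Interleave (image, text) support shots and the query into one embedding
    sequence for the frozen LM (conditioning only, no weight updates)."""
    embed = model.lm.get_input_embeddings()
    chunks = []
    for image, text in support:                                   # in-context shots
        chunks.append(model.visual_prefix(image.unsqueeze(0)))    # (1, n_prefix, d)
        ids = tokenizer(text, return_tensors="pt").input_ids
        chunks.append(embed(ids))                                 # (1, T, d)
    chunks.append(model.visual_prefix(query_image.unsqueeze(0)))  # query image
    ids = tokenizer(question, return_tensors="pt").input_ids
    chunks.append(embed(ids))                                     # query question
    return torch.cat(chunks, dim=1)  # fed to the frozen LM via inputs_embeds=...
```

Varying inner-shots, shots, and repeats changes how many (image, text) chunks appear in this prompt; the PLM itself is never updated.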
C. References
Prefix tuning that feeds a static bias term:
[Prefix-tuning: Optimizing continuous prompts for generation](https://arxiv.org/abs/2101.00190)
(2021)
[The power of scale for parameter-efficient prompt tuning](https://arxiv.org/abs/2104.08691)
(2021)
Pre-trained on aligned data with task-agnostic cross-modal objectives, then fine-tuned:
[Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks](https://arxiv.org/abs/1908.02265)
(2019)
[Vl-bert: Pre-training of generic visual-linguistic representations](https://arxiv.org/abs/1908.08530)
(2019)
Text generation for task-general multimodal models, but all weights are updated without considering zero-shot or few-shot settings:
[Unifying vision-and-language tasks via text generation](https://arxiv.org/abs/2102.02779)
(2021)
As opposed to Frozen, these update the weights of the text encoder and fix the vision encoder:
[Encoder-agnostic adaptation for conditional language generation](https://arxiv.org/abs/1908.06938)
(2019)
[Visualgpt: Data-efficient adaptation of pretrained language models for image captioning](https://arxiv.org/abs/2102.10407)
(2021)
Evaluation
Joint training without sufficient data hinders the reasoning capabilities of the PLM rather than improving them.
OKVQA: $Frozen_{base}$ vs $Frozen_{finetuned}$
VQAv2: $Frozen_{base}$ vs $Frozen_{finetuned}$
Zero-shot capability depends on scale (number of parameters), since it also depends on the PLM's implicit knowledge.
OKVQA: $Frozen_{base}$ vs $Frozen_{small}$
Pre-training on a similar task with identical modalities is meaningful even without accompanying fine-tuning.
VQAv2: $Frozen_{base}$ vs $Frozen_{VQA}$
| model | dataset | fine-tuned | remarks | acc (n=0) |
| --- | --- | --- | --- | --- |
| $Frozen_{base}$ | OKVQA | X | 7B parameters | 5.9 |
| $Frozen_{small}$ | OKVQA | X | 400M parameters | 4.0 |
| $Frozen_{finetuned}$ | OKVQA | X | joint training | 4.2 |
| $Frozen_{train-blind}$ | OKVQA | X | fine-tuned vision encoder w/o image | 3.3 |
| $Frozen_{VQA}$ | OKVQA | X | fine-tuned on VQAv2 | 19.6 |
| MAVEx | OKVQA | O | image features, jointly trained while fine-tuning | 39.4 |
| $Frozen_{base}$ | VQAv2 | X | | 29.5 |
| $Frozen_{scratch}$ | VQAv2 | X | whole system trained from scratch (PLM not pre-trained) | 0 |
| $Frozen_{finetuned}$ | VQAv2 | X | | 24.0 |
| $Frozen_{train-blind}$ | VQAv2 | X | | 26.2 |
| $Frozen_{VQA}$ | VQAv2 | O | | 48.4 |
| Oscar | VQAv2 | O | | 73.8 |

fine-tuned: O = fine-tuned on the evaluation dataset, X = not fine-tuned. acc (n=0): accuracy with zero in-context examples; see the metric sketch below.
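The acc (n=0) column reports accuracy when the model is prompted with no in-context examples. A sketch of the standard open-ended VQA accuracy score used by VQAv2/OKVQA, with answer normalization omitted for brevity:

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Standard VQA accuracy: a prediction scores min(#matching human answers / 3, 1).
    Answer normalization (articles, punctuation, number words) is omitted here."""
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)

# e.g. 7 of 10 annotators answered "umbrella": full credit
print(vqa_accuracy("umbrella", ["umbrella"] * 7 + ["parasol"] * 3))  # 1.0
```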
Limitations
Assuming that aligning embedding spaces is required to perform multimodal tasks, joint training is not a mandatory approach; it can also be achieved by fixing one model and updating the weights of the other.
Compared to previous works, fixing the large-scale model is more advantageous for leveraging pre-trained implicit knowledge.
The model adapted rapidly from captioning behavior to question-answering behavior with simple prompting or few-shot learning.
Observed the potential of encyclopedic knowledge grounded in image and language cues.
If a sufficient amount of data is not guaranteed, exploiting pre-trained implicit knowledge may be a safer choice for generalization than fine-tuning.
Follow-up Actions
How can the image condition be conveyed without relying on the learned implicit knowledge of the PLM?
bigshanedogg commented 2 years ago
2106.13884.pdf