[FROZEN] Multimodal Few-Shot Learning with Frozen Language Models
#21
Open
bigshanedogg opened 2 years ago
bigshanedogg commented 2 years ago
Problem statement
Despite the impressive capabilities of large-scale language models, their potential for modalities other than text has not been fully demonstrated.
Aligning the parameters of a visual encoder and a text encoder by fine-tuning both hurts the generalization of the text encoder.
Glossary
Collected definitions of terms & concepts
Baseline
Oscar: previous VQA SOTA model fine-tuned on VQAv2
MAVEx: previous VQA SOTA model fine-tuned on OKVQA
Data details
| name | abbr | type | format | source | size | description | remark | related tasks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Conceptual Captions | CC3M | image | (image, caption) | | 3M | | | image-text pretraining |
| VQAv2 | | image | (image, question, answer) | | | | | visual question answering |
| OKVQA | | image | (image, question, answer) | | | | | visual question answering |
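For concreteness, a minimal sketch of the record formats in the table above, using hypothetical dataclass names (not any dataset's official schema):

```python
# Illustrative record types mirroring the "format" column above.
# Class and field names are hypothetical, not an official dataset schema.
from dataclasses import dataclass
from typing import Any

@dataclass
class CaptionExample:   # Conceptual Captions (CC3M): (image, caption)
    image: Any          # e.g. a PIL.Image or an image tensor
    caption: str

@dataclass
class VQAExample:       # VQAv2 / OKVQA: (image, question, answer)
    image: Any
    question: str
    answer: str
```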
Approach
A. Model Architecture
Frozen: a large-scale language model kept fixed (no gradient updates) that generates captions (i.e., predicts answers) given conditioning inputs
Aligns a vision encoder to the embedding space of Frozen so that images are represented in a form the transformer already understands
Model inputs consist of image features and in-context examples (see the sketch below)
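A minimal PyTorch-style sketch of this setup, assuming a hypothetical vision backbone and a HuggingFace-style causal LM that exposes `get_input_embeddings` and accepts `inputs_embeds`; names and dimensions are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Maps an image to `n_prefix` vectors in the frozen LM's embedding space."""
    def __init__(self, backbone: nn.Module, feat_dim: int, lm_dim: int, n_prefix: int = 2):
        super().__init__()
        self.backbone = backbone                      # trainable vision encoder
        self.proj = nn.Linear(feat_dim, lm_dim * n_prefix)
        self.n_prefix, self.lm_dim = n_prefix, lm_dim

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)                 # (B, feat_dim) pooled image features
        prefix = self.proj(feats)                     # (B, n_prefix * lm_dim)
        return prefix.view(-1, self.n_prefix, self.lm_dim)

class Frozen(nn.Module):
    """Frozen PLM conditioned on a dynamic visual prefix; only the vision side trains."""
    def __init__(self, lm: nn.Module, visual_prefix: VisualPrefix):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():                # no gradient updates to the PLM
            p.requires_grad = False
        self.visual_prefix = visual_prefix

    def forward(self, images, input_ids, attention_mask):
        img_emb = self.visual_prefix(images)                       # (B, n_prefix, d)
        txt_emb = self.lm.get_input_embeddings()(input_ids)        # (B, T, d)
        inputs = torch.cat([img_emb, txt_emb], dim=1)              # prepend image "tokens"
        prefix_mask = attention_mask.new_ones(img_emb.shape[:2])
        mask = torch.cat([prefix_mask, attention_mask], dim=1)
        return self.lm(inputs_embeds=inputs, attention_mask=mask)  # caption logits
```

During training only `VisualPrefix` receives gradients, so the PLM's pre-trained weights (and the implicit knowledge stored in them) remain intact.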
B. Methodology
Dynamic, image-conditional prefix tuning: modifies previous prefix tuning by feeding a dynamic image-derived prefix at each inference instead of a task-specific static bias term
Leverages the encyclopedic knowledge of the language model for visual tasks, using the PLM as an implicit knowledge base
Varies the number of inner-shots, shots, and repeats (see the few-shot prompt sketch below)
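A sketch of how an in-context few-shot input could be assembled under this scheme, reusing the hypothetical `Frozen` model from the architecture sketch above and a HuggingFace-style tokenizer; the helper name and shot layout are assumptions for illustration:

```python
import torch

def build_fewshot_embeds(model, tokenizer, support, query_image, question):
    """Interleave (image, text) support shots and the query into one embedding
    sequence for the frozen LM (conditioning only, no weight updates)."""
    embed = model.lm.get_input_embeddings()
    chunks = []
    for image, text in support:                                   # in-context shots
        chunks.append(model.visual_prefix(image.unsqueeze(0)))    # (1, n_prefix, d)
        ids = tokenizer(text, return_tensors="pt").input_ids
        chunks.append(embed(ids))                                 # (1, T, d)
    chunks.append(model.visual_prefix(query_image.unsqueeze(0)))  # query image
    ids = tokenizer(question, return_tensors="pt").input_ids
    chunks.append(embed(ids))                                     # query question
    return torch.cat(chunks, dim=1)  # fed to the frozen LM via inputs_embeds=...
```

Varying inner-shots, shots, and repeats changes how many (image, text) chunks appear in this prompt; the PLM itself is never updated.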
C. References
Prefix tuning that feeds a static bias term:
[Prefix-tuning: Optimizing continuous prompts for generation](https://arxiv.org/abs/2101.00190)
(2021)
[The power of scale for parameter-efficient prompt tuning](https://arxiv.org/abs/2104.08691)
(2021)
Pre-trained on aligned data with task-agnostic cross-modal objectives, then fine-tuned:
[Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks](https://arxiv.org/abs/1908.02265)
(2019)
[Vl-bert: Pre-training of generic visual-linguistic representations](https://arxiv.org/abs/1908.08530)
(2019)
Text generation for task-general multimodal models, but all weights are updated without considering zero-shot or few-shot settings:
[Unifying vision-and-language tasks via text generation](https://arxiv.org/abs/2102.02779)
(2021)
As opposed to Frozen, these update the weights of the text encoder and fix the vision encoder:
[Encoder-agnostic adaptation for conditional language generation](https://arxiv.org/abs/1908.06938)
(2019)
[Visualgpt: Data-efficient adaptation of pretrained language models for image captioning](https://arxiv.org/abs/2102.10407)
(2021)
Evaluation
Joint training without sufficient data hinders the reasoning capabilities of the PLM rather than improving them.
OKVQA: $Frozen_{base}$ vs $Frozen_{finetuned}$
VQAv2: $Frozen_{base}$ vs $Frozen_{finetuned}$
Zero-shot capability depends on scale (number of parameters), since it also depends on the PLM's implicit knowledge.
OKVQA: $Frozen_{base}$ vs $Frozen_{small}$
Pre-training on a similar task with identical modalities is meaningful even without accompanying fine-tuning.
VQAv2: $Frozen_{base}$ vs $Frozen_{VQA}$
| model | dataset | fine-tuned | remarks | acc (n=0) |
| --- | --- | --- | --- | --- |
| $Frozen_{base}$ | OKVQA | X | 7B parameters | 5.9 |
| $Frozen_{small}$ | OKVQA | X | 400M parameters | 4.0 |
| $Frozen_{finetuned}$ | OKVQA | X | joint training | 4.2 |
| $Frozen_{train-blind}$ | OKVQA | X | fine-tuned vision encoder w/o image | 3.3 |
| $Frozen_{VQA}$ | OKVQA | X | fine-tuned on VQAv2 | 19.6 |
| MAVEx | OKVQA | O | image features, jointly trained while fine-tuning | 39.4 |
| $Frozen_{base}$ | VQAv2 | X | | 29.5 |
| $Frozen_{scratch}$ | VQAv2 | X | whole system trained from scratch (PLM not pre-trained) | 0 |
| $Frozen_{finetuned}$ | VQAv2 | X | | 24.0 |
| $Frozen_{train-blind}$ | VQAv2 | X | | 26.2 |
| $Frozen_{VQA}$ | VQAv2 | O | | 48.4 |
| Oscar | VQAv2 | O | | 73.8 |

fine-tuned: O = fine-tuned on the evaluation dataset, X = not fine-tuned. acc (n=0): accuracy with zero in-context examples; see the metric sketch below.
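The acc (n=0) column reports accuracy when the model is prompted with no in-context examples. A sketch of the standard open-ended VQA accuracy score used by VQAv2/OKVQA, with answer normalization omitted for brevity:

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Standard VQA accuracy: a prediction scores min(#matching human answers / 3, 1).
    Answer normalization (articles, punctuation, number words) is omitted here."""
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)

# e.g. 7 of 10 annotators answered "umbrella": full credit
print(vqa_accuracy("umbrella", ["umbrella"] * 7 + ["parasol"] * 3))  # 1.0
```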
Limitations
Assuming that aligning embedding spaces is required to perform multimodal tasks, joint training is not a mandatory approach; it can also be achieved by fixing one model and updating the weights of the other.
Compared to previous works, fixing the large-scale model is more advantageous for leveraging pre-trained implicit knowledge.
The model adapted rapidly from captioning behavior to question-answering behavior with simple prompting or few-shot learning.
Observed the potential of encyclopedic knowledge grounded in image and language cues.
If a sufficient amount of data is not guaranteed, exploiting pre-trained implicit knowledge may be a safer choice for generalization than fine-tuning.
Follow-up Actions
How can the image condition be conveyed without relying on the learned implicit knowledge of the PLM?
bigshanedogg commented 2 years ago
2106.13884.pdf