chaos-moon / paper_daily

One paper a day, keep laziness away.

BLIP

paper
code
blog
institution: Salesforce Research

TL;DR

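Proposes a unified framework for vision-language understanding and generation tasks (a BERT-based pre-trained model plus a bootstrapped way of building a large-scale web dataset), and reaches SOTA on 7 tasks including VQA, image captioning, and image-text retrieval.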

CLIP vs BLIP

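Model performance
- CLIP does better on image-text retrieval
- BLIP does better on image captioning (stable diffusion webui supports using BLIP to turn an image into a text prompt)
Encoders per modality
- CLIP: ResNet/ViT (image) + transformer (text)
- BLIP: ViT (image) + BERT (text)
Different emphasis in the story each paper tells
- CLIP focuses on aligning text and image into the same feature space so that features are cross-modal; it is encoder-only and pays little attention to decoders for downstream tasks
- BLIP focuses on the bootstrapped dataset (data) and a unified vision-language task framework that brings downstream tasks under one umbrella (although each task still ends up with its own separate model)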

method

contribution

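Model level
- Vision-language pre-trained models such as CLIP are encoder-based and do not transfer well to text generation tasks; encoder-decoder models, in turn, have not been successfully applied to image-text retrieval
- This paper proposes the Multimodal mixture of Encoder-Decoder (MED), which can act as either an encoder or a decoder and so handles a range of downstream tasks
Data level
- SOTA models such as CLIP are trained directly on web data, which carries a lot of noise
- This paper proposes Captioning and Filtering (CapFilt), which synthesizes image captions and removes noisy labels from the caption data (both the original and the synthetic captions)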
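
The CapFilt idea above can be sketched as a small data-cleaning loop. This is only an illustration under stated assumptions: `capfilt`, `toy_captioner`, `toy_filter`, and the toy data are hypothetical names invented here, and the word-overlap filter merely stands in for whatever learned captioner and filter the paper actually uses.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image_id, caption)


def capfilt(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],
    is_match: Callable[[str, str], bool],
) -> List[Pair]:
    """Bootstrap a cleaner dataset from noisy (image, caption) web pairs."""
    cleaned: List[Pair] = []
    for image_id, web_caption in web_pairs:
        # Captioning step: synthesize one extra caption per image.
        synthetic_caption = captioner(image_id)
        # Filtering step: the original AND the synthetic caption are both checked.
        for caption in (web_caption, synthetic_caption):
            if is_match(image_id, caption):
                cleaned.append((image_id, caption))
    return cleaned


# Hypothetical stand-ins for the learned captioner and filter.
TRUE_CONTENT = {"img1": "a dog on grass", "img2": "a red car"}
STOP_WORDS = {"a", "on"}


def toy_captioner(image_id: str) -> str:
    return TRUE_CONTENT[image_id]


def toy_filter(image_id: str, caption: str) -> bool:
    # Keep a caption only if it shares a content word with the image content.
    truth = set(TRUE_CONTENT[image_id].split()) - STOP_WORDS
    return bool(truth & set(caption.split()))


web = [("img1", "best deals click here"), ("img2", "my red car")]
result = capfilt(web, toy_captioner, toy_filter)
print(result)
```

In this toy run the spam caption for `img1` is dropped, while the matching web caption and both synthetic captions are kept.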

model

Unimodal encoder
- image: ViT
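- text: BERT, with a [CLS] token inserted at the start of the sentence to summarize the whole sentence
Image-grounded text encoder
- Inserts a cross-attention layer between the self-attention layer and the FFN layer to inject visual information into the network
- Adds an [Encode] token whose output serves as the cross-modal image-text representation
Image-grounded text decoder
- Replaces the bidirectional self-attention layers with causal self-attention layers so that it can generate a sequence
- A [Decode] token marks the start of the sequence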
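
The three modes of the text transformer can be summarized as a small lookup table, together with the attention mask that separates the encoder from the decoder. This is a rough sketch, not the paper's code: `MED_MODES` and `attention_mask` are names made up for illustration, and the sublayer strings are only labels.

```python
from typing import Dict, List

# Hypothetical summary of the three modes described in the list above.
MED_MODES: Dict[str, Dict[str, object]] = {
    "unimodal_text_encoder": {
        "start_token": "[CLS]",
        "sublayers": ["bidirectional_self_attention", "ffn"],
    },
    "image_grounded_text_encoder": {
        "start_token": "[Encode]",
        "sublayers": ["bidirectional_self_attention", "cross_attention", "ffn"],
    },
    "image_grounded_text_decoder": {
        "start_token": "[Decode]",
        "sublayers": ["causal_self_attention", "cross_attention", "ffn"],
    },
}


def attention_mask(seq_len: int, causal: bool) -> List[List[int]]:
    """mask[i][j] == 1 means that token i may attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


for row in attention_mask(3, causal=True):
    print(row)
```

With `causal=True` each token sees only itself and earlier tokens, which is what lets the decoder generate text left to right; with `causal=False` every token sees the whole sentence.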

loss

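Three losses are trained jointly:

Image-Text Contrastive Loss (ITC): similar to CLIP, used for text-image alignment
Image-Text Matching Loss (ITM): a binary classification loss designed to capture finer-grained vision-language alignment, with a hard negative mining strategy
Language Modeling Loss (LM): the cross-entropy loss of an autoregressive task, used to optimize text generation tasks such as captioning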
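
A minimal pure-Python sketch of the three losses may make the descriptions above more concrete. Everything here is an assumption for illustration: the function names are invented, `sim[i][j]` is taken to be the image-text similarity already divided by a temperature, and the hard negative is picked deterministically as the most similar non-matching text, which simplifies whatever selection the paper actually performs.

```python
import math
from typing import List


def log_softmax(row: List[float]) -> List[float]:
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def itc_loss(sim: List[List[float]]) -> float:
    """Symmetric contrastive loss; matching pairs sit on the diagonal."""
    n = len(sim)
    image_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]  # transpose for text-to-image
    text_to_image = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


def hardest_negative_text(sim: List[List[float]], i: int) -> int:
    """Index of the non-matching text that looks most similar to image i."""
    candidates = [j for j in range(len(sim[i])) if j != i]
    return max(candidates, key=lambda j: sim[i][j])


def itm_loss(match_prob: float, is_match: bool) -> float:
    """Binary cross-entropy on the predicted probability that the pair matches."""
    return -math.log(match_prob if is_match else 1.0 - match_prob)


def lm_loss(target_log_probs: List[float]) -> float:
    """Mean negative log-likelihood of the ground-truth next tokens."""
    return -sum(target_log_probs) / len(target_log_probs)


sim = [[5.0, 1.0, 0.0], [0.5, 4.0, 3.0], [0.0, 2.0, 6.0]]
print(itc_loss(sim))
print(hardest_negative_text(sim, 1))
```

For image 1 the hardest negative is text 2, since it has the highest similarity among the non-matching texts; such pairs are the ones worth feeding to the ITM classifier as negatives.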

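data

downstream tasks

thoughts

Compared with CLIP's clean and simple style, BLIP feels rather overcomplicated.
In essence it stitches together the models for several different tasks and reuses as many of their modules as possible, which makes it look sprawling and cluttered.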

BLIP-2

BLIP-2 vs GPT-4[^blip2_blog]

[^blip2_blog]: Mostly taken from the BLIP-2 blog

Generic vs. Specific

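BLIP2 is a generic multimodal pre-training technique for vision-language pretraining that helps an LLM understand images and enables zero-shot image-to-text generation
GPT4 is one specific pre-trained model, and it is unclear what techniques it actually uses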

Open-source vs. Closed-source (API-only)

Fast vs. Slow: BLIP2 is much faster

Unsupervised learning vs. (presumably) Supervised learning

BLIP2 uses web data that is self-labeled
GPT4, judging reasonably from ChatGPT, uses a large amount of human-annotated data

method

model architecture

The Image Encoder and the LLM are both pre-trained; only a lightweight Q-Former needs to be trained

Q-Former

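Role: acts as a bridge between the Image Encoder (output features of $257 \times 1024$ for ViT-L/14) and the LLM so that the LLM can make sense of the image features
Parameters: 188M
queries: $32\times 768$, a set of learnable representations/tokens. Queries go in and queries come out; along the way the Q-Former repeatedly applies self-attention among the queries and cross-attention to the image features, so the important image features are absorbed into the queries, which are finally passed to the LLM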
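
To see why a fixed set of queries works as a bottleneck, here is a toy single-head cross-attention in plain Python. It is a sketch under loud assumptions: `cross_attend` is a hypothetical helper, there are no learned projections, and queries and image tokens share one tiny dimension, whereas the real Q-Former uses 768-d queries against 1024-d image features. Only the two shapes quoted above come from the notes.

```python
import math
from typing import List

Vector = List[float]

IMAGE_FEATURE_SHAPE = (257, 1024)  # from the notes, ViT-L/14
QUERY_SHAPE = (32, 768)            # from the notes


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], image_tokens: List[Vector]) -> List[Vector]:
    """Each query pulls a weighted average of the image tokens into itself."""
    d = len(queries[0])
    out: List[Vector] = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in image_tokens]
        weights = softmax(scores)
        out.append([sum(w * tok[c] for w, tok in zip(weights, image_tokens)) for c in range(d)])
    return out


queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 2 toy queries
image_tokens = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.7],
                [0.5, 0.5, 0.0], [0.3, 0.3, 0.3]]  # 5 toy image tokens
out = cross_attend(queries, image_tokens)
print(len(out), len(out[0]))

values_in = IMAGE_FEATURE_SHAPE[0] * IMAGE_FEATURE_SHAPE[1]
values_out = QUERY_SHAPE[0] * QUERY_SHAPE[1]
print(values_in, values_out)
```

The output always has as many rows as there are queries, no matter how many image tokens go in. With the shapes from the notes, 263168 image feature values are condensed into 24576 query values, roughly 10.7 times fewer, before anything reaches the LLM.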

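training

pre-training stage 1

Image Encoder (frozen) + Q-Former are trained together for vision-and-language representation learning; the Q-Former learns to extract the image features most relevant to the text

pre-training stage 2

Image Encoder (frozen) + Q-Former + LLM (frozen) are trained together for vision-to-language generative learning; the Q-Former is trained so that its output features can be decoded by the LLM into relevant text. This stage works with a decoder-only LLM (OPT, etc.) or an encoder-decoder LLM (FlanT5, etc.)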
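
The two stages can be written down as a tiny config, which makes it easy to see that the Q-Former is the only part receiving gradient updates in either stage. The `STAGES` layout and `trainable_modules` helper are hypothetical names used only for this sketch.

```python
# Hypothetical config mirroring the two pre-training stages described above.
STAGES = {
    1: {
        "objective": "vision-and-language representation learning",
        "modules": {"image_encoder": "frozen", "q_former": "trainable"},
    },
    2: {
        "objective": "vision-to-language generative learning",
        "modules": {"image_encoder": "frozen", "q_former": "trainable", "llm": "frozen"},
    },
}


def trainable_modules(stage: int) -> list:
    """Names of the modules whose weights are updated in the given stage."""
    modules = STAGES[stage]["modules"]
    return sorted(name for name, state in modules.items() if state == "trainable")


for stage in (1, 2):
    print(stage, STAGES[stage]["objective"], trainable_modules(stage))
```

In a real training script the "frozen" entries would typically translate into disabling gradients for those modules, so that only the Q-Former parameters reach the optimizer.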

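thoughts

Despite the name BLIP-2, it has little in common with BLIP apart from the data processing and the benchmarks it competes on. In essence it is multi-turn in-context VQA, or an early-stage GPT4 (this paper was published in 2023.01, GPT4 was released in 2023.03). Both GPT4 and BLIP2 had presumably been in the works long before that, though. Evidently everyone saw the promise of ChatGPT plus multimodal input/output early on.

ALBEF (Align before Fuse: Vision and Language Representation Learning with Momentum Distillation) is in fact the predecessor of BLIP. BLIP's model part is very similar to ALBEF, only ALBEF is not yet as versatile as BLIP.

The official BLIP-2 blog takes a jab at ~~OpenAI~~CloseAI, which cracked me up.

An interesting follow-up paper is ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions, and the title pretty much says it all. ChatGPT asks questions, BLIP-2 answers them based on the image content, and the multi-turn dialogue is finally summarized into a more detailed image caption. Of course, all of this only works if the LLMs on both sides are strong enough; otherwise it turns into two dimwits babbling at each other...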

BLIP series #25
