CLIP系列 - Githubissues

zc12345 commented 1 year ago

CLIP

background

以前方法的缺点

Costly datasets
- ImageNet: 25,000 workers to annotate 14 million images for 22,000 object categories
- CLIP: 全部是FREE的网络公开数据，400 million text-image pair
Narrow
- 新的类别需要重新finetune
- CLIP不需要finetune，zero-shot
Poor real-world performance
- 在benchmark上表现很好，但是实际部署环境效果很差；可能是因为模型调优的时候是对着benchmark做的，存在某种意义上的作弊
- CLIP zero-shot，甚至不需要见到训练和测试数据，不存在作弊，因此可以认为benchmark性能和real-world性能是一致的

伪代码

# image_encoder - ResNet or Vision Transformer 
# text_encoder - CBOW or Text Transformer 
# I[n, h, w, c] - minibatch of aligned images 
# T[n, l] - minibatch of aligned texts 
# W_i[d_i, d_e] - learned proj of image to embed 
# W_t[d_t, d_e] - learned proj of text to embed 
# t - learned temperature parameter 

# extract feature representations of each modality 
I_f = image_encoder(I) #[n, d_i] 
T_f = text_encoder(T) #[n, d_t] 

# joint multimodal embedding [n, d_e] 
I_e = l2_normalize(np.dot(I_f, W_i), axis=1) 
T_e = l2_normalize(np.dot(T_f, W_t), axis=1) 

# scaled pairwise cosine similarities [n, n] 
logits = np.dot(I_e, T_e.T) * np.exp(t) # symmetric loss function 
labels = np.arange(n) 
loss_i = cross_entropy_loss(logits, labels, axis=0) 
loss_t = cross_entropy_loss(logits, labels, axis=1) 
loss = (loss_i + loss_t)/2

模型选择上的insight

如何成功把自然语言监督应用到vision任务上去？
- 关键在于找到更高效的方式（模型、训练方式）scaling化
生成式模型 vs 对比学习
- 生成式模型能够学到更好的特征，但是代价是同样精度下需要比对比学习高一个数量级的计算量
过拟合问题
- 在400M数据量下这不是一个主要问题……
loss：在 $N\times N$ 的余弦相似度矩阵下的symmetric cross entropy loss
aug：只有resize + square crop保证图片尺寸一致
超参：temperature参数 $\tau$ ，用于控制softmax输出的logits范围
和ConVIRT的对比[^ref_zhihu]
- ConVIRT中的image encoder的参数是ImageNet初始化的，而CLIP直接用random初始化
- ConVIRT的projection head是non-linear的，而CLIP采用linear的projection
- CLIP去掉了ConVIRT中text transformation（指均匀从text中采样句子），因为CLIP数据集中有很多只出现过一次的（image，text）

[^ref_zhihu]: 【CLIP系列Paper解读】CLIP: Learning Transferable Visual Models From Natural Language Supervision

实现

image encoder：可以用任意image encoder，文中选取了五种类ResNet的CNN和三种ViT
- ResNet
  - ResNet-50/101
  - 基于Efficient魔改的类ResNet，参数是ResNet的4/16/64倍：RN50x4/ RN50x16/ RN50x64
- ViT
  - ViT-B/32, a ViT-B/16, and a ViT-L/14
text encoder：Transformer
- a 63M-parameter 12layer 512-wide model with 8 attention heads
- transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size
- the max sequence length was capped at 76（降低运算量考虑）
- CLIP的性能和text encoder模型容量影响不是很大（说明图像还是比文本更难？）
训练设置
- 32epoch
- 搜索最优超参：在第一个epoch基于baseline的ResNet50做grid searches, random search, and manual tuning
- temperature： $\tau$ 初始化为0.07，大于100做截断
- mini-batch大小为32768
- 各种工程trick用于节省显存占用，加速训练
  - gradient checkpointing
  - half-precision Adam statistics
  - half-precision stochastically rounded text encoder weights
  - ...

limitations

在抽象概念上做的不好
- 不会数数，比如图片中有多少物体
- 无法判断空间关系，比如一辆车的远近
不在pretrain数据集上的东西泛化性不好
- 在MNIST数据集只有88%（远低于人类的99.75%）
- zero-shot分类对prompt很敏感，需要prompt工程

思考

CLIP人称vanilla ConVIRT……就是把ConVIRT因为医学影像研究本身特点做的一些trick给去掉，换成了更适合自然图像的方式
其实CLIP做的最好的点是工程优化和钞能力，把一个naive的方法用大模型大数据在可容忍的时间内训出来了，而且效果是真的好，不finetune就打平ImageNet当时的SOTA
在精细化标注注定耗时耗力存在滞后性的情况下，无论是带大量脏数据的有监督，还是text-image pair，都是想办法提高数据利用率。什么样的数据是有效的？什么样的模式是更高效的？

zc12345 commented 1 year ago

Unifying image-caption and image-classification datasets with prefix conditioning

utils

TL;DR

背景
- CLIP等Vision-Language预训练模型对于解决open-set问题更有希望，在few/zero-shot上性能强大
- caption数据集和分类数据集
  - CLIP使用的image caption数据集数据分布更imbalance
  - 而用于分类的数据集ImageNet等数据标签更平衡，但是没法直接用于CLIP等的训练(常规方式是使用prompt如"A photo of a {label}."把label改写成caption)，会产生biased representations
思路：prompt tuning的方式把image caption数据集和classification数据集统一起来，从而消除两个数据集之间的bias
Prompt Tuning
- 针对LLM应用到不同下游任务的finetune需求，提出了一个比model finetune成本更低，比prompt design效率更高的方式
- 在text prompt输入到LLM打成token的时候，prefix几个可学习的token，每次迁移到不同下游任务只需要finetune soft prompt(learnable token)
具体方式：Prefix Conditioning
- 针对不同数据集分别学习不同的prefix token，以消除不同数据集间的偏置
domain shift问题
- 实际效果上caption数据集上的prefix token会导致类ImageNet的数据集上效果变好，但是domain偏差较大(cartoon/sketch style)的数据集效果反而变差，感觉并不是一个期望的结果。作者给出的解释是optimal prefix与test domain和classification dataset的差异有关？不同domain 的prefix token似乎应该有差异？

zc12345 commented 1 year ago

Unified Contrastive Learning in Image-Text-Label Space

TL;DR

2204.03610 CVPR 2022
把CLIP的预训练从image-text pair扩充到image-text-label triplet

chaos-moon / paper_daily

CLIP系列 #18

CLIP

background

以前方法的缺点

相关方法

method

伪代码

模型选择上的insight

实现

limitations

思考

Unifying image-caption and image-classification datasets with prefix conditioning

utils

TL;DR

Unified Contrastive Learning in Image-Text-Label Space

TL;DR