Imagen

title: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
arXiv: 2205.11487
blog

highlight:
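- A frozen, pretrained text encoder (e.g. Google's own T5-XXL) is very effective for text-to-image generation (didn't DALL-E do this too?)
- Scaling up the language model used as the text encoder helps more than scaling up the diffusion model (could it be that text-image alignment models such as CLIP just aren't good enough yet?)
- Proposes a new thresholding diffusion sampler (called dynamic thresholding in the paper) whose advantage is that it allows very large classifier-free guidance weights (not sure what this means concretely)
- Proposes a UNet architecture that is more compute-efficient, more memory-efficient, and converges faster
- SOTA: FID score of 7.27 on the COCO dataset (lower is better), and human raters judge the generated images to be on par with real images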
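Baselines compared against: VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2
Benchmark: DrawBench (how does this differ from Parti's benchmark?)
- Compositionality + cardinality (subjects/objects of different colors, counts, and shapes performing different actions)
- Spatial relations
- Long text prompts
- Rare words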
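
On the thresholding sampler and guidance weights from the highlights: a large classifier-free guidance weight tends to push the predicted clean image outside the valid pixel range [-1, 1], which shows up as saturated images. Static thresholding just clips to [-1, 1]; dynamic thresholding picks a per-step scale s from a high percentile of the absolute pixel values, clips to [-s, s], and divides by s. Below is a minimal NumPy sketch of that idea, not the paper's implementation. The function names, the 99.5 percentile default, and the single unbatched image are assumptions made for illustration.

```python
import numpy as np


def classifier_free_guidance(eps_cond, eps_uncond, w):
    """Blend conditional and unconditional noise predictions.

    w = 1 gives the plain conditional prediction (no guidance);
    w > 1 pushes the prediction further toward the text condition.
    """
    return w * eps_cond + (1.0 - w) * eps_uncond


def static_threshold(x0):
    """Clip the predicted clean image to the valid pixel range."""
    return np.clip(x0, -1.0, 1.0)


def dynamic_threshold(x0, percentile=99.5):
    """Clip to [-s, s] and rescale by s, with s a high percentile of |x0|.

    The percentile value is an illustrative assumption. If s <= 1 this
    reduces to static thresholding, so in-range predictions are unchanged.
    """
    s = np.percentile(np.abs(x0), percentile)
    s = max(s, 1.0)
    return np.clip(x0, -s, s) / s


# Hypothetical over-saturated x0 prediction, as a large w might produce.
rng = np.random.default_rng(0)
x0_pred = rng.normal(scale=3.0, size=(3, 8, 8))

static_out = static_threshold(x0_pred)
dynamic_out = dynamic_threshold(x0_pred)

# Fraction of pixels pinned at exactly +/-1 under each scheme.
static_saturated = np.mean(np.abs(static_out) == 1.0)
dynamic_saturated = np.mean(np.abs(dynamic_out) == 1.0)
print(f"saturated pixels: static={static_saturated:.3f}, dynamic={dynamic_saturated:.3f}")
```

In a sampler, the thresholding would be applied to the x0 prediction at every step, before computing the next noisy sample. Both schemes keep pixels in [-1, 1], but the dynamic version pins far fewer of them at the boundary, which is why it tolerates larger guidance weights.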

DALL-E 2
- Uses CLIP to obtain a latent prior, and also uses cascaded diffusion models
- Imagen does not need CLIP to obtain a latent prior; a pretrained frozen text encoder is enough
GLIDE
- Also uses cascaded diffusion models
- But does not use a pretrained frozen text encoder
XMC-GAN
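- Uses BERT as the text encoder
- But Imagen shows that a larger text encoder is more effective (...which sounds like saying something without really saying anything)
Parti and Imagen are sister papers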
