borisdayma / dalle-mini

DALL·E Mini - Generate images from a text prompt
https://www.craiyon.com
Apache License 2.0
14.7k stars 1.19k forks

Why do we use a diffusion model for the prior? #313

Open xiaotingxuan opened 1 year ago

xiaotingxuan commented 1 year ago

Hi, I am new to diffusion models.

According to the DALL·E 2 paper, the prior model is used to predict CLIP image embeddings from CLIP text embeddings. I think they designed this model to reduce the modality gap.

I just don't know why we need to use a diffusion model for the prior. I know we can, but why don't we use a simpler network (I don't know, maybe just an MLP) to implement the mapping from text embeddings to image embeddings? A diffusion model is quite expensive in terms of time and compute.

borisdayma commented 1 year ago

In the paper, they claim it trains faster, but you can probably get similar results without it.
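For illustration, here is a minimal sketch of the simpler alternative raised in the question: a small MLP trained by regression to map text embeddings to image embeddings. It is not code from this repository or from the DALL·E 2 paper. The function names (`init_mlp`, `forward`, `mse_step`, `train_demo`), the layer sizes, and the synthetic data standing in for CLIP embeddings are all assumptions made for the example.

```python
import numpy as np

def init_mlp(dim, hidden, seed=0):
    """Create weights for a hypothetical two-layer MLP prior."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 1.0 / np.sqrt(dim), (dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, dim)),
        "b2": np.zeros(dim),
    }

def forward(params, text_emb):
    """Map text embeddings (batch, dim) to predicted image embeddings."""
    h = np.maximum(text_emb @ params["w1"] + params["b1"], 0.0)  # ReLU
    return h @ params["w2"] + params["b2"], h

def mse_step(params, text_emb, image_emb, lr=0.1):
    """One gradient-descent step on the mean squared error; returns the loss."""
    pred, h = forward(params, text_emb)
    err = pred - image_emb
    loss = float(np.mean(err ** 2))
    # Manual backpropagation through the two layers.
    g_pred = 2.0 * err / err.size
    g_w2 = h.T @ g_pred
    g_b2 = g_pred.sum(axis=0)
    g_h = g_pred @ params["w2"].T
    g_h[h <= 0.0] = 0.0  # gradient of ReLU
    g_w1 = text_emb.T @ g_h
    g_b1 = g_h.sum(axis=0)
    for name, grad in (("w1", g_w1), ("b1", g_b1), ("w2", g_w2), ("b2", g_b2)):
        params[name] -= lr * grad
    return loss

def train_demo(steps=300, batch=64, dim=16, hidden=32):
    """Train on synthetic pairs (assumed stand-ins for CLIP embeddings)."""
    rng = np.random.default_rng(1)
    text_emb = rng.normal(size=(batch, dim))
    mixing = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    image_emb = text_emb @ mixing + 0.1 * rng.normal(size=(batch, dim))
    params = init_mlp(dim, hidden)
    losses = [mse_step(params, text_emb, image_emb) for _ in range(steps)]
    return params, losses

if __name__ == "__main__":
    _, losses = train_demo()
    print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

One design point to keep in mind: a regression model like this returns a single prediction per caption, whereas a diffusion prior can sample different image embeddings for the same text. Whether that difference matters in practice is the open question in this thread.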