Hi, I'm new to diffusion models.
According to the DALL-E 2 paper, the prior model is used to predict CLIP image embeddings from CLIP text embeddings. I think they designed this model to bridge the modality gap between the two.
I just don't understand why the prior needs to be a diffusion model. I know it can be, but why not use a simpler network (maybe just an MLP) to implement the mapping from text embeddings to image embeddings? A diffusion model is quite expensive in terms of time and compute.
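To make the question concrete, here is a minimal sketch of the kind of simple baseline I have in mind. It is not from the paper: the names (`mlp_prior`, `make_linear`), the toy sizes, the random untrained weights, and the final L2 normalisation are all my own assumptions, and it is written in plain Python only so that it runs without any framework.

```python
import math
import random

def make_linear(in_dim, out_dim, rng):
    """Hypothetical helper: random (out_dim x in_dim) weights and a zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def linear(x, layer):
    """Apply y = W x + b to a single vector."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def l2_normalize(x):
    """Scale a vector to unit length (assumption: embeddings are compared by cosine)."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]

def mlp_prior(text_emb, layers):
    """Deterministic map: one text embedding in, one predicted image embedding out."""
    h = text_emb
    for layer in layers[:-1]:
        h = relu(linear(h, layer))
    return l2_normalize(linear(h, layers[-1]))

# Toy sizes for illustration only; real CLIP embeddings are much larger.
EMBED_DIM = 8
HIDDEN_DIM = 16

rng = random.Random(0)
layers = [
    make_linear(EMBED_DIM, HIDDEN_DIM, rng),
    make_linear(HIDDEN_DIM, EMBED_DIM, rng),
]

# Stand-in for a CLIP text embedding (random here, not a real one).
text_emb = l2_normalize([rng.gauss(0, 1) for _ in range(EMBED_DIM)])
image_emb_pred = mlp_prior(text_emb, layers)
```

With trained weights, this would give exactly one predicted image embedding per text embedding, in a single forward pass. That is the cheap alternative I am comparing against the diffusion prior.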