Closed edward3862 closed 7 months ago
Hi, thanks for the question. In our framework, we train two separate models, one for image-conditioned generation and one for text-conditioned generation. Moreover, we concatenate the conditioning tokens with the shape latents in the UNet-ViT.
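A minimal sketch of what "concatenate the conditioning tokens with the shape latents" means shape-wise. All sizes here (token width `d`, number of conditioning tokens, number of shape latents) are hypothetical placeholders, not the paper's actual values:

```python
import numpy as np

# Hypothetical sizes -- placeholders, not the paper's actual configuration.
d = 768          # shared token width after any input projection
L_cond = 1 + 77  # e.g. one global token plus 77 per-token embeddings (E_t or E_i)
L_shape = 256    # number of shape latents processed by the UNet-ViT

cond_tokens = np.random.randn(1, L_cond, d)     # conditioning tokens
shape_latents = np.random.randn(1, L_shape, d)  # noisy shape latents

# Concatenate along the sequence axis so attention layers can mix
# conditioning tokens and shape latents in one token stream.
tokens = np.concatenate([cond_tokens, shape_latents], axis=1)
print(tokens.shape)  # (1, 334, 768)
```

The key point is that concatenation along the sequence axis only requires both token sets to share the last (feature) dimension, which each model's own input projection can guarantee.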
Thanks for this interesting work!
I have one minor concern about the dimension of the image tokens E_i. As presented in Sec. 3.1, E_i has dimension (1 + L_i) × d and E_t has dimension (1 + L_t) × d.
In my understanding, the embedding dimensions of the image tokens E_i and the text tokens E_t should differ under CLIP ViT-L/14: 1024 for the image encoder and 768 for the text encoder. During inference, E_i or E_t is injected into the same cross-attention layer in the diffusion UNet. How is this dimension mismatch handled?
Please correct me if I have misunderstood. Thank you :)
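For context on why separately trained models sidestep the mismatch: in a standard cross-attention layer, only the key/value projections see the conditioning tokens, so their input width can be sized to whichever encoder feeds that model (1024 for CLIP image tokens, 768 for CLIP text tokens). A minimal sketch, with all layer sizes and token counts chosen as hypothetical placeholders:

```python
import numpy as np

def cross_attention(x, context, w_q, w_k, w_v):
    """Single-head cross-attention sketch: x attends over context."""
    q = x @ w_q                                    # (L_x, inner)
    k = context @ w_k                              # (L_c, inner)
    v = context @ w_v                              # (L_c, inner)
    scores = q @ k.T / np.sqrt(q.shape[-1])        # scaled dot-product
    att = np.exp(scores)
    att /= att.sum(axis=-1, keepdims=True)         # softmax over context
    return att @ v                                 # (L_x, inner)

inner, d_model = 320, 320                # hypothetical UNet widths
x = np.random.randn(16, d_model)         # shape latents inside the UNet

# Image-conditioned model: k/v projections sized for 1024-d image tokens.
E_i = np.random.randn(257, 1024)
out_i = cross_attention(x, E_i,
                        0.02 * np.random.randn(d_model, inner),
                        0.02 * np.random.randn(1024, inner),
                        0.02 * np.random.randn(1024, inner))

# Text-conditioned model: k/v projections sized for 768-d text tokens.
E_t = np.random.randn(78, 768)
out_t = cross_attention(x, E_t,
                        0.02 * np.random.randn(d_model, inner),
                        0.02 * np.random.randn(768, inner),
                        0.02 * np.random.randn(768, inner))

print(out_i.shape, out_t.shape)  # both (16, 320)
```

Since each model carries its own k/v weights, the 1024-vs-768 difference never meets a single shared layer; a learned linear projection to a common width would be the usual alternative if one model had to accept both.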