Closed edward3862 closed 7 months ago
Hi, thanks for the question. In our framework, we train two separate models, one for image-conditioned generation and one for text-conditioned generation. Moreover, we concatenate the conditioning tokens with the shape latents in the UNet-ViT.
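A minimal sketch of what "concatenate the conditioning tokens with the shape latents" means shape-wise. All sizes here (token width `d`, number of conditioning tokens, number of shape latents) are hypothetical placeholders, not the paper's actual values:

```python
import numpy as np

# Hypothetical sizes -- placeholders, not the paper's actual configuration.
d = 768          # shared token width after any input projection
L_cond = 1 + 77  # e.g. one global token plus 77 per-token embeddings (E_t or E_i)
L_shape = 256    # number of shape latents processed by the UNet-ViT

cond_tokens = np.random.randn(1, L_cond, d)     # conditioning tokens
shape_latents = np.random.randn(1, L_shape, d)  # noisy shape latents

# Concatenate along the sequence axis so attention layers can mix
# conditioning tokens and shape latents in one token stream.
tokens = np.concatenate([cond_tokens, shape_latents], axis=1)
print(tokens.shape)  # (1, 334, 768)
```

The key point is that concatenation along the sequence axis only requires both token sets to share the last (feature) dimension, which each model's own input projection can guarantee.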
Thanks for this interesting work!
I have one minor concern about the dimension of the image tokens E_i. As presented in Sec. 3.1, E_i has dimension (1 + L_i) × d and E_t has dimension (1 + L_t) × d.
In my understanding, the embedding dimensions of the image tokens E_i and the text tokens E_t should differ under CLIP ViT-L/14: 1024 for the image encoder and 768 for the text encoder. During inference, E_i or E_t is injected into the same cross-attention layer in the diffusion UNet. How is this dimension mismatch handled?
Please correct me if I have misunderstood. Thank you :)
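For context on why separately trained models sidestep the mismatch: in a standard cross-attention layer, only the key/value projections see the conditioning tokens, so their input width can be sized to whichever encoder feeds that model (1024 for CLIP image tokens, 768 for CLIP text tokens). A minimal sketch, with all layer sizes and token counts chosen as hypothetical placeholders:

```python
import numpy as np

def cross_attention(x, context, w_q, w_k, w_v):
    """Single-head cross-attention sketch: x attends over context."""
    q = x @ w_q                                    # (L_x, inner)
    k = context @ w_k                              # (L_c, inner)
    v = context @ w_v                              # (L_c, inner)
    scores = q @ k.T / np.sqrt(q.shape[-1])        # scaled dot-product
    att = np.exp(scores)
    att /= att.sum(axis=-1, keepdims=True)         # softmax over context
    return att @ v                                 # (L_x, inner)

inner, d_model = 320, 320                # hypothetical UNet widths
x = np.random.randn(16, d_model)         # shape latents inside the UNet

# Image-conditioned model: k/v projections sized for 1024-d image tokens.
E_i = np.random.randn(257, 1024)
out_i = cross_attention(x, E_i,
                        0.02 * np.random.randn(d_model, inner),
                        0.02 * np.random.randn(1024, inner),
                        0.02 * np.random.randn(1024, inner))

# Text-conditioned model: k/v projections sized for 768-d text tokens.
E_t = np.random.randn(78, 768)
out_t = cross_attention(x, E_t,
                        0.02 * np.random.randn(d_model, inner),
                        0.02 * np.random.randn(768, inner),
                        0.02 * np.random.randn(768, inner))

print(out_i.shape, out_t.shape)  # both (16, 320)
```

Since each model carries its own k/v weights, the 1024-vs-768 difference never meets a single shared layer; a learned linear projection to a common width would be the usual alternative if one model had to accept both.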