NeuralCarver / Michelangelo

[NeurIPS 2023] Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation
https://neuralcarver.github.io/michelangelo/
GNU General Public License v3.0

Question about the dimension of image tokens E_i #3

Closed edward3862 closed 7 months ago

edward3862 commented 11 months ago

Thanks for this interesting work!

I have one minor concern about the dimension of the image tokens E_i. As presented in Sec. 3.1, the dimension of E_i is (1+L_i)×d, and the dimension of E_t is (1+L_t)×d.

In my understanding, under CLIP ViT-L/14 the image tokens E_i and text tokens E_t should have different widths: 1024 and 768, respectively. During inference, E_i or E_t is injected into the same cross-attention layer in the diffusion UNet. How is this dimension mismatch handled?

Please point it out if I have misunderstood. Thank you:)

Maikouuu commented 7 months ago

Hi, thanks for the question. In our framework, we train two separate models, one for image-conditioned generation and one for text-conditioned generation. Moreover, we concatenate the conditioning tokens with the shape latents in the UNet-ViT.
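For readers following the thread, here is a minimal shape-level sketch of why two separate models sidestep the width mismatch. It is not taken from the Michelangelo code. The reply does not say how the conditioning tokens are brought to the latent width, so the per-model linear projection below is an assumption, and all names (`linear`, `concat_tokens`, `w_img`, `w_txt`) and the toy token counts and latent width are hypothetical. Only the widths 1024 and 768 come from the question.

```python
import random

def random_matrix(rows, cols, seed):
    """Toy stand-in for real features or learned weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def shape(tokens):
    return (len(tokens), len(tokens[0]))

def linear(tokens, weight):
    """Project each token (row) with a weight of shape (in_width, out_width)."""
    in_width, out_width = shape(weight)
    if any(len(tok) != in_width for tok in tokens):
        raise ValueError("token width does not match projection input width")
    return [
        [sum(tok[k] * weight[k][j] for k in range(in_width)) for j in range(out_width)]
        for tok in tokens
    ]

def concat_tokens(cond, latents):
    """Concatenate along the token axis; every token must share one width."""
    width = len(latents[0])
    if any(len(tok) != width for tok in cond):
        raise ValueError("conditioning width does not match latent width")
    return cond + latents

LATENT_WIDTH = 8                                # hypothetical toy width
image_tokens = random_matrix(5, 1024, seed=0)   # width 1024, as in the question
text_tokens = random_matrix(3, 768, seed=1)     # width 768, as in the question
shape_latents = random_matrix(4, LATENT_WIDTH, seed=2)

# Assumption: each model owns its projection, so the two never share weights.
w_img = random_matrix(1024, LATENT_WIDTH, seed=3)  # image-conditioned model only
w_txt = random_matrix(768, LATENT_WIDTH, seed=4)   # text-conditioned model only

image_seq = concat_tokens(linear(image_tokens, w_img), shape_latents)
text_seq = concat_tokens(linear(text_tokens, w_txt), shape_latents)

print(shape(image_seq))  # (9, 8): 5 image tokens + 4 shape latents
print(shape(text_seq))   # (7, 8): 3 text tokens + 4 shape latents
```

Because each model only ever sees one modality, its input layer can be sized for that modality alone. Feeding `text_tokens` through `w_img` raises a `ValueError` in this sketch, which is the mismatch the question describes.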