Closed pptrick closed 5 months ago
Hi,
Thank you for your interest. Actually, we conducted several experiments with DiT at that time. Empirically, we found that U-ViT demonstrated greater stability during the training process.
Therefore, after considering both the model capabilities and the time constraints, we decided to use U-ViT as the network architecture for our project.
Thank you so much for your answer, that makes sense! I'm asking because some papers conclude that it's better to pass conditions through cross-attention, with structures like the one you implemented in your DiT, instead of concatenating them with the input.
I'm also a little bit curious about why the VAE's latent size is [256, 64] in reconstruction but becomes [257, 64] in image2mesh DDIM sampling.
Cross-attention is a more flexible architecture and may be suitable for more cases than U-ViT.
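To make the contrast concrete, here is a toy numpy sketch (not the authors' code; shapes and the single-head attention are illustrative assumptions) of the two conditioning styles being discussed: U-ViT-style concatenation of condition tokens into the sequence versus DiT-style cross-attention where latents query the conditions.

```python
import numpy as np

# Illustrative sketch (not the repo's implementation): two common ways to
# inject condition tokens into a transformer block, with toy shapes.
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 64))   # shape latent tokens (256 x 64)
c = rng.standard_normal((257, 64))   # condition tokens (e.g. CLIP features)

def attention(q, k, v):
    """Minimal single-head scaled dot-product attention."""
    w = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# (a) Concatenation (U-ViT style): conditions join the token sequence and
# plain self-attention mixes everything; afterwards only the latent
# positions are kept.
xc = np.concatenate([c, x], axis=0)          # (513, 64)
out_concat = attention(xc, xc, xc)[len(c):]  # back to (256, 64)

# (b) Cross-attention (DiT-with-cross-attention style): latents are the
# queries, conditions supply keys and values.
out_cross = attention(x, c, c)               # (256, 64)

print(out_concat.shape, out_cross.shape)
```

Both paths return one output per latent token; the difference is that concatenation lets conditions and latents attend to each other symmetrically, while cross-attention keeps the condition length decoupled from the latent sequence.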
Regarding the size of latents, our shape VAE produces latents of size 256x64. During the generation process, the diffusion model generates latents of the same size for further decoding into a neural field. The 257 refers to the length of the conditional tokens. We typically extract the feature of the last layer from the CLIP image encoder rather than its final embeddings. The last hidden layer tokens capture global and local information from the conditions. This approach helps the model learn to match the conditions to the target 3D shape latent in terms of global semantic information and aspects of the local semantic parts.
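The 257 is consistent with the token count of a ViT-style CLIP image encoder. As a hedged sketch (assuming a ViT-L/14-style encoder at 224×224 input, which the repo may or may not use), the last hidden layer emits one token per image patch plus a [CLS] token:

```python
# Assumption for illustration: a CLIP ViT-L/14-style encoder at 224x224.
# Its last hidden layer yields one token per patch plus one [CLS] token.

def clip_vit_token_count(image_size: int = 224, patch_size: int = 14) -> int:
    """Number of last-hidden-layer tokens: (image/patch)^2 patches + [CLS]."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1

print(clip_vit_token_count())  # 16*16 patches + 1 = 257
```

So the [257, 64] tensor seen during image2mesh sampling is the conditional token sequence, not the shape latent itself, which stays at [256, 64].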
Hi Authors,
Thank you for your great work! I noticed that you implemented both a UNet-like Transformer and DiT in your code base but eventually chose the UNet-like Transformer. May I ask what your concerns were? Are there any experiments that show which one is better?