Closed pptrick closed 5 months ago
Hi,
Thank you for your interest. Actually, we conducted several experiments with DiT at that time. Empirically, we found that U-ViT demonstrated greater stability during the training process.
Therefore, after considering both the model capabilities and the time constraints, we decided to use U-ViT as the network architecture for our project.
Thank you so much for your answer, that makes sense! I'm asking because some papers conclude that it's better to pass conditions through cross-attention, with structures like the one you implemented in your DiT, instead of concatenating them with the input.
I'm also a little bit curious about why the VAE's latent size is [256, 64] in reconstruction but becomes [257, 64] in image2mesh DDIM sampling.
Cross-attention is a more flexible architecture and may be suitable for more cases than U-ViT.
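To make the contrast concrete, here is a toy numpy sketch (not the authors' code; shapes and the single-head attention are illustrative assumptions) of the two conditioning styles being discussed: U-ViT-style concatenation of condition tokens into the sequence versus DiT-style cross-attention where latents query the conditions.

```python
import numpy as np

# Illustrative sketch (not the repo's implementation): two common ways to
# inject condition tokens into a transformer block, with toy shapes.
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 64))   # shape latent tokens (256 x 64)
c = rng.standard_normal((257, 64))   # condition tokens (e.g. CLIP features)

def attention(q, k, v):
    """Minimal single-head scaled dot-product attention."""
    w = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# (a) Concatenation (U-ViT style): conditions join the token sequence and
# plain self-attention mixes everything; afterwards only the latent
# positions are kept.
xc = np.concatenate([c, x], axis=0)          # (513, 64)
out_concat = attention(xc, xc, xc)[len(c):]  # back to (256, 64)

# (b) Cross-attention (DiT-with-cross-attention style): latents are the
# queries, conditions supply keys and values.
out_cross = attention(x, c, c)               # (256, 64)

print(out_concat.shape, out_cross.shape)
```

Both paths return one output per latent token; the difference is that concatenation lets conditions and latents attend to each other symmetrically, while cross-attention keeps the condition length decoupled from the latent sequence.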
Regarding the size of latents, our shape VAE produces latents of size 256x64. During the generation process, the diffusion model generates latents of the same size for further decoding into a neural field. The 257 refers to the length of the conditional tokens. We typically extract the feature of the last layer from the CLIP image encoder rather than its final embeddings. The last hidden layer tokens capture global and local information from the conditions. This approach helps the model learn to match the conditions to the target 3D shape latent in terms of global semantic information and aspects of the local semantic parts.
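The 257 is consistent with the token count of a ViT-style CLIP image encoder. As a hedged sketch (assuming a ViT-L/14-style encoder at 224×224 input, which the repo may or may not use), the last hidden layer emits one token per image patch plus a [CLS] token:

```python
# Assumption for illustration: a CLIP ViT-L/14-style encoder at 224x224.
# Its last hidden layer yields one token per patch plus one [CLS] token.

def clip_vit_token_count(image_size: int = 224, patch_size: int = 14) -> int:
    """Number of last-hidden-layer tokens: (image/patch)^2 patches + [CLS]."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1

print(clip_vit_token_count())  # 16*16 patches + 1 = 257
```

So the [257, 64] tensor seen during image2mesh sampling is the conditional token sequence, not the shape latent itself, which stays at [256, 64].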
Hi Authors,
Thank you for your great work! I noticed that you implemented both a UNet-like Transformer and DiT in your code base but eventually chose the UNet-like Transformer. May I ask what your concerns were? Are there any experiments that show which one is better?