3DTopia / LGM

[ECCV 2024 Oral] LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation.
https://me.kiui.moe/lgm/
MIT License

pixel level or latent level #26

Closed wenqsun closed 8 months ago

wenqsun commented 9 months ago

Thanks for your great work!

I noticed that in your work the image input is at the pixel level (consistent with the prior work, Splatter Image). I am wondering whether we could use a VAE to encode the image into a latent code and train the U-Net in latent space. If not, what would be the concern or drawback?

Thanks for your reply!

ashawkey commented 9 months ago

@wenqsun Hi, since we interpret the output features as Gaussians, and the number of Gaussians is determined by the output resolution, it may be problematic to use a latent space that compresses the spatial dimensions. It may require a different design to make it work.
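
To make the concern concrete, here is a rough sketch of how the Gaussian count scales with output resolution. It assumes one Gaussian per output pixel per view, and that a latent-space U-Net would emit features at the latent resolution. The numbers (4 views, a 128x128 output map, an 8x VAE downsampling factor) and the helper names are illustrative assumptions, not values taken from this thread or from the repo.

```python
def num_gaussians(num_views: int, out_h: int, out_w: int) -> int:
    """Assumes each output pixel of each view is decoded into one Gaussian."""
    return num_views * out_h * out_w


def latent_resolution(h: int, w: int, downsample: int) -> tuple[int, int]:
    """Spatial size after a VAE that shrinks each side by `downsample` (hypothetical)."""
    if h % downsample or w % downsample:
        raise ValueError("resolution must be divisible by the downsampling factor")
    return h // downsample, w // downsample


# Illustrative values only (assumptions).
NUM_VIEWS = 4
OUT_H = OUT_W = 128
VAE_DOWNSAMPLE = 8

# Pixel-level output: one Gaussian per pixel of each view.
pixel_count = num_gaussians(NUM_VIEWS, OUT_H, OUT_W)

# Latent-level output: the feature map is spatially compressed first.
lat_h, lat_w = latent_resolution(OUT_H, OUT_W, VAE_DOWNSAMPLE)
latent_count = num_gaussians(NUM_VIEWS, lat_h, lat_w)

print(f"pixel-level Gaussians:  {pixel_count}")
print(f"latent-level Gaussians: {latent_count}")
print(f"reduction factor:       {pixel_count // latent_count}x")
```

Under these assumptions the Gaussian count drops by the square of the downsampling factor, so a latent-space variant would presumably need some way to recover the count, for example predicting several Gaussians per latent location or upsampling before the output head.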

wenqsun commented 9 months ago

Thanks for your explanation. I think your concern makes sense, and I will try some designs.