cvlab-kaist / 3DFuse

Official implementation of "Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation"

Questions related to paper #8

Closed Crane-YU closed 1 year ago

Crane-YU commented 1 year ago

Hi @j0seo, nice repo, and thanks for sharing your code. Just one question related to the semantic code sampling. As shown in the pipeline, the image generated during semantic code sampling is used directly for coarse 3D point cloud generation. What if the generated object is shadowed or incomplete (e.g., the generated image contains only the upper body)? Do you have to manually pick the images for concept learning and the coarse 3D generation?

[figure: 3DFuse pipeline diagram]

j0seo commented 1 year ago

Hi @Crane-YU , thank you for your interest in our work. The results presented in the paper and on the project page were generated end-to-end with a fixed seed, without manually selecting images. Adding a prefix like "a front view of" to the user prompt, or using unCLIP-based diffusion models such as Karlo, mitigates this issue. Additionally, Point-E only requires the condition image to be roughly aligned, because it conditions on the CLIP feature of the image, unlike point cloud reconstruction models such as MCC.
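
For readers skimming this thread, here is a minimal sketch of the two mitigations mentioned above: prepending a view prefix to the prompt before sampling the image, and letting Point-E produce the coarse point cloud conditioned on that image. This is not the repository's actual code; the Stable Diffusion checkpoint, prompt, and hyperparameters are illustrative assumptions, and the Point-E calls follow the public `point_e` image-to-point-cloud example.

```python
# Minimal sketch, not 3DFuse's exact code. Checkpoint names, the prompt, and
# hyperparameters are illustrative assumptions; the Point-E calls follow the
# public point_e image-to-point-cloud example.
import torch
from diffusers import StableDiffusionPipeline
from point_e.diffusion.configs import DIFFUSION_CONFIGS, diffusion_from_config
from point_e.diffusion.sampler import PointCloudSampler
from point_e.models.configs import MODEL_CONFIGS, model_from_config
from point_e.models.download import load_checkpoint

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1) Semantic code sampling with a view prefix, so the sampled image is more
#    likely to show the whole object rather than a cropped or partial view.
prompt = "a front view of " + "a standing golden retriever"  # hypothetical prompt
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
image = pipe(prompt, generator=torch.Generator(device=device).manual_seed(0)).images[0]

# 2) Coarse geometry with Point-E. It conditions on the CLIP feature of the
#    image, so the view only needs to be roughly aligned.
base = model_from_config(MODEL_CONFIGS["base40M"], device).eval()
base.load_state_dict(load_checkpoint("base40M", device))
diffusion = diffusion_from_config(DIFFUSION_CONFIGS["base40M"])

sampler = PointCloudSampler(
    device=device,
    models=[base],
    diffusions=[diffusion],
    num_points=[1024],
    aux_channels=["R", "G", "B"],
    guidance_scale=[3.0],
)
samples = None
for x in sampler.sample_batch_progressive(batch_size=1, model_kwargs=dict(images=[image])):
    samples = x
coarse_point_cloud = sampler.output_to_point_clouds(samples)[0]
```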

In the Gradio demo, we implemented a step-by-step process that lets users confirm the desired shape of the point cloud before generating the 3D output, in order to emphasize the controllability aspect.
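
As a rough illustration of that step-by-step flow (again, not the demo's actual code), a Gradio layout along these lines exposes each stage behind its own button, so the user can inspect the sampled image and the coarse point cloud before committing to the 3D optimization. The helper functions named here are hypothetical wrappers around the pipeline stages.

```python
# Rough sketch of a step-by-step Gradio UI; component labels and the helper
# functions sample_image / image_to_pointcloud / optimize_3d are hypothetical.
import gradio as gr

def sample_image(prompt, seed):
    ...  # placeholder: semantic code sampling (text-to-image with a fixed seed)

def image_to_pointcloud(image):
    ...  # placeholder: Point-E coarse point cloud from the sampled image

def optimize_3d(prompt, image):
    ...  # placeholder: the 3D optimization stage, run only after confirmation

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    seed = gr.Number(value=0, label="Seed")
    img = gr.Image(label="Sampled image (confirm before continuing)")
    pc = gr.Image(label="Coarse point cloud preview")
    out = gr.Video(label="Final 3D result")

    # Each stage runs only when the user clicks, so intermediate results can be
    # inspected (and the seed or prompt changed) before the next step.
    gr.Button("1. Sample image").click(sample_image, [prompt, seed], img)
    gr.Button("2. Generate point cloud").click(image_to_pointcloud, img, pc)
    gr.Button("3. Optimize 3D").click(optimize_3d, [prompt, img], out)

demo.launch()
```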

Crane-YU commented 1 year ago

Thank you for your reply.