cvlab-columbia / zero123

Zero-1-to-3: Zero-shot One Image to 3D Object (ICCV 2023)
https://zero123.cs.columbia.edu/
MIT License

Extract mesh for SJC #38

Closed · TLB-MISS closed 1 year ago

TLB-MISS commented 1 year ago

Hi! First of all, thank you so much for releasing such a wonderful work. I've checked this issue but still have a question. The paper says that 3D reconstruction was performed with SJC, not a NeRF. Even the 3D reconstruction section of the README shows run_zero123.py, which uses SJC, as an example. However, there is no part of the SJC code that exports a 3D mesh. Can you tell me why?

Thanks.

ruoshiliu commented 1 year ago

SJC optimizes a voxel NeRF given a text prompt. Our framework uses an NVS network (zero123) to train a voxel NeRF with techniques similar to those used in SJC. In addition, the linked issue provides a function to convert the trained voxel NeRF to a mesh and export it. Hope this answers your question.
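
For anyone landing here later: mesh extraction from a voxel grid is typically done by thresholding the density field and running marching cubes. Here is a minimal sketch using PyMCubes; the `density_grid` input, threshold value, and coordinate normalization are assumptions for illustration, not the repo's actual API:

```python
import numpy as np
import mcubes  # PyMCubes: pip install PyMCubes

def voxel_nerf_to_mesh(density_grid: np.ndarray,
                       threshold: float = 10.0,
                       out_path: str = "mesh.obj"):
    """Extract a triangle mesh from an (N, N, N) voxel density grid.

    `threshold` is the iso-surface level on the density (sigma) values;
    it is a guess here and usually needs per-scene tuning.
    """
    # Marching cubes triangulates the surface where density == threshold.
    vertices, triangles = mcubes.marching_cubes(density_grid, threshold)
    # Map vertex coordinates from voxel indices back to [-1, 1]^3.
    vertices = vertices / (density_grid.shape[0] - 1) * 2.0 - 1.0
    mcubes.export_obj(vertices, triangles, out_path)
    return vertices, triangles
```

The threshold matters in practice because distilled densities are not calibrated, so expect to tune it per scene.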

sjtuzq commented 1 year ago

Hi, thanks for sharing this cool work.

I have a question about the difference between zero123 and SJC with respect to 3D reconstruction. It seems that the novel-view images generated by zero123 are not used during the reconstruction stage, since the index is 0 in the run_zero123.py file.

The original image is only used to set the model embeddings (model.clip_emb and model.vae_emb), and everything else is essentially the same as the SJC method. Since the novel-view images are not used, why is this better than SJC?

ruoshiliu commented 1 year ago

Hi @sjtuzq , it's a good question. First of all, SJC is not a model for 3D reconstruction, so it's not directly comparable. We did convert SJC into SJC-I, which basically replaces the text-conditioned Stable Diffusion with an image-conditioned Stable Diffusion (see the paper for more details). In comparison to SJC-I, we further replace the image-conditioned Stable Diffusion with an image-pose-conditioned novel-view-synthesis Stable Diffusion, which is zero123.

In my opinion, this is better than SJC-I for 3D reconstruction for multiple reasons. Here are two I can think of now (a rough sketch of the conditioning follows the list):

  1. zero123 is trained to generate a novel view of the same 3D asset, which implicitly enforces that the voxel NeRF trained by zero123 is consistent with the input image. In comparison, image-conditioned Stable Diffusion generates variations of the input image (see some examples shown here).
  2. SJC-I, like many other 3D reconstruction methods based on diffusion distillation (such as NeRDi, RealFusion, and NeuralLift), is highly susceptible to the so-called Janus problem because of the viewpoint bias of Stable Diffusion: it considers images of objects in a canonical pose to be more likely than less frequently seen poses. Since zero123 is finetuned on Objaverse with camera poses sampled randomly and without bias, the finetuning process implicitly corrects the viewpoint bias of Stable Diffusion.
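
To make the image-pose conditioning concrete: per the paper, the relative camera transform (delta elevation, delta azimuth, delta radius) is embedded together with the CLIP image embedding to condition the diffusion model. Below is a minimal sketch of that fusion step; the class name, layer layout, and dimensions are illustrative assumptions rather than the actual zero123 code:

```python
import torch
import torch.nn as nn

class PoseConditioner(nn.Module):
    """Illustrative: fuse a CLIP image embedding with the relative camera
    transform (delta elevation, delta azimuth, delta radius) into one
    conditioning vector for the diffusion UNet."""

    def __init__(self, clip_dim: int = 768):
        super().__init__()
        # 4 pose features; sin/cos keeps the azimuth angle continuous.
        self.proj = nn.Linear(clip_dim + 4, clip_dim)

    def forward(self, clip_emb, d_elev, d_azim, d_radius):
        # clip_emb: (B, clip_dim); d_elev / d_azim / d_radius: (B,)
        pose = torch.stack(
            [d_elev, torch.sin(d_azim), torch.cos(d_azim), d_radius], dim=-1)
        return self.proj(torch.cat([clip_emb, pose], dim=-1))
```
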
TLB-MISS commented 1 year ago

@ruoshiliu

Oh, I misunderstood that issue. Thank you for kindly answering my question. I have another question: do you have any plans to support textured mesh export?

Thanks.

ruoshiliu commented 1 year ago

It's kindly implemented by @ashawkey in Stable-Dreamfusion!

TLB-MISS commented 1 year ago

@ruoshiliu

Thanks! I'll check it!

TLB-MISS commented 1 year ago

One last question: if I run zero123 on my own dataset, does transforms_train.json significantly affect the result? Currently, I am using the transforms from one of the provided datasets (pikachu). Will that cause any problems?

ruoshiliu commented 1 year ago

The only input the model requires in addition to an image is an elevation angle (sorry if this is a little confusing). So probably the easiest thing to do here is to find an image in nerf_wild whose camera elevation angle w.r.t. the object looks similar to your image's, and replace that image with yours. I think the assumed angle for pikachu is around 15 degrees.
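
If you want to construct such a pose yourself rather than borrowing one from nerf_wild, a camera-to-world matrix in the usual NeRF transforms_train.json convention can be built from elevation/azimuth/radius like this (a hypothetical helper, not part of this repo; it assumes the standard OpenGL camera with z as world-up):

```python
import numpy as np

def camera_pose(elev_deg: float, azim_deg: float, radius: float) -> np.ndarray:
    """Camera-to-world matrix (OpenGL convention, camera looks down -z)
    for a camera on a sphere around the origin. Assumes elev != +/-90."""
    elev, azim = np.deg2rad(elev_deg), np.deg2rad(azim_deg)
    # Camera position on the sphere.
    eye = radius * np.array([np.cos(elev) * np.cos(azim),
                             np.cos(elev) * np.sin(azim),
                             np.sin(elev)])
    forward = -eye / np.linalg.norm(eye)              # points at the origin
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1] = right, up
    c2w[:3, 2], c2w[:3, 3] = -forward, eye            # -z is the view axis
    return c2w
```

For example, `camera_pose(15.0, 0.0, 1.5)` would roughly match the ~15-degree elevation assumed for pikachu (the radius here is a made-up value).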

TLB-MISS commented 1 year ago

@ruoshiliu

I got it! Thank you so much for your kind answers to so many questions!