SUDO-AI-3D / zero123plus

Code repository for Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model.
Apache License 2.0

training resolution 320^2 instead of 512^2? #70

Closed YuxuanSnow closed 3 months ago

YuxuanSnow commented 4 months ago

Dear Authors,

Thanks for the great effort and for open-sourcing the model.

I have a question regarding the inference resolution. As I understand it, the model diffuses a single image that tiles 6 views as a 2x3 grid of sub-images. At inference time a resolution of 640x960 is used, which means each view is 320x320. Are these also the images you used in One-2-3-45++ to construct the feature volume?

I also tried inference at a higher resolution (512x2, 512x3), and it generated the following image, which has 3x5 views. Is this expected? The middle column and the second and fourth rows look somewhat like interpolated camera poses, compared to the (320x2, 320x3) output, which has 2x3 views. [images: output, output_]
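For readers following the arithmetic, here is a minimal sketch of how a 640x960 grid image divides into six 320x320 views. The helper name `tile_boxes` is hypothetical and not part of the zero123plus code; it only computes crop rectangles in (left, top, right, bottom) order, the box convention that Pillow's `Image.crop` accepts.

```python
def tile_boxes(width, height, tile=320):
    """Return (left, top, right, bottom) boxes for each tile, row by row.

    Hypothetical helper for illustration; assumes the grid image is an
    exact multiple of the tile size in both dimensions.
    """
    if width % tile or height % tile:
        raise ValueError("image size is not a multiple of the tile size")
    boxes = []
    for top in range(0, height, tile):        # rows, top to bottom
        for left in range(0, width, tile):    # columns, left to right
            boxes.append((left, top, left + tile, top + tile))
    return boxes


boxes = tile_boxes(640, 960)
print(len(boxes))  # 2 columns x 3 rows of 320x320 views
for box in boxes:
    print(box)
```

Each box could then be passed to a crop call on the generated image to extract a single view.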

eliphatfs commented 3 months ago

How the model performs at other resolutions is actually unpredictable; it is interesting to see the model generate a 3x5 grid in this case. And yes, we use 320x320 as the resolution for reconstruction.
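As a purely arithmetic observation (not an explanation confirmed by the authors), the 3x5 grid reported above matches what you get by asking how many 320-pixel views fit along each side of a 1024x1536 image. The function `estimate_grid` below is a hypothetical sketch, and the assumption that the model keeps each view near 320 pixels is a guess.

```python
def estimate_grid(width, height, view_size=320):
    """Round the number of view_size-pixel views that fit along each axis.

    Hypothetical illustration only: it assumes each view stays close to
    view_size pixels, which the thread does not confirm.
    """
    cols = round(width / view_size)
    rows = round(height / view_size)
    return cols, rows


print(estimate_grid(640, 960))    # default inference resolution
print(estimate_grid(1024, 1536))  # the (512x2, 512x3) experiment
```

With these inputs the estimate gives 2 columns by 3 rows for 640x960 and 3 columns by 5 rows for 1024x1536, consistent with the two outputs described in this thread. As noted above, behavior at other resolutions remains unpredictable.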

YuxuanSnow commented 3 months ago

Interesting!!