ashawkey / stable-dreamfusion

Text-to-3D & Image-to-3D & Mesh Exportation with NeRF + Diffusion.
Apache License 2.0

[Stupid question ahead] Why not use Imagen-pytorch instead of Stable Diffusion if SD leads to extra training time? #125

Open · raspiduino opened this issue 1 year ago

raspiduino commented 1 year ago

I know it might be stupid to open this issue, but I can't find a Discussions tab on GitHub to ask in, and I'm new to text-to-image and text-to-3D. I'm writing this while waiting for my model to finish 5000 steps.

The README reads:

Since the Imagen model is not publicly available, we use Stable Diffusion to replace it (implementation from diffusers). Different from Imagen, Stable-Diffusion is a latent diffusion model, which diffuses in a latent space instead of the original image space. Therefore, we need the loss to propagate back from the VAE's encoder part too, which introduces extra time cost in training. Currently, 10000 training steps take about 3 hours to train on a V100.
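For anyone else reading, here is a rough sketch of what that extra step looks like in code, assuming diffusers-style `vae`, `unet`, and `scheduler` objects and a differentiable NeRF rendering `rendered_rgb` (these names and the exact loss weighting are illustrative, not the repo's actual implementation):

```python
import torch
import torch.nn.functional as F

# Sketch of score distillation sampling (SDS) with a latent diffusion model.
# `vae`, `unet`, `scheduler` are assumed to come from a diffusers Stable Diffusion
# checkpoint; `text_embeddings` is assumed precomputed by the SD text encoder.
def sds_loss_latent(vae, unet, scheduler, rendered_rgb, text_embeddings):
    # Stable Diffusion diffuses in latent space, so the rendered image must
    # first pass through the VAE encoder. Gradients flow back through this
    # encoder too, which is the extra cost the README mentions.
    latents = vae.encode(rendered_rgb * 2 - 1).latent_dist.sample() * 0.18215

    # Sample a random timestep and add noise in latent space.
    t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # Predict the noise with the frozen UNet (no gradient through the UNet itself;
    # classifier-free guidance is omitted for brevity).
    with torch.no_grad():
        noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeddings).sample

    # SDS gradient is (noise_pred - noise); this MSE trick produces exactly that
    # gradient with respect to the latents.
    grad = noise_pred - noise
    return 0.5 * F.mse_loss(latents, (latents - grad).detach(), reduction='sum')
```

With a pixel-space model like Imagen, the `vae.encode` line would simply not exist, which is where my question comes from.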

I understand that there is something that makes SD different from Google's Imagen, and that it requires an extra conversion step and therefore extra time for each iteration.

So my question is: instead of using SD, can we use Imagen-pytorch, an open-source PyTorch implementation of Google's Imagen, to generate the image? Would that reduce the training time?

Thank you! And thanks for this wonderful repo!

ashawkey commented 1 year ago

@raspiduino Hi, this is simply because there is no publicly available pretrained checkpoint for Imagen (in fact, Stable Diffusion is the only large pretrained text-to-image model we can access).

raspiduino commented 1 year ago

Thank you for replying! Can I ask another question?

I saw the option to use CLIP instead of Stable Diffusion. I tried it (by passing the corresponding parameter to main.py), but the generated 3D model has really low quality, even after training for 5000 steps (which should be enough with Stable Diffusion).

The CLIP version also runs much faster (about 10 minutes), but its quality is really bad. So my question is: why is the quality so poor, and how can it be improved? Thanks!

ashawkey commented 1 year ago

CLIP guidance is in fact the approach of the earlier work DreamFields, and its quality is indeed worse. You can find some good examples here: https://github.com/shengyu-meng/dreamfields-3D
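For context, a minimal sketch of what DreamFields-style CLIP guidance does, assuming OpenAI's `clip` package (this is an illustration of the idea, not the repo's exact code; CLIP's input normalization is omitted for brevity):

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP, assumed installed from github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

def clip_guidance_loss(rendered_rgb, text):
    """Push the rendered view toward higher CLIP similarity with the prompt."""
    # Encode the text prompt (re-encoded each call here for simplicity).
    tokens = clip.tokenize([text]).to(device)
    text_feat = model.encode_text(tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    # CLIP expects 224x224 inputs; resize the differentiable rendering.
    image = F.interpolate(rendered_rgb, size=(224, 224), mode="bilinear")
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    # Negative cosine similarity: lower is better.
    return -(img_feat * text_feat).sum(dim=-1).mean()
```

Because there is no diffusion prior here, only the CLIP similarity signal shapes the scene, which is presumably why the geometry and texture come out much weaker than with SDS on Stable Diffusion.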