Why latent encode in stablediffusion needs grad?

ashawkey / stable-dreamfusion

Text-to-3D & Image-to-3D & Mesh Exportation with NeRF + Diffusion.

Apache License 2.0


Why latent encode in stablediffusion needs grad? #171

Open cwchenwang opened 1 year ago

cwchenwang commented 1 year ago

I think the encoder is part of the diffusion model and doesn't need to be trained. So why are you training it here?

jfozard commented 1 year ago

It is possible to apply the loss on the decoded image instead (and not pass gradients through the VAE encoder). However, in my experience the results weren't as good, and it isn't faithful to the description in the original DreamFusion paper. I think this alternative is mentioned in the Score Jacobian Chaining paper.

DeweiHu commented 1 year ago

Hi, I wonder if you have figured this out. I don't get why the encoder needs to be updated, no matter where the loss is applied. An image-space loss would probably result in a blurry/over-saturated outcome, as there is no constraint on the consistency between the generated images.
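One point that may help with the question above: letting gradients flow *through* the encoder is not the same as training it. Here is a minimal PyTorch sketch, not the repository's actual code. `encoder` is a hypothetical tiny conv layer standing in for the frozen VAE encoder, `rendered` stands in for the NeRF render, and `latent_grad` is a random placeholder for the gradient a latent-space loss would supply. The encoder weights are frozen, yet the rendered image still receives a gradient because the forward pass is not wrapped in `torch.no_grad()`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the frozen VAE encoder (64x64 image -> 8x8 latent).
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)
encoder.requires_grad_(False)  # weights are frozen: they never receive gradients

# Hypothetical stand-in for the NeRF render; in practice this tensor would
# depend on the NeRF parameters, which are what actually gets optimized.
rendered = torch.rand(1, 3, 64, 64, requires_grad=True)

# Encode WITHOUT torch.no_grad(), so autograd records the path image -> latent.
latents = encoder(rendered)

# Placeholder for the gradient that a latent-space loss would put on the latents.
latent_grad = torch.randn_like(latents)
latents.backward(gradient=latent_grad)

# The gradient reached the image (and would continue on to the NeRF) ...
image_got_grad = rendered.grad is not None
# ... while the frozen encoder parameters accumulated nothing.
encoder_got_grad = any(p.grad is not None for p in encoder.parameters())

# For contrast: encoding under no_grad cuts the path back to the image.
with torch.no_grad():
    detached_latents = encoder(rendered)
path_is_cut = not detached_latents.requires_grad

print(image_got_grad, encoder_got_grad, path_is_cut)
```

So, under these assumptions, "needs grad" only means the encoder must stay differentiable with respect to its input so that a loss on the latents can reach the renderer; the encoder itself stays fixed.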