SUDO-AI-3D / zero123plus

Code repository for Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model.
Apache License 2.0

v-prediction #56

Closed: Xiang-cd closed this 7 months ago

Xiang-cd commented 9 months ago

Great work! The paper says: "Therefore, we have opted to utilize the Stable Diffusion 2 v-prediction model as our base model for fine-tuning." However, the code uses the usual sampling call with Stable Diffusion, and the default lambdalabs/sd-image-variations-diffusers model is an eps-prediction model. How was the model converted to v-prediction?

I tried to verify whether the released model is v-prediction or eps-prediction by adding noise to a clean latent and computing the loss of the model output against both the ground-truth eps and the ground-truth v. I found that the eps loss is smaller. Is that expected?
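For reference, here is a minimal sketch of that kind of check, assuming the diffusers `DDPMScheduler` API. The shapes and the random stand-in for the UNet output are placeholders, not code from this repo:

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

# v-prediction is declared in the scheduler config via prediction_type
scheduler = DDPMScheduler(prediction_type="v_prediction")

latents = torch.randn(1, 4, 64, 64)        # stand-in for a clean (x0) latent
noise = torch.randn_like(latents)
t = torch.tensor([500])

noisy_latents = scheduler.add_noise(latents, noise, t)   # forward diffusion
v_target = scheduler.get_velocity(latents, noise, t)     # ground-truth v

model_output = torch.randn_like(latents)   # replace with a real UNet forward pass

loss_eps = F.mse_loss(model_output, noise)      # loss against the eps target
loss_v = F.mse_loss(model_output, v_target)     # loss against the v target
print(loss_eps.item(), loss_v.item())
```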

eliphatfs commented 9 months ago
  1. v-prediction is set in the scheduler config. Also, the base model is stable-diffusion-2, not the image-variations model. We applied the two-phase training technique from image variations, but that has nothing to do with the base model itself.
  2. If you compare the loss value of an eps prediction to that of a v prediction, the ratio is theoretically (for a well-trained model) SNR / (1 + SNR), which means the eps loss will be smaller. This says nothing about reconstruction quality; it follows from the nature of Gaussian diffusion (see the sketch below).
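A quick numeric illustration of that ratio (a sketch, not code from this repo; the alpha_bar value and tensor sizes are arbitrary). Since eps can be written as sqrt(alpha_bar) * v + sqrt(1 - alpha_bar) * x_t, and x_t is an input to the model, an error in the v prediction maps to an eps error scaled by sqrt(alpha_bar), so the eps MSE is alpha_bar = SNR / (1 + SNR) times the v MSE:

```python
import torch

a = torch.tensor(0.7)                        # alpha_bar at some timestep (illustrative)
snr = a / (1 - a)                            # SNR = alpha_bar / (1 - alpha_bar)

x0 = torch.randn(1000)
eps = torch.randn(1000)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps   # forward diffusion
v = a.sqrt() * eps - (1 - a).sqrt() * x0     # ground-truth velocity

dv = 0.1 * torch.randn(1000)                 # pretend the model's v prediction is off by dv
v_pred = v + dv
eps_pred = a.sqrt() * v_pred + (1 - a).sqrt() * x_t   # eps implied by the v prediction

mse_v = (v_pred - v).pow(2).mean()
mse_eps = (eps_pred - eps).pow(2).mean()
print(mse_eps / mse_v)                       # equals alpha_bar = 0.7
print(snr / (1 + snr))                       # same value
```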
Xiang-cd commented 9 months ago

I was wondering what the expected loss scale is for v-prediction versus eps-prediction. In my experiments, the loss with eps is around 1, while with v it is around 15; both seem too large to me.

I also noticed that vae_scale_factor differs from the value used by most AutoencoderKL checkpoints (where the scaling factor is less than 1), and that there are scale_latents and scale_image functions. What are those for?

eliphatfs commented 9 months ago

These are tricks to bias the model towards more global coherence (in contrast to focusing on local details), which is important for multi-view generation.
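For what it's worth, helpers like scale_latents / scale_image are simple affine shift-and-scale operations applied to the VAE latents and the image. A rough sketch of what they look like follows; the constants here are illustrative placeholders, not necessarily the repo's actual values:

```python
# Shift and scale the VAE latents before they enter the diffusion model,
# and undo the transform before decoding. Constants are illustrative only.
def scale_latents(latents):
    return (latents - 0.22) * 0.75

def unscale_latents(latents):
    return latents / 0.75 + 0.22

# Re-scale the image range in the same spirit.
def scale_image(image):
    return image * 0.5 / 0.8

def unscale_image(image):
    return image / 0.5 * 0.8
```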