Closed Xiang-cd closed 7 months ago
The eps loss corresponds to the v loss weighted by

SNR / (1 + SNR)

which means the eps loss would be smaller. This says nothing about reconstruction quality; it simply follows from the nature of Gaussian diffusion.

I was wondering what the expected loss scale is for v-prediction and eps-prediction. In my runs, when I use eps the loss is around 1, while using v the loss is around 15. I think both are too large.
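The weighting above can be checked numerically: with x_t = sqrt(a)·x0 + sqrt(1-a)·eps and v = sqrt(a)·eps - sqrt(1-a)·x0 (where a is the cumulative alpha), the eps error implied by a v prediction is sqrt(a) times the v error, and a = SNR / (1 + SNR). A minimal NumPy sketch (illustrative only, not code from this repo):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bar = 0.3                                  # cumulative alpha at some timestep
a, s = np.sqrt(alpha_bar), np.sqrt(1 - alpha_bar)

x0 = rng.standard_normal(10_000)
eps = rng.standard_normal(10_000)
x_t = a * x0 + s * eps                           # forward diffusion
v = a * eps - s * x0                             # v-prediction target

v_pred = v + 0.1 * rng.standard_normal(10_000)   # a model with some error on v
eps_pred = a * v_pred + s * x_t                  # eps implied by that v prediction

snr = alpha_bar / (1 - alpha_bar)
v_loss = np.mean((v_pred - v) ** 2)
eps_loss = np.mean((eps_pred - eps) ** 2)
print(eps_loss / v_loss, snr / (1 + snr))        # both equal alpha_bar
```

So the same network error produces an eps loss that is SNR / (1 + SNR) times the v loss, which is why comparing raw loss magnitudes between the two parameterizations is not meaningful.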
I also found that vae_scale_factor differs from the commonly used AutoencoderKL setting (whose scale factor is less than 1). There are also scale_latents and scale_image; what are those for?
These are tricks to bias the model towards more global coherence (in contrast to focusing on local details), which is important for multi-view generation.
Great work! I see the paper says:
Therefore, we have opted to utilize the Stable Diffusion 2 v-prediction model as our base model for fine-tuning
but the code uses the same sampling call as Stable Diffusion, and the default lambdalabs/sd-image-variations-diffusers model is an eps-prediction model. How is the model converted? I also tried to verify whether the released model uses v-prediction or eps-prediction by adding noise and computing the loss against the ground-truth eps and v, and I found that the eps loss is smaller. Is that expected?