IrisRainbowNeko / DreamArtist-stable-diffusion

stable diffusion webui with contrastive prompt tuning

How do you calculate 'reconstruction' constraint loss? #31

Closed ygtxr1997 closed 1 year ago

ygtxr1997 commented 1 year ago

Simply speaking, for vanilla Stable Diffusion during training: given an input x(0), the first-stage model encodes it into the latent z(0). After adding noise eps(t), we get the noisy latent z(t). Vanilla Stable Diffusion computes the loss between eps(t) and the UNet-predicted noise eps(t)_pred for timestep t. Using eps(t)_pred, we denoise z(t) and get z(t-1)_pred.
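
For concreteness, here is roughly the vanilla training step I mean; a minimal sketch where the `unet`/`vae_encode` interfaces and the `alphas_cumprod` schedule tensor are placeholders, not this repo's actual code:

```python
import torch
import torch.nn.functional as F

def vanilla_sd_loss(unet, vae_encode, x0, cond, alphas_cumprod):
    """Standard epsilon-prediction loss for latent diffusion (illustrative sketch)."""
    # encode the image into the latent z(0); 0.18215 is SD's usual latent scaling
    z0 = vae_encode(x0) * 0.18215

    # sample a timestep t and Gaussian noise eps(t)
    t = torch.randint(0, len(alphas_cumprod), (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)

    # forward diffusion: z(t) = sqrt(a_bar_t) * z(0) + sqrt(1 - a_bar_t) * eps
    a_bar_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    zt = a_bar_t.sqrt() * z0 + (1 - a_bar_t).sqrt() * eps

    # the UNet predicts the added noise; the loss is MSE between eps and eps(t)_pred
    eps_pred = unet(zt, t, cond)
    return F.mse_loss(eps_pred, eps)
```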

Your paper proposes a reconstruction constraint loss, which computes the L1 loss between the ground-truth x(0) and the decoded prediction z(0)_pred: $$||x(0) - D(z_{pred}(0))||_1$$
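
Assuming z(0)_pred were available, I read the constraint as simply an L1 between the input image and its decoded prediction; a sketch with placeholder names:

```python
def reconstruction_loss(vae_decode, x0, z0_pred):
    # decode the predicted latent back to pixel space (undo SD's 0.18215 latent scaling)
    x0_rec = vae_decode(z0_pred / 0.18215)
    # ||x(0) - D(z_pred(0))||_1
    return (x0 - x0_rec).abs().mean()
```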

However, obtaining z(0)_pred from z(t-1)_pred requires t-1 further denoising steps, which seems very time-consuming during training (the UNet forward pass would be called t-1 times for each input batch). My question is: do you implement the reconstruction loss by denoising for t-1 steps during training, or do you use a more efficient method to get z(0)_pred from z(t-1)_pred?

ygtxr1997 commented 1 year ago

Another possible method is to calculate the L1 loss between the ground-truth x(0) and the decoded z(t-1)_pred: $$||x(0)-D(z_{pred}(t-1))||_1$$ Directly decoding the intermediate denoised latent z(t-1)_pred seems reasonable, but I'm not sure whether it is correct.
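
In code, this alternative would be a single DDPM reverse step followed by a decode; a sketch using the usual DDPM posterior-mean formula, with all schedule and model names assumed for illustration:

```python
def decode_one_step(unet, vae_decode, zt, t, cond, alphas, alphas_cumprod, betas):
    """Take one reverse step z(t) -> z(t-1)_pred and decode it (illustrative sketch)."""
    eps_pred = unet(zt, t, cond)
    a_t = alphas[t].view(-1, 1, 1, 1)
    a_bar_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    b_t = betas[t].view(-1, 1, 1, 1)
    # DDPM posterior mean: z(t-1)_pred = (z(t) - b_t / sqrt(1 - a_bar_t) * eps_pred) / sqrt(a_t)
    zt_minus_1 = (zt - b_t / (1 - a_bar_t).sqrt() * eps_pred) / a_t.sqrt()
    return vae_decode(zt_minus_1 / 0.18215)
```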

ygtxr1997 commented 1 year ago

It seems I misunderstood the predicted noise. The noise is added to the starting latent z(0) rather than to z(t-1), so z(0)_pred can be recovered from z(t) and eps(t)_pred in a single step without iterative denoising.
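
In other words, since z(t) = sqrt(a_bar_t) * z(0) + sqrt(1 - a_bar_t) * eps, the clean latent can be estimated in one step from z(t) and the predicted noise; a sketch of that estimate (not necessarily how this repo implements it):

```python
def predict_z0(zt, eps_pred, a_bar_t):
    # invert the forward diffusion with the predicted noise:
    # z(0)_pred = (z(t) - sqrt(1 - a_bar_t) * eps(t)_pred) / sqrt(a_bar_t)
    return (zt - (1 - a_bar_t).sqrt() * eps_pred) / a_bar_t.sqrt()
```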