Another possible method is to calculate the L1 loss between the ground-truth $x(0)$ and the decoded $z_{pred}(t-1)$, like:

$$||x(0) - D(z_{pred}(t-1))||_1$$

Directly decoding the intermediate denoised latent $z_{pred}(t-1)$ seems reasonable, but I'm not sure whether it is correct.
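For concreteness, here is a minimal PyTorch sketch of this variant. The `vae` object and its `decode` method are hypothetical stand-ins for the first-stage model, and all tensors are assumed to be on the same device:

```python
import torch.nn.functional as F

def intermediate_recon_loss(x0, z_tm1_pred, vae):
    """L1 between the ground-truth image x(0) and the decoded
    intermediate latent: ||x(0) - D(z_pred(t-1))||_1."""
    # Keep gradients flowing through the decoder so the UNet that
    # produced z_tm1_pred receives a pixel-space training signal.
    x_decoded = vae.decode(z_tm1_pred)
    return F.l1_loss(x_decoded, x0)
```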
It seems I misunderstood the predicted noise. The noise is added to the starting input $z(0)$ rather than to $z(t-1)$.
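In other words, the DDPM forward process produces $z(t)$ from the clean latent $z(0)$ in closed form, without stepping through $z(t-1)$. A minimal sketch, assuming a precomputed `alphas_cumprod` buffer holding the per-timestep $\bar{\alpha}_t$ values:

```python
import torch

def add_noise(z0, eps, t, alphas_cumprod):
    """Forward process q(z_t | z_0):
    z(t) = sqrt(alpha_bar_t) * z(0) + sqrt(1 - alpha_bar_t) * eps
    The noise is applied to z(0) directly, not to z(t-1)."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)  # per-sample alpha_bar_t
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps
```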
Simply speaking, for vanilla Stable Diffusion during training: given an input $x(0)$, the first-stage model encodes it into the latent $z(0)$. After adding noise $\epsilon(t)$, we get the noisy $z(t)$. Vanilla Stable Diffusion computes the loss between $\epsilon(t)$ and the UNet-predicted noise $\epsilon_{pred}(t)$ for timestep $t$. Using $\epsilon_{pred}(t)$, we denoise $z(t)$ and get $z_{pred}(t-1)$.
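Put as code, one vanilla training step looks roughly like the sketch below. The `vae` and `unet` call signatures are simplified assumptions (real Stable Diffusion also passes text conditioning and applies a latent scaling factor):

```python
import torch
import torch.nn.functional as F

def sd_training_loss(x0, vae, unet, alphas_cumprod, num_timesteps=1000):
    """One simplified Stable Diffusion training step."""
    z0 = vae.encode(x0)                                  # x(0) -> z(0)
    t = torch.randint(0, num_timesteps, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)                           # eps(t)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps  # noisy z(t)
    eps_pred = unet(zt, t)                               # eps_pred(t)
    return F.mse_loss(eps_pred, eps)                     # vanilla SD loss
```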
Your paper proposes a reconstruction constraint loss, which calculates the L1 loss between the ground-truth $x(0)$ and the decoded prediction $z_{pred}(0)$, like:

$$||x(0) - D(z_{pred}(0))||_1$$
However, obtaining $z(0)$ from $z_{pred}(t-1)$ requires $t-1$ denoising steps, which seems very time-consuming during training (the forward pass would be called $t-1$ times for one input batch). My question is: do you implement the reconstruction loss by denoising for $t-1$ steps during training, or do you use a more efficient method to get $z_{pred}(0)$ from $z_{pred}(t-1)$?
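For reference, the DDPM forward-process identity admits a one-step estimate of $z(0)$ directly from $z(t)$ and $\epsilon_{pred}(t)$, a common shortcut in the diffusion literature. Whether the paper actually uses it is exactly what this question asks, so the sketch below is only an illustration, not the authors' implementation:

```python
import torch.nn.functional as F

def one_step_z0_estimate(zt, eps_pred, t, alphas_cumprod):
    """Invert the forward process in a single step:
    z_pred(0) = (z(t) - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t)
    No t-1 iterative denoising steps are needed."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return (zt - (1.0 - a_bar).sqrt() * eps_pred) / a_bar.sqrt()

def reconstruction_loss(x0, zt, eps_pred, t, alphas_cumprod, vae):
    """||x(0) - D(z_pred(0))||_1 using the one-step estimate above."""
    z0_pred = one_step_z0_estimate(zt, eps_pred, t, alphas_cumprod)
    return F.l1_loss(vae.decode(z0_pred), x0)
```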