Closed. Time-Lord12th closed this issue 3 months ago.
For 1: you do need to scale the noisy latents; however, the scaling is only applied to the diffusion targets. Sharing the UNet doesn't necessarily mean the two branches need to be at the same scale.
For 2: this is actually related to the latter part of 1. By default, the SD VAE output needs to be rescaled by about 0.19 (`vae.config.scaling_factor`) before being fed into diffusion; we skipped that step for the condition branch, so it is effectively scaled up by roughly a factor of 5.
This can look strange at first glance, but empirically it strengthens the local conditioning signal and improves the final results.
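To make the convention above concrete, here is a minimal sketch. It assumes the standard SD VAE scaling factor of 0.18215 (close to the 0.19 quoted above); the function names are illustrative, not the repository's actual API:

```python
# Sketch of the scaling convention described above. The diffusion-target
# branch applies the usual VAE scaling; the condition branch deliberately
# skips it, so its latents end up roughly 1 / 0.18215 ~ 5.5x larger.

SCALING_FACTOR = 0.18215  # typical SD value of vae.config.scaling_factor


def encode_for_diffusion(raw_latent: float) -> float:
    # Diffusion-target branch: apply the standard VAE latent scaling.
    return raw_latent * SCALING_FACTOR


def encode_for_condition(raw_latent: float) -> float:
    # Reference/condition branch: skip the scaling on purpose,
    # which amplifies the conditioning signal.
    return raw_latent


raw = 1.0
target = encode_for_diffusion(raw)
cond = encode_for_condition(raw)
print(cond / target)  # ~5.49, the "factor of 5" mentioned above
```

The point is that the factor of 5 is not an explicit multiplication anywhere; it falls out of omitting `vae.config.scaling_factor` on one branch.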
This is Moon R. Jung. Let me add some comments to this thread.
Q1. eliphatfs said: "For 1: you need to scale the noisy latents. However, the scale is only done on diffusion targets." => Does this mean we need not scale the noisy latents, but only the diffusion targets? If so, what exactly does that mean?
Hello, I am trying to fine-tune the model and I have some questions. Could you please help me answer them?

1. I see `unscale_latents` and `unscale_image` in `pipeline.py`. So during training, do I need to apply `scale_image` and `scale_latents` to get the noisy latents? If so, why are the condition images not scaled in `pipeline.py`, since the two branches of the Reference Attention model use the same UNet?
2. It is stated that "the model achieves the highest consistency with the conditioning image when the reference latent is scaled by a factor of 5", but I haven't seen this implemented in `pipeline.py`. Does it mean using 5x the condition image latents?
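On the first question, if `scale_latents`/`scale_image` are simply the inverses of the pipeline's `unscale_latents`/`unscale_image`, then training would apply the forward scaling before noising and inference would undo it after denoising. A hypothetical round-trip sketch (the affine form and the constants 0.22 / 0.75 are illustrative assumptions, not the repository's actual values; check `pipeline.py` for the real ones):

```python
# Hypothetical affine scale/unscale pair: scale_fn mirrors a training-side
# scale_latents, unscale_fn mirrors the pipeline's unscale_latents.
# Shift/gain values here are placeholders, not the repo's constants.

def scale_fn(x: float, shift: float, gain: float) -> float:
    # Forward scaling applied to latents before the diffusion/noising step.
    return (x - shift) * gain


def unscale_fn(x: float, shift: float, gain: float) -> float:
    # Exact inverse, applied after denoising and before the VAE decode.
    return x / gain + shift


x = 3.14
roundtrip = unscale_fn(scale_fn(x, 0.22, 0.75), 0.22, 0.75)
print(abs(roundtrip - x) < 1e-9)  # True: the pair is mutually inverse
```

If that inverse relationship holds, then yes: whatever the pipeline unscales at inference time is exactly what training must scale before adding noise.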