SUDO-AI-3D / zero123plus

Code repository for Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model.
Apache License 2.0

scaling about reference attention #66

Closed Time-Lord12th closed 3 months ago

Time-Lord12th commented 4 months ago

Hello, I am trying to fine-tune the model and have some questions. Could you please help me answer them?

  1. In pipeline.py#L403, there are unscale_latents and unscale_image. So during training, do I need to apply scale_image and scale_latents to get the noisy latents? If so, why are the condition images not scaled in pipeline.py, given that the two branches of the Reference Attention model use the same UNet?
  2. The report says that "the model achieves the highest consistency with the conditioning image when the reference latent is scaled by a factor of 5". But I haven't seen this implemented in pipeline.py. Does it mean using 5x the condition image latents?

eliphatfs commented 4 months ago

For 1: you need to scale the noisy latents. However, the scaling is only applied to the diffusion targets. Sharing the UNet doesn't necessarily mean that the two branches need to be at the same scale.
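
The relationship between the training-side scale_latents and the inference-side unscale_latents can be sketched as a pair of inverse affine maps. This is only a minimal sketch: the function names come from the question above, but the affine form and the constants SHIFT and GAIN are hypothetical placeholders, so the actual definitions should be read from pipeline.py.

```python
# Hypothetical constants for illustration only; the real values are in pipeline.py.
SHIFT = 0.2
GAIN = 0.5


def scale_latents(latents, shift=SHIFT, gain=GAIN):
    """Training side: map VAE latents to the scale the UNet is trained to denoise."""
    return (latents - shift) * gain


def unscale_latents(latents, shift=SHIFT, gain=GAIN):
    """Inference side: exact inverse of scale_latents, applied before VAE decoding."""
    return latents / gain + shift


# Only arithmetic operators are used, so the same code would work on floats,
# NumPy arrays, or torch tensors. Round-trip check on a plain float:
x = 1.7
assert abs(unscale_latents(scale_latents(x)) - x) < 1e-9
```

Under this assumption, only the diffusion target (the multi-view latents that get noised) passes through scale_latents during training; the condition branch is left alone.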

For 2: Actually, it is related to the latter part of 1. By default, the SD VAE output needs to be rescaled by about 0.19 (vae.config.scaling_factor) before being sent into the diffusion model; we skipped that step for the condition branch (so it is roughly scaled by a factor of 5).

This can look strange at first glance, but it enhances the local conditioning signal and empirically helps with the final results.
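
The "factor of 5" arithmetic can be checked with a short sketch. It assumes the VAE uses 0.18215, the default scaling_factor of the Stable Diffusion VAE in diffusers; the value for this checkpoint should be read from vae.config.scaling_factor. The helper names below are hypothetical and do not appear in pipeline.py.

```python
# Assumption: default Stable Diffusion VAE scaling factor; check vae.config.scaling_factor.
SCALING_FACTOR = 0.18215


def prepare_target_latents(vae_latents, scaling_factor=SCALING_FACTOR):
    """Diffusion-target branch: apply the usual VAE rescaling."""
    return vae_latents * scaling_factor


def prepare_condition_latents(vae_latents):
    """Condition branch: the rescaling step is skipped, so raw VAE latents are used."""
    return vae_latents


def relative_condition_scale(scaling_factor=SCALING_FACTOR):
    """How much larger the condition latents are than conventionally rescaled ones."""
    return 1.0 / scaling_factor


z = 2.0  # stand-in for one raw VAE latent value
ratio = prepare_condition_latents(z) / prepare_target_latents(z)
print(round(ratio, 2))
```

With this assumed value the ratio is about 5.5, which matches the "roughly a factor of 5" described above without any explicit multiplication by 5 in the code.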

moonryul commented 2 months ago

This is Moon R. Jung. Let me add some comments to this thread.

Q1. eliphatfs said: "For 1: you need to scale the noisy latents. However, the scale is only done on diffusion targets." Does this mean that we need to scale not the noisy latents but the diffusion targets? If so, what does that mean?