SUDO-AI-3D / zero123plus

Code repository for Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model.
Apache License 2.0

Is the training based on LoRA, or does it just tune the original model parameters? #44

Closed: WillowKaze closed this issue 6 months ago

WillowKaze commented 7 months ago

Excellent work! I was inspired by it and want to try it in some other work. I understand your decision not to release the training code, but I would appreciate it if you could share more about the details.

  1. Is the fine-tuning done with LoRA, or are the original parameters tuned directly? Would using LoRA give relatively bad results?
  2. How could I train a ControlNet if I have some other conditions? Should it be an off-the-shelf one or one trained from scratch? Can I use a T2I-Adapter instead? Does the ControlNet training happen before or after the training of the denoising UNet?
  3. How much does the noise schedule matter? Is it possible to use \epsilon-prediction and the original noise schedule and still get satisfying results?

Thank you!

WillowKaze commented 7 months ago

Another thing that confuses me is the "scaled by a factor of 5" in the paper. Is that implemented in the code? I could not find it.

eliphatfs commented 7 months ago

  1. Original parameters. LoRA barely works if the rank is as low as 4, but can work to some extent if the rank is 64 or more (see the LoRA sketch after this list).
  2. ControlNet generally assumes local spatial correlation between the input condition and the output image, so you will need multi-view control images. For off-the-shelf ones you may want to try https://github.com/haofanwang/ControlNet-for-Diffusers; I am not familiar with T2I-Adapter, so I cannot tell for now. The ControlNet is trained after the UNet, the same as regular ControlNets (see the initialization sketch below).
  3. It is not possible; if you do not change the schedule, you will in general need other tricks, such as Gaussian blob initialization (I am referring to the Instant3D paper by Adobe), to provide more global clues. I think changing the schedule is the principled way to go, though (see the scheduler sketch below). Epsilon models can be used to enhance local details, which is one of the future works I mentioned in the Zero123++ report and am currently working on.
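
For point 1, a minimal sketch of attaching a sufficiently high-rank LoRA to the UNet with diffusers and peft. The checkpoint id, target module names, and rank here are illustrative assumptions, not the setup we used:

```python
from diffusers import UNet2DConditionModel
from peft import LoraConfig

# Load the UNet to be fine-tuned (repo id assumed; adjust to your checkpoint).
unet = UNet2DConditionModel.from_pretrained(
    "sudo-ai/zero123plus-v1.2", subfolder="unet"
)

# Rank 4 barely works for this task, so use r >= 64 on the attention projections.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
unet.add_adapter(lora_config)  # diffusers models accept peft adapters directly
```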
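
For point 2, a sketch of the usual "train the ControlNet after the UNet" recipe: initialize the ControlNet from the already fine-tuned UNet, freeze the UNet, and train only the ControlNet on (multi-view control image, multi-view target) pairs. The repo id is again an assumption:

```python
from diffusers import ControlNetModel, UNet2DConditionModel

# Start from the already fine-tuned multi-view UNet (repo id assumed).
unet = UNet2DConditionModel.from_pretrained(
    "sudo-ai/zero123plus-v1.2", subfolder="unet"
)

# Initialize the ControlNet weights from that UNet, then train only the ControlNet.
controlnet = ControlNetModel.from_unet(unet)
unet.requires_grad_(False)
controlnet.train()
```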
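
And for point 3, a sketch of the schedule change: the Zero123++ report moves from SD's scaled-linear schedule with \epsilon-prediction to a linear schedule with v-prediction, which in diffusers terms looks roughly like:

```python
from diffusers import DDPMScheduler

# Stock SD uses beta_schedule="scaled_linear" with prediction_type="epsilon";
# Zero123++ switches to a linear schedule with v-prediction to put more weight
# on the global, low-frequency structure early in sampling.
scheduler = DDPMScheduler(
    num_train_timesteps=1000,
    beta_schedule="linear",
    prediction_type="v_prediction",
)
```
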
eliphatfs commented 7 months ago

By default, the SD VAE output needs to be rescaled by about 0.18 (vae.config.scaling_factor) before being fed into the diffusion model. We skip that step for the condition branch, so the condition latents end up roughly 5 times larger than the denoising-branch latents (1/0.18 ≈ 5.5). We also have an extra function called scale_latents that normalizes the residual by shifting and rescaling the latents according to statistics we computed from Objaverse renders.
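
A rough sketch of those two points. The shift/scale constants below are placeholders standing in for the Objaverse statistics; the actual values live in the zero123plus pipeline code:

```python
import torch

def encode_condition(vae, image: torch.Tensor) -> torch.Tensor:
    # Encode the condition image WITHOUT the usual multiplication by
    # vae.config.scaling_factor (~0.18), leaving these latents about
    # 1 / 0.18 ~= 5.5x larger than the denoising-branch latents.
    return vae.encode(image).latent_dist.sample()

def scale_latents(latents: torch.Tensor) -> torch.Tensor:
    # Shift and rescale by dataset statistics computed from Objaverse
    # renders (SHIFT and SCALE here are placeholders, not the real numbers).
    SHIFT, SCALE = 0.22, 0.75
    return (latents - SHIFT) * SCALE
```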