VisualComputingInstitute / diffusion-e2e-ft

Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
https://gonzalomartingarcia.github.io/diffusion-e2e-ft/

How to fine-tune SD for depth / normal estimation using your method? #3

Open onpix opened 1 day ago

onpix commented 1 day ago

Hi, thanks for your great work! I have a question regarding the fine-tuning of Stable Diffusion. As your Marigold code shows:

https://github.com/VisualComputingInstitute/diffusion-e2e-ft/blob/ad32ee3a529b50c5332f4290e0de4dd0ef0150ae/Marigold/marigold/marigold_pipeline.py#L447

You concat the RGB latent and noise latent (zero) as input, which is in the same format as Marigold. However, I am wondering how to fine-tune SD using your method for single image depth / normal prediction? Since SD's input is like:

output = unet(noise_latent, ...)

If we set the noise latent to zero, the model has no image conditioning; but if we set the input to the RGB latent, like:

output = unet(rgb_latent, ...)

We found that the results are also bad (with significant artifacts). Thus, could you please advise how to implement your method to fine-tune SD for depth / normal prediction? Thanks!
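For reference, the concatenated input format used in the linked Marigold pipeline looks roughly like this (shapes and variable names are illustrative, not taken from the repository):

```python
import torch

# Illustrative latent shapes: 4 channels each, as in Stable Diffusion's VAE.
rgb_latent = torch.randn(1, 4, 64, 64)        # encoded input image
noise_latent = torch.zeros_like(rgb_latent)   # zeroed noise latent, per E2E FT

# Channel-wise concatenation yields the 8-channel UNet input.
unet_input = torch.cat([rgb_latent, noise_latent], dim=1)
print(unet_input.shape)  # torch.Size([1, 8, 64, 64])
```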

GonzaloMartinGarcia commented 23 hours ago

Hi,

To minimize discrepancies between E2E FT Marigold and E2E FT Stable Diffusion (i.e., without Marigold diffusion estimation pretraining), we made sure that the Stable Diffusion fine-tuning follows the same training setup. This means that, just like Marigold, the Stable Diffusion UNet's input channels were doubled from 4 to 8, and the weights and biases of the first convolution were divided by two, so that the exact same training procedure applies. Since the inputs to the new channels are zeros, they should not influence model training. Unlike the Marigold-E2E-FT run, we initialized from the Stable Diffusion checkpoint.

A minor consequence is that, from the perspective of the first convolution at the start of training, the magnitude of the activations in the first layer is scaled down by a factor of 2.
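The channel-doubling step can be sketched as follows. This is a minimal illustration with a standalone `torch.nn.Conv2d`, assuming you would apply it to the UNet's input convolution (e.g. `unet.conv_in` in diffusers, updating the config's `in_channels` accordingly); the helper name is hypothetical:

```python
import torch

def double_in_channels(conv: torch.nn.Conv2d) -> torch.nn.Conv2d:
    """Return a copy of `conv` with 2x input channels, with the pretrained
    weights duplicated along the input axis and halved, and the bias halved,
    as described above."""
    new_conv = torch.nn.Conv2d(
        conv.in_channels * 2,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
    )
    with torch.no_grad():
        # Duplicate pretrained weights over the input channels and halve them.
        new_conv.weight.copy_(conv.weight.repeat(1, 2, 1, 1) / 2.0)
        new_conv.bias.copy_(conv.bias / 2.0)
    return new_conv
```

With the extra input half set to zeros, the new convolution's output is exactly half of the original convolution's output on the same RGB latent, which is the factor-of-2 activation scaling mentioned above.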

Hope this helps. We plan to release the evaluation and training code soon.