VisualComputingInstitute / diffusion-e2e-ft

Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think. Accepted to WACV 2025 and NeurIPS AFM Workshop.
https://vision.rwth-aachen.de/diffusion-e2e-ft

We request that you cite our similar work, GenPercept (https://arxiv.org/abs/2403.06090v1). #4

Closed by guangkaixu 1 month ago

guangkaixu commented 1 month ago

Hi authors, thank you for your valuable contributions to the field. I am the author of GenPercept (https://arxiv.org/abs/2403.06090v1), which was submitted to arXiv in March 2024. I would like to point out that our work already proposed a one-step deterministic fine-tuning strategy very similar to the one in your arXiv work "Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think", and on a broader set of tasks (depth, surface normals, segmentation, and matting). Our framework can fine-tune pretrained diffusion models with either the VAE decoder or customized task heads, using customized supervision signals. To accurately reflect the state of the research, I kindly request that you cite our paper as prior work in your manuscript. Thank you for your attention.

Pipeline: [figure omitted]

Hugging Face demo: https://huggingface.co/spaces/guangkaixu/GenPercept
GitHub repo: https://github.com/aim-uofa/GenPercept

kabouzeid commented 1 month ago

As we mentioned in our response to your email, we will include an appropriate discussion in our next arXiv update. Since you pointed us to your arXiv paper, we have taken a close look at it, and we find that your work arrives at very different conclusions:

Similar to us, you directly fine-tune Stable Diffusion in an end-to-end fashion; however, we arrived at this point in a very different way. We initially discovered the issue with the DDIM scheduler, fixed it in Marigold, and in turn arrived at an end-to-end fine-tuning scheme that works for Marigold. While we show that this also works well for Stable Diffusion directly, that is not the main message of our paper.

Your main contribution is that you can fine-tune Stable Diffusion for a broader spectrum of tasks; however, even with additional modules on top, you achieve lower scores than some of the baselines, suggesting that end-to-end fine-tuning is possible but not necessarily always the way to go. We show that end-to-end fine-tuning, when done in a straightforward way with a fixed DDIM scheduler, can achieve very competitive scores for monocular depth and normal estimation without additional architecture modifications.

As such, we agree that your paper is relevant, and we will add the above discussion to our paper; it does not change the main message of our paper, though.
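
For readers landing here: the scheduler fix mentioned above is, roughly, a one-line change. Below is a minimal sketch assuming the Hugging Face `diffusers` API and a Stable Diffusion 2 checkpoint; both are assumptions for illustration, not the authors' exact code, so consult this repo for the actual implementation.

```python
# Minimal sketch (not the authors' exact code) of the DDIM timestep-spacing
# fix referenced above, using the Hugging Face `diffusers` API. The checkpoint
# name is illustrative; Marigold builds on Stable Diffusion 2, so substitute
# whichever image-conditional checkpoint you actually use.
from diffusers import DDIMScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2")

# With the default "leading" spacing, the inference schedule does not start at
# the last training timestep, so at low step counts the model is handed pure
# noise at a timestep where it expects a nearly clean latent. "trailing"
# spacing starts the schedule at the final timestep, which is what makes
# few-step (and in particular single-step) DDIM inference behave.
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)
```

With a single deterministic step under the corrected schedule, the model's output becomes a direct function of the input image, which is the property that end-to-end fine-tuning on the task loss exploits.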

Closing this, since it's not an issue with the code.