EnVision-Research / Lotus

Official Implementation of LOTUS: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
https://lotus3d.github.io
Apache License 2.0

LOTUS prediction type #15

Open wonjunior opened 1 month ago

wonjunior commented 1 month ago

Thank you for sharing your work! Regarding the prediction type: you mention training from the original Stable Diffusion 2.0 model, and, if I am not mistaken, your objective type is x0. Isn't it a problem for the objective type to differ from that of the pre-trained model? Doesn't it converge worse than staying in a v-prediction setting?

BlingHe commented 1 month ago

Thanks for your attention!

Using a different objective function during fine-tuning is permissible, just as it is when fine-tuning other vision foundation models such as Transformers and ResNets. The pre-trained SD model provides powerful visual priors that enhance zero-shot generalization in downstream tasks. Because the pre-training and fine-tuning tasks differ, it is often necessary to adopt a more appropriate objective function. As stated in our paper, "the original settings for image generation are no longer the optimal solution for downstream dense prediction tasks." Investigating which objective function is best suited for dense prediction is one of our key contributions. We analyze this in Section 4, where x0 prediction demonstrates superior performance compared to epsilon and v prediction, as shown in Figures 6 and 11.
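
For readers less familiar with the three parameterizations discussed here, the following is a minimal sketch of how the epsilon, x0, and v training targets relate under the standard diffusion forward process. It is not taken from the Lotus code base: the function names, the `prediction_type` labels, and the use of plain Python lists instead of tensors are assumptions made for illustration only.

```python
import math

def add_noise(x0, eps, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + s * e for x, e in zip(x0, eps)]

def training_target(x0, eps, alpha_bar, prediction_type):
    """Return the regression target for a given parameterization.

    The labels below are illustrative, not the project's actual config values.
    """
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    if prediction_type == "epsilon":
        return list(eps)                                   # predict the noise
    if prediction_type == "sample":
        return list(x0)                                    # predict the clean signal (x0)
    if prediction_type == "v_prediction":
        return [a * e - s * x for x, e in zip(x0, eps)]    # v = a * eps - s * x0
    raise ValueError(f"unknown prediction_type: {prediction_type}")

def x0_from_v(x_t, v, alpha_bar):
    """Recover x0 from a v-prediction: x0 = sqrt(alpha_bar) * x_t - sqrt(1 - alpha_bar) * v."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * xt - s * vi for xt, vi in zip(x_t, v)]

# Toy values standing in for a latent and its noise.
x0 = [0.5, -1.0]
eps = [0.2, 0.3]
alpha_bar = 0.64

x_t = add_noise(x0, eps, alpha_bar)
v = training_target(x0, eps, alpha_bar, "v_prediction")
recovered = x0_from_v(x_t, v, alpha_bar)
```

Since the three targets are linear functions of one another given x_t and the noise level, switching the objective during fine-tuning changes what the network output is regressed onto, not the underlying forward process.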

Best,