Open wonjunior opened 1 month ago

Thank you for sharing your work! Regarding the prediction type: you mention training from the original Stable Diffusion 2.0 model, and if I am not mistaken, your objective type is x0. Isn't it an issue for the objective type to differ from that of the pre-trained model? Doesn't it converge worse than staying with the v-prediction setting?

Thanks for your attention!

Using a different objective function during fine-tuning is permissible, much like fine-tuning other foundational vision models such as Transformers and ResNets. The pre-trained SD model provides powerful visual priors that enhance zero-shot generalization in downstream tasks. Given the differences between the pre-training and fine-tuning tasks, it is often necessary to adopt a more appropriate objective function. As stated in our paper, "the original settings for image generation are no longer the optimal solution for downstream dense prediction tasks." Investigating which objective function is best suited for dense prediction is one of our key contributions. We analyze this in Section 4, where x0 demonstrates superior performance compared to epsilon and v, as shown in Figures 6 and 11.

Best,
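For context on why switching the objective is not inherently problematic: under the usual DDPM forward process, the x0, epsilon, and v targets are linear reparameterizations of one another, so changing the loss changes what the head regresses, not the information content. A minimal NumPy sketch (variable names like `abar` are illustrative, not from the paper's code):

```python
# Sketch of the three common diffusion training targets under the
# standard DDPM forward process:
#   x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps
#   v   = sqrt(abar) * eps - sqrt(1 - abar) * x0
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4,))    # clean sample
eps = rng.normal(size=(4,))   # injected noise
abar = 0.7                    # cumulative alpha at some timestep t

a, s = np.sqrt(abar), np.sqrt(1.0 - abar)
x_t = a * x0 + s * eps        # noised input seen by the network
v = a * eps - s * x0          # v-prediction target

# Given x_t, any one target determines the other two, so fine-tuning
# SD 2.0 (v-prediction) with an x0 loss is a change of regression
# target, not a change of the underlying quantity being modeled:
x0_from_v = a * x_t - s * v
eps_from_v = s * x_t + a * v

assert np.allclose(x0_from_v, x0)
assert np.allclose(eps_from_v, eps)
```

The practical question the paper studies is which target gives the best optimization behavior and downstream accuracy, since the losses weight timesteps and signal-to-noise ratios differently even though the targets are interconvertible.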