MyNiuuu / MOFA-Video

Official Pytorch implementation for MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model.
https://myniuuu.github.io/MOFA_Video

Clarifications on experiments #7

Closed songwoh closed 3 weeks ago

songwoh commented 3 weeks ago

Hi,

I have several questions regarding the ablation settings and training details.

(1) Could you elaborate on the experimental conditions for the non-tuning model (Section 4.2)? Does this setting include training the reference encoder and warping features with dense optical flow estimated through Unimatch? Or is it a purely tuning-free model (without ControlNet), where the dense optical flow is concatenated with the reference frame and given as a condition to SVD?

(2) Could you elaborate on the first training stage described in the Implementation Details section (Section 4)? The paper says, "We first train the model as a flow-based reconstruction model by removing the S2D motion generator and directly taking the first frame together with the estimated optical flow from Unimatch". The spirit of this question is the same as question (1): does this mean warping the features from the reference encoder using the dense optical flow from Unimatch?

Thank you.

MyNiuuu commented 3 weeks ago

(1) Experimental conditions for the non-tuning model

Does this setting include training the reference encoder and warping features with dense optical flow estimated through Unimatch?

Yes, as stated in Section 4.2, the non-tuning model directly uses dense optical flow estimated from Unimatch to warp the features of the reference encoder. The ControlNet-based architecture is still adopted.

(2) The first stage of training the model

The spirit of this question is the same as question (1): does this mean warping the features from the reference encoder using the dense optical flow from Unimatch?

Yes, during the first training stage, we warp the features from the reference encoder using the dense optical flow from Unimatch.

In the ablation experiment, the non-tuning model refers to directly using the stage-one model for inference without stage-two training, which demonstrates the necessity of stage two.
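For readers unfamiliar with flow-based feature warping, here is a minimal PyTorch sketch of warping a reference-encoder feature map with a dense optical flow field via bilinear sampling. The function name and tensor shapes are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn.functional as F


def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a feature map with a dense optical flow field (illustrative sketch).

    feat: (B, C, H, W) features, e.g. from the reference encoder
    flow: (B, 2, H, W) dense flow in pixels as (dx, dy), e.g. from Unimatch
    """
    b, _, h, w = feat.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=feat.dtype, device=feat.device),
        torch.arange(w, dtype=feat.dtype, device=feat.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0)  # (1, 2, H, W)
    # Displace the grid by the flow, then normalize coordinates to [-1, 1].
    pos = grid + flow
    pos_x = 2.0 * pos[:, 0] / max(w - 1, 1) - 1.0
    pos_y = 2.0 * pos[:, 1] / max(h - 1, 1) - 1.0
    norm_grid = torch.stack((pos_x, pos_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(
        feat, norm_grid, mode="bilinear", padding_mode="zeros", align_corners=True
    )
```

With a zero flow field this reduces to an identity warp, which is a quick sanity check before plugging in real Unimatch predictions.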

LPengYang commented 2 weeks ago

Thanks for your wonderful work. Since the non-tuning model does not involve stage two, where does the dense flow during inference come from? It looks like Unimatch cannot generate optical flow from only the first frame.

MyNiuuu commented 2 weeks ago

Thanks for your wonderful work. Since the non-tuning model does not involve stage two, where does the dense flow during inference come from? It looks like Unimatch cannot generate optical flow from only the first frame.

To run inference for the non-tuning model, we use the dense flow predictions from the S2D network. This experiment aims to demonstrate that fine-tuning the stage-one model is necessary for achieving superior performance.