MyNiuuu / MOFA-Video

Official Pytorch implementation for MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model.
https://myniuuu.github.io/MOFA_Video

Question about the mask in paper `Sparse Motion Vectors from Dense Optical Flow` #19

Closed LokiXun closed 4 days ago

LokiXun commented 5 days ago

Hi, this work is great and I would like to try keypoints from a driven video! There is no corresponding inference script for keypoints from a driven video, so I am trying to write the inference code myself.

I ran into problems when generating sparse motion vectors from a video's dense optical flow. It is unclear to me how to obtain the mask with the watershed sampling strategy for a given video. How many masks should we get for a 25-frame video? If the total number of frames is 25, should the mask be taken frame by frame (one per optical flow, i.e. 24 masks), or should we only use the mask from the first frame?

Thanks.

MyNiuuu commented 5 days ago

There is no corresponding inference script for keypoints from a driven video

We do have an inference script for keypoints from a driven video: https://github.com/MyNiuuu/MOFA-Video/blob/main/MOFA-Video-Hybrid/run_gradio_video_driven.py You can run the gradio demo to test the results.

For generating sparse motion vectors, you can refer to the code implementation around here. Also, please note that we do not use the watershed sampling strategy during inference, since we directly extract face landmarks from the video, and these serve as the sparse motion vectors.

MyNiuuu commented 4 days ago

Emmmm... Do you mean you want to use the keypoints extracted from an open-domain video to serve as a group of trajectories for trajectory-based control?

If that is the case, I think you can modify the current code based on these functions: sample_optical_flow, get_sparse_flow, and sample_inputs_face. These functions show how to sample the mask based on a set of keypoints across a series of frames.
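Just to illustrate the idea (a rough sketch of my own, not the actual sample_optical_flow / get_sparse_flow / sample_inputs_face implementations): under forward warping, the keypoint locations in the first frame define where the mask is 1, and the value stored there is the displacement of that keypoint from frame 0 to frame t.

```python
import torch

# Rough sketch only (not the repository functions): build a sparse flow and
# mask from keypoint tracks of shape (T, N, 2) in (x, y) pixel coordinates.
def keypoints_to_sparse_flow(kps: torch.Tensor, h: int, w: int):
    t = kps.shape[0]
    sparse_flow = torch.zeros(t - 1, 2, h, w)
    mask = torch.zeros(t - 1, 1, h, w)
    x0 = kps[0, :, 0].long()  # keypoint x positions in the first frame
    y0 = kps[0, :, 1].long()  # keypoint y positions in the first frame
    for i in range(1, t):
        disp = (kps[i] - kps[0]).float()            # displacement from frame 0 to frame i, (N, 2)
        sparse_flow[i - 1, 0, y0, x0] = disp[:, 0]  # x displacement stored at the frame-0 locations
        sparse_flow[i - 1, 1, y0, x0] = disp[:, 1]  # y displacement stored at the frame-0 locations
        mask[i - 1, 0, y0, x0] = 1.0
    return sparse_flow, mask
```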

LokiXun commented 4 days ago

My mistake, thanks.

May I ask how the sparse motion vectors and the masks are obtained during training (the Sparse Motion Vectors from Dense Optical Flow part of the paper)? What should I do about the mask if I only extract optical flows, rather than landmarks, from a given video? Is the mask the same for all frames, or is it sampled per frame?

LokiXun commented 4 days ago

Emmmm... Do you mean you want to use the keypoints extracted from an open-domain video to serve as a group of trajectories for trajectory-based control?

If that is the case, I think you can modify the current code based on these functions: sample_optical_flow, get_sparse_flow, and sample_inputs_face. These functions show how to sample the mask based on a set of keypoints across a series of frames.

Okay, thanks. I am just wondering what I should do to convert the dense optical flow (from a given video) into sparse motion vectors.

MyNiuuu commented 4 days ago

To generate a 25-frame video, let's assume you choose 10 spatial points from the initial frame. According to the forward warping concept, the 24 masks should be identical, with the coordinates of 10 spatial points marked as 1 and the remaining coordinates as 0.

You can then utilize these 24 masks to obtain the corresponding sparse motion vectors by performing element-wise multiplication between the masks and the forward dense optical flows.
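For what it's worth, here is one way to write that down in code (a minimal sketch, not taken from the repository), assuming the forward dense optical flow has shape (24, 2, H, W) and 10 (x, y) points are chosen on the first frame:

```python
import torch

# Minimal sketch: turn forward dense optical flow into sparse motion vectors
# by masking it at the chosen first-frame points (forward-warping convention).
def dense_to_sparse_flow(dense_flow: torch.Tensor, points: torch.Tensor):
    # dense_flow: (T-1, 2, H, W) forward flows from the first frame
    # points:     (N, 2) integer (x, y) coordinates chosen on the first frame
    t, _, h, w = dense_flow.shape
    mask = torch.zeros(1, 1, h, w)
    mask[0, 0, points[:, 1].long(), points[:, 0].long()] = 1.0  # 1 at the chosen points, 0 elsewhere
    mask = mask.repeat(t, 1, 1, 1)                              # identical mask for every flow
    sparse_flow = dense_flow * mask                             # element-wise multiplication
    return sparse_flow, mask

# e.g. a 25-frame video -> 24 forward flows, 10 points on the first frame
flows = torch.randn(24, 2, 256, 256)
pts = torch.randint(0, 256, (10, 2))
sparse, m = dense_to_sparse_flow(flows, pts)
```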

LokiXun commented 4 days ago

To generate a 25-frame video, let's assume you choose 10 spatial points from the initial frame. According to the forward warping concept, the 24 masks should be identical, with the coordinates of 10 spatial points marked as 1 and the remaining coordinates as 0.

You can then utilize these 24 masks to obtain the corresponding sparse motion vectors by performing element-wise multiplication between the masks and the forward dense optical flows.

Okay, now I got it, thanks. Hahaha