MyNiuuu / MOFA-Video

[ECCV 2024] MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model.
https://myniuuu.github.io/MOFA_Video

About train dataset processing #28

Closed · TomSuen closed this 4 months ago

TomSuen commented 4 months ago

Hi, thank you for such wonderful work! I would like to ask a question about the preparation of the training set. I noticed that you mentioned in the paper: "During training, we randomly sample 14 video frames with a stride of 4 ... with a resolution of 256 × 256. We first train ... directly taking the first frame together with the estimated optical flow from Unimatch."

So my question is: what value of nms_ks did you use in the flow_sampler function of the watershed sampler? I set it to 3 to get as many sampling points as possible, but it is hard to reconstruct the original video from just these points. Is this normal?

By the way, I found that one possible reason is that the masks are all taken from the first frame. If an object in the first frame does not move, it is difficult for the watershed algorithm to sample there, resulting in a lack of guidance for that object in the sparse flow guidance sequence, so the reconstruction is not ideal, right?

MyNiuuu commented 4 months ago

So my question is: what value of nms_ks did you use in the flow_sampler function of the watershed sampler? I set it to 3 to get as many sampling points as possible, but it is hard to reconstruct the original video from just these points. Is this normal?

We set nms_ks=15 during training. I think nms_ks=3 may be a little too small for training the model.
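
To illustrate what nms_ks controls, here is a minimal sketch (not the repository's actual flow_sampler) of non-maximum-suppression-based point sampling from a dense flow field, assuming the flow is an (H, W, 2) NumPy array. A larger kernel suppresses more neighboring candidates, so nms_ks=15 yields fewer, more spread-out points than nms_ks=3:

```python
# Minimal sketch of NMS-based point sampling from a dense flow field.
# This is NOT the actual flow_sampler in this repository; it only illustrates
# how an nms_ks-sized window suppresses neighboring candidates, so a larger
# kernel yields fewer, more spread-out sample points.
import numpy as np
from scipy.ndimage import maximum_filter

def sample_flow_points(flow, nms_ks=15, mag_thresh=1.0):
    """flow: (H, W, 2) dense flow; returns an (N, 2) array of (y, x) points."""
    mag = np.linalg.norm(flow, axis=-1)            # per-pixel flow magnitude
    local_max = maximum_filter(mag, size=nms_ks)   # nms_ks x nms_ks NMS window
    keep = (mag == local_max) & (mag > mag_thresh) # local maxima above threshold
    ys, xs = np.nonzero(keep)
    return np.stack([ys, xs], axis=-1)
```

Note that in this simplified version the global maximum of the magnitude map always survives the suppression (as long as it exceeds the threshold), which is relevant to a question further down this thread.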

By the way, I found that one possible reason is that the masks are all taken from the first frame. If an object in the first frame does not move, it is difficult for the watershed algorithm to sample there, resulting in a lack of guidance for that object in the sparse flow guidance sequence, so the reconstruction is not ideal, right?

Yes, the watershed algorithm is unable to sample points that remain stationary in the initial frame. However, this may not significantly impact the model's training, as it is unnecessary and even inadvisable to sample every moving part throughout the video.

TomSuen commented 4 months ago

Okay, thank you for your reply. When will you open-source the training code?

TomSuen commented 4 months ago

And I have another question: if I resize a (336, 596) video to (256, 256) before predicting the optical flow, versus predicting the optical flow on the original-size video and then resizing it to (256, 256), will there be any difference between the two flow maps? Normally, Unimatch should not be too sensitive to input size.

MyNiuuu commented 4 months ago

And I have another question: if I resize a (336, 596) video to (256, 256) before predicting the optical flow, versus predicting the optical flow on the original-size video and then resizing it to (256, 256), will there be any difference between the two flow maps? Normally, Unimatch should not be too sensitive to input size.

We found that Unimatch is actually (relatively) sensitive to input size. Unimatch produces sharper predictions when the input frames are resized to its training size of [384, 512], and we adopted this setting during training.
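
One practical detail when comparing the two options: resizing a flow field is not just a spatial resize, because the vectors are measured in pixels of the grid they were predicted on. A minimal sketch, assuming an (H, W, 2) flow array and a hypothetical predict_flow wrapper around Unimatch:

```python
# Sketch of the two options discussed above, assuming flow is an (H, W, 2)
# float array with (u, v) = (horizontal, vertical) displacement in pixels.
# Resizing a flow field is not just a spatial resize: the vector values must
# also be rescaled, since displacements are measured on the original grid.
import cv2
import numpy as np

def resize_flow(flow, new_h, new_w):
    h, w = flow.shape[:2]
    resized = cv2.resize(flow.astype(np.float32), (new_w, new_h),
                         interpolation=cv2.INTER_LINEAR)
    resized[..., 0] *= new_w / w   # rescale horizontal displacements
    resized[..., 1] *= new_h / h   # rescale vertical displacements
    return resized

# Option A (the setting adopted here): resize the frames to Unimatch's
# training size of [384, 512] (height, width) first, predict, then rescale:
#   frames = [cv2.resize(f, (512, 384)) for f in frames]  # dsize is (W, H)
#   flow = predict_flow(frames)      # hypothetical wrapper around Unimatch
#   flow_256 = resize_flow(flow, 256, 256)
# Option B: predict at the original size and resize afterwards. The two
# results generally differ because the network sees differently scaled inputs.
```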

TomSuen commented 4 months ago

Thanks again, and forgive me for having so many questions. I noticed that the pre-trained models you provide are all for 25 frames. Can I fine-tune them on 14-frame data?

MyNiuuu commented 4 months ago

Thanks again, and forgive me for having so many questions. I noticed that the pre-trained models you provide are all for 25 frames. Can I fine-tune them on 14-frame data?

Yes, you can fine-tune the model on 14-frame data. I think it will not negatively impact the model's performance.
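
For reference, the clip sampling described in the paper quote at the top of this thread ("randomly sample 14 video frames with a stride of 4") can be sketched as below; the actual dataset code may differ in details such as boundary handling:

```python
# Minimal sketch of the clip sampling described in the paper quote at the top
# of this thread ("randomly sample 14 video frames with a stride of 4"); the
# actual dataset code may differ in details such as boundary handling.
import random

def sample_clip_indices(num_video_frames, clip_len=14, stride=4):
    """Return indices of clip_len frames spaced stride apart."""
    span = (clip_len - 1) * stride + 1   # frames spanned by one clip
    if num_video_frames < span:
        raise ValueError("video too short for this clip_len/stride")
    start = random.randint(0, num_video_frames - span)
    return [start + i * stride for i in range(clip_len)]

# e.g. fine-tuning a 25-frame checkpoint on 14-frame clips only changes
# clip_len: sample_clip_indices(120, clip_len=14, stride=4)
```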

MyNiuuu commented 4 months ago

Okay, thank you for your reply. When will you open-source the training code?

I am starting to prepare the release of the training code now that our paper has been accepted to ECCV 2024. I think the training code will be made available within a week 🤔.

TomSuen commented 4 months ago

Waooooo, great news😄

TomSuen commented 4 months ago

Hi, I have some other questions.

  1. For the first frame's optical flow, will the watershed sampler always sample the position of the maximum optical-flow value?
  2. During training, do you expect the optical flow obtained after CMP (densified from the sparse samples) to be exactly the same as the dense optical flow extracted from the original video? I mean, training is just meant to teach the model the guiding role of any optical flow, not to completely reconstruct the video, right?
MyNiuuu commented 4 months ago

Sorry for the late reply, busy days.

For the first frame's optical flow, will the watershed sampler always sample the position of the maximum optical-flow value?

Actually, I am not entirely sure about the answer to this question, but I think it is yes: the watershed sampler should sample the position of the maximum optical-flow value. You can check the code yourself, since I have just released the training code.

During training, do you expect the optical flow obtained after CMP (densified from the sparse samples) to be exactly the same as the dense optical flow extracted from the original video?

No. This is the last thing I would want 😂. It would make the model depend too heavily on the flow and lack sufficient generation ability.

I mean, training is just meant to teach the model the guiding role of any optical flow, not to completely reconstruct the video, right?

Yes, our goal is to use a rough flow as guidance. That is to say, given an inaccurate optical flow from CMP, the model can still generate semantically meaningful videos that correctly reflect the user's intention.
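
For intuition, here is a minimal sketch (not the repository's actual guidance format) of what a sparse guidance signal looks like before CMP densifies it, assuming the dense flow is an (H, W, 2) array:

```python
# Minimal sketch of a sparse guidance signal: scatter the dense flow values at
# the sampled locations into an otherwise zero map, plus a binary mask marking
# where guidance exists. The repository's actual guidance format may differ;
# this only illustrates that the model sees a rough, incomplete flow rather
# than the full dense field.
import numpy as np

def make_sparse_guidance(dense_flow, points):
    """dense_flow: (H, W, 2); points: (N, 2) integer array of (y, x) samples."""
    h, w = dense_flow.shape[:2]
    sparse_flow = np.zeros_like(dense_flow)      # zero everywhere but samples
    mask = np.zeros((h, w, 1), dtype=dense_flow.dtype)
    ys, xs = points[:, 0], points[:, 1]
    sparse_flow[ys, xs] = dense_flow[ys, xs]     # keep flow only at samples
    mask[ys, xs] = 1.0
    return sparse_flow, mask                     # e.g. input to CMP densification
```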

By the way, I have just released the training code; you can check it for details.

Best regards,

TomSuen commented 4 months ago

Thank you very much for your kind reply! I feel I have almost fully understood your work.