ali-vilab / UniAnimate

Code for the paper "UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation".
https://unianimate.github.io/

Details of training #5

Open zhysora opened 2 weeks ago

zhysora commented 2 weeks ago

Thanks for your great work. Do you use a two-stage training strategy like animate-anyone, i.e., train the UNet on image datasets and then train the motion module on videos? Or do you train the whole network in one stage?

wangxiang1230 commented 2 weeks ago

> Thanks for your great work. Do you use a two-stage training strategy like animate-anyone, i.e., train the UNet on image datasets and then train the motion module on videos? Or do you train the whole network in one stage?

Hi, thanks for your attention. We only train the model on videos, without incorporating image training, and we train the whole network in one stage. We have submitted the code to the company for approval, and it is expected to be released today or tomorrow.

BugsMaker0513 commented 2 weeks ago

> Thanks for your great work. Do you use a two-stage training strategy like animate-anyone, i.e., train the UNet on image datasets and then train the motion module on videos? Or do you train the whole network in one stage?
>
> Hi, thanks for your attention. We only train the model on videos, without incorporating image training, and we train the whole network in one stage. We have submitted the code to the company for approval, and it is expected to be released today or tomorrow.

How many videos are used for training?

wangxiang1230 commented 2 weeks ago

Hi, ~10K videos are used for training. More videos will lead to better results.

yang19527 commented 2 weeks ago

Thanks for your great work. Do you have any requirements on the length of the training videos, i.e., how long should they be? Also, are long videos cropped into multiple clips? And are there any video quality requirements? For example, do videos with blurred hands need to be filtered out?

wangxiang1230 commented 2 weeks ago

> Thanks for your great work. Do you have any requirements on the length of the training videos, i.e., how long should they be? Also, are long videos cropped into multiple clips? And are there any video quality requirements? For example, do videos with blurred hands need to be filtered out?

Hi, thanks for your attention. Since we train our model on 16/32 frames, we filter out videos with fewer than 32 frames. For long videos, we randomly and uniformly sample frames from them. For video quality, only videos with a resolution larger than 768x512 are kept.
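
For reference, here is a minimal sketch of those filtering and sampling rules, assuming the `decord` library for video reading. The function names, the orientation of the 768x512 threshold, and the random-start-plus-uniform-stride sampling scheme are my assumptions, not the repository's actual preprocessing code.

```python
# Sketch of the length/resolution filter and frame sampling described above.
import random
from decord import VideoReader

MIN_FRAMES = 32                  # discard videos shorter than the 32-frame setting
MIN_LONG, MIN_SHORT = 768, 512   # keep videos whose resolution exceeds 768x512

def keep_video(path: str) -> bool:
    """Return True if the video passes the frame-count and resolution filters."""
    vr = VideoReader(path)
    h, w, _ = vr[0].shape        # first frame, (H, W, C)
    return (len(vr) >= MIN_FRAMES
            and max(h, w) >= MIN_LONG
            and min(h, w) >= MIN_SHORT)

def sample_frame_indices(num_total: int, num_frames: int = 16) -> list[int]:
    """Randomly pick a start, then uniformly spread `num_frames` indices after it."""
    start = random.randint(0, num_total - num_frames)
    stride = (num_total - start) // num_frames
    return [start + i * stride for i in range(num_frames)]
```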

yang19527 commented 2 weeks ago

Is the random uniform sampling of long videos performed during data preprocessing or during training? Also, how do you handle blurred hand movements and background shaking in the training videos?

yang19527 commented 2 weeks ago

For long videos, my idea is to use sliding windows during pre-processing to cut each video into multiple clips for training. I wonder if that is feasible?
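
A minimal sketch of this sliding-window pre-processing idea, assuming `ffmpeg`/`ffprobe` are available on the path; the 8-second clip length and stride are illustrative defaults, not values prescribed by the repository.

```python
# Pre-cut a long video into fixed-length clips with a sliding window.
import subprocess

def split_into_clips(src: str, out_pattern: str, clip_sec: float = 8.0, stride_sec: float = 8.0):
    """Write clips of `clip_sec` seconds, starting every `stride_sec` seconds."""
    duration = float(subprocess.check_output([
        "ffprobe", "-v", "error", "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1", src,
    ]).strip())
    start, idx = 0.0, 0
    while start + clip_sec <= duration:
        # stream copy for speed; note that -ss before -i seeks to keyframes
        subprocess.run([
            "ffmpeg", "-y", "-ss", f"{start:.2f}", "-i", src,
            "-t", f"{clip_sec:.2f}", "-c", "copy", out_pattern.format(idx),
        ], check=True)
        start += stride_sec
        idx += 1

# Usage (hypothetical paths):
# split_into_clips("long_video.mp4", "clips/long_video_{:03d}.mp4")
```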

wangxiang1230 commented 2 weeks ago

> For long videos, my idea is to use sliding windows during pre-processing to cut each video into multiple clips for training. I wonder if that is feasible?

Yes, you can do this, but for ease of scaling to other frame counts, such as 64, I recommend reading from the original long video during training. If a video is very long, e.g., more than 20s, it can be cropped into multiple segments of about 8s. Thanks.
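
A minimal sketch of reading from the original long video at training time, as recommended above, assuming `decord` and PyTorch. The ~8s window for videos longer than ~20s mirrors the numbers in this thread; the class and argument names are illustrative assumptions, not code from the repository.

```python
# On-the-fly sampling from the original long video inside a Dataset.
import random
import torch
from decord import VideoReader
from torch.utils.data import Dataset

class LongVideoClipDataset(Dataset):
    def __init__(self, video_paths, num_frames=16, seg_sec=8.0, long_sec=20.0):
        self.video_paths = video_paths
        self.num_frames = num_frames
        self.seg_sec = seg_sec      # window length used for very long videos
        self.long_sec = long_sec    # threshold above which we window the video

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, i):
        vr = VideoReader(self.video_paths[i])
        fps = vr.get_avg_fps()
        total = len(vr)
        # assumes videos shorter than num_frames were already filtered out
        if total / fps > self.long_sec:
            # restrict sampling to a random ~seg_sec window of the long video
            seg_len = min(int(self.seg_sec * fps), total)
            start = random.randint(0, total - seg_len)
            lo, hi = start, start + seg_len
        else:
            lo, hi = 0, total
        # uniformly sample num_frames indices inside the chosen window
        stride = max((hi - lo) // self.num_frames, 1)
        indices = [lo + j * stride for j in range(self.num_frames)]
        frames = vr.get_batch(indices).asnumpy()              # (T, H, W, 3), uint8
        return torch.from_numpy(frames).permute(0, 3, 1, 2)   # (T, 3, H, W)
```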

yang19527 commented 2 weeks ago

In other words, there is no need to process the videos in advance; during the stage-2 training phase, just let the program crop clips according to the video length.