jiangzhengkai closed this issue 8 months ago
@bendanzzc Do you discard the first-frame result in your pipeline? Why use 13-frame results instead of all 14 frames?
I randomly select the ref image from the same video, to break the rule that SVD always uses the ref image to generate the first frame.
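A minimal sketch of that sampling strategy (names are hypothetical; it assumes a video is available as an in-memory list of frames):

```python
import random

def sample_training_pair(frames, clip_len=14):
    """Pick a 14-frame training clip, plus a reference frame drawn from
    anywhere in the same video, so the model cannot learn the shortcut
    that the ref image is always the clip's first frame."""
    start = random.randrange(0, len(frames) - clip_len + 1)
    clip = frames[start:start + clip_len]
    ref = random.choice(frames)  # may fall inside or outside the clip
    return ref, clip
```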
In principle, I would use 14 frames. When I wrote the code, the model was not fully trained and the quality of the first frame was poor, so it was discarded. That restriction should be removed now, however.
@bendanzzc Thanks for your reply. I have trained both the UNet and the ControlNet on a fashion dataset, randomly selecting the ref image and using the long-range generation pipeline. However, the results are not good, especially for appearance.
The training resolution is 512x512 and I train for 50k steps in total.
Preserving the aspect ratio of the original image may help the results. In addition, the VAE is very sensitive to size, so you could try a larger resolution. Or wait for us to release the base model for finetuning.
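For the aspect-ratio point, one common "ratio-keep" preprocessing is resize-then-center-crop rather than stretching. A small sketch that just computes the geometry (the function name and 1024x576 default are assumptions, not the repo's actual code; apply the result with e.g. PIL's `resize` and `crop`):

```python
def ratio_keep_size(src_w, src_h, target_w=1024, target_h=576):
    """Return (resize dims, center-crop box) that fill the target
    resolution while preserving the source aspect ratio, instead of
    distorting the image (VAEs are sensitive to such distortion)."""
    scale = max(target_w / src_w, target_h / src_h)  # fill, don't fit
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    left = (new_w - target_w) // 2
    top = (new_h - target_h) // 2
    return (new_w, new_h), (left, top, left + target_w, top + target_h)
```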
Yes, just like the size in the inference code, but I am not sure that is the key point. Or you can train the original SVD ControlNet, like v0.9; if everything is okay, it will converge very fast.
@bendanzzc In your v0.9, do you use the first frame as the ref frame? And for the dataset, do you use the original resize implementation or your own ratio-keep implementation?
Yes, in v0.9 I use the first frame as the ref image, with the ratio-keep implementation, and I only trained the ControlNet for a day on 2 GPUs.
@bendanzzc How do you run inference?
Like the original SVD, it can only generate 14 frames per ref image, so it cannot produce a complete video from a single ref image. It is just a quick demo.
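One possible workaround for the 14-frames-per-ref limit is autoregressive chunking: run the pipeline once per 14-pose chunk and feed the last generated frame back as the next ref image. A sketch under assumed names (`pipe`, its `ref_image`/`poses` keywords, and `generate_long_video` are all hypothetical, not this repo's API):

```python
def generate_long_video(pipe, ref_image, pose_seq, chunk=14):
    """Generate an arbitrarily long video by calling an SVD-style
    pipeline once per pose chunk, reusing the last output frame as
    the ref image for the next chunk. Appearance drift accumulates
    across chunks, which may explain degraded long-range results."""
    frames, ref = [], ref_image
    for i in range(0, len(pose_seq), chunk):
        out = pipe(ref_image=ref, poses=pose_seq[i:i + chunk])
        frames.extend(out)
        ref = out[-1]  # last frame conditions the next chunk
    return frames
```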
@bendanzzc The video results above use the 256x256 training setting. Have you compared different resolutions?
Yes, all the resolutions I've tested are bigger than 512, keeping the video's aspect ratio. A bigger resolution benefits human faces and other small textures to some extent.
@bendanzzc How about the training resolution?
Same as the size in the inference code, or bigger; there is not much difference.
Training resolution is very important for SVD, especially for human faces [bs=1, 8 GPUs, 7k-step results].
If you get better results, you are welcome to share them so we can learn about the latest progress.
Updated results on different poses. Maybe the ControlNet should act as a reference net, as in Animate Anyone, since pose is easy to control; the image latent alone is not enough for appearance.
The result looks cool!
Your idea makes sense! In my opinion, the ControlNet already takes the ref image as input, whose features are added to the UNet, and the UNet itself is conditioned on the ref image as well, so it may already achieve the reference function to some extent. I also feel that data quality and quantity are part of the reason. I am not sure; perhaps the experimental results are what matter.
If you make any new progress, please share it so I can learn from the advanced research.
By the way, I found that the CLIP embedding of the ref image does not work well for SVD. I replaced that embedding with a face-ID embedding, as in IP-Adapter, and used a face-mask loss, hoping to get better results for human faces.
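One way such a face-mask loss could be implemented is as a weighted MSE that upweights pixels inside a detected face mask. A toy sketch on flat lists for clarity (the real version would be a tensor operation on diffusion noise predictions; the function name and 5x default weight are assumptions):

```python
def face_masked_mse(pred, target, face_mask, face_weight=5.0):
    """Per-element squared error, multiplied by face_weight wherever
    face_mask == 1, so training spends more capacity on facial
    detail. Plain MSE is recovered when the mask is all zeros."""
    total, n = 0.0, len(pred)
    for p, t, m in zip(pred, target, face_mask):
        w = 1.0 + (face_weight - 1.0) * m  # mask entries in {0, 1}
        total += w * (p - t) ** 2
    return total / n
```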
Hi, I want to know how you handle long-range frame generation based on SVD. Do you use the first frame as the ref image, or just randomly select one from the whole video?