jiangzhengkai closed this issue 8 months ago
@bendanzzc Do you discard the first-frame result in your pipeline? Why use 13-frame results instead of all 14 frames?
I randomly select the ref image from the same video, to break the rule that SVD always uses the ref image to generate the first frame.
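A minimal sketch of that sampling strategy (names are hypothetical; it assumes a video is available as an in-memory list of frames):

```python
import random

def sample_training_pair(frames, clip_len=14):
    """Pick a 14-frame training clip, plus a reference frame drawn from
    anywhere in the same video, so the model cannot learn the shortcut
    that the ref image is always the clip's first frame."""
    start = random.randrange(0, len(frames) - clip_len + 1)
    clip = frames[start:start + clip_len]
    ref = random.choice(frames)  # may fall inside or outside the clip
    return ref, clip
```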
In principle, I would use 14 frames. When I wrote the code, the model was not fully trained and the quality of the first frame was poor, so it was discarded. That restriction should be removed now, however.
@bendanzzc Thanks for your reply. I have trained both the UNet and the ControlNet on a fashion dataset, randomly selecting the ref image and using the long-range generation pipeline. However, the results are not good, especially for appearance.
The training resolution is 512x512 and I train for 50k steps in total.
Preserving the aspect ratio of the original image may help the results. In addition, the VAE is very sensitive to size, so you could try a larger resolution. Or wait for us to release the base model for finetuning.
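For the aspect-ratio point, one common "ratio-keep" preprocessing is resize-then-center-crop rather than stretching. A small sketch that just computes the geometry (the function name and 1024x576 default are assumptions, not the repo's actual code; apply the result with e.g. PIL's `resize` and `crop`):

```python
def ratio_keep_size(src_w, src_h, target_w=1024, target_h=576):
    """Return (resize dims, center-crop box) that fill the target
    resolution while preserving the source aspect ratio, instead of
    distorting the image (VAEs are sensitive to such distortion)."""
    scale = max(target_w / src_w, target_h / src_h)  # fill, don't fit
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    left = (new_w - target_w) // 2
    top = (new_h - target_h) // 2
    return (new_w, new_h), (left, top, left + target_w, top + target_h)
```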
Yes, just like the size in the inference code, but I am not sure that is the key point. Or you can train the original SVD ControlNet, like v0.9; if everything is okay, it will converge very fast.
@bendanzzc In your v0.9, do you use the first frame as the ref frame? And for the dataset, do you use the original resize implementation or your own ratio-keep implementation?
Yes, in v0.9 I use the first frame as the ref image, with the ratio-keep implementation, and I only trained the ControlNet for a day on 2 GPUs.
@bendanzzc How do you run inference?
Like the original SVD, it can only generate 14 frames per ref image, so it cannot produce a complete video from a single ref image. It is just a quick demo.
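One possible workaround for the 14-frames-per-ref limit is autoregressive chunking: run the pipeline once per 14-pose chunk and feed the last generated frame back as the next ref image. A sketch under assumed names (`pipe`, its `ref_image`/`poses` keywords, and `generate_long_video` are all hypothetical, not this repo's API):

```python
def generate_long_video(pipe, ref_image, pose_seq, chunk=14):
    """Generate an arbitrarily long video by calling an SVD-style
    pipeline once per pose chunk, reusing the last output frame as
    the ref image for the next chunk. Appearance drift accumulates
    across chunks, which may explain degraded long-range results."""
    frames, ref = [], ref_image
    for i in range(0, len(pose_seq), chunk):
        out = pipe(ref_image=ref, poses=pose_seq[i:i + chunk])
        frames.extend(out)
        ref = out[-1]  # last frame conditions the next chunk
    return frames
```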
@bendanzzc The video results above use the 256x256 training setting. Have you compared different resolutions?
Yes, all the resolutions I've tested are bigger than 512, keeping the video's aspect ratio. A bigger resolution benefits human faces and other small textures to some extent.
@bendanzzc How about the training resolution?
Same as the size in the inference code, or bigger; there is not much difference.
Training resolution is very important for SVD, especially for human faces [bs=1, 8 GPUs, 7k-step results].
If you get better results, you are welcome to share them so we can learn about the latest progress.
Updated results on different poses. Maybe the ControlNet should act as a reference net, as in Animate Anyone, since pose is easy to control; the image latent alone is not enough for appearance.
The result looks cool!
Your idea makes sense! In my opinion, the ControlNet already takes the ref image as input, whose features are added to the UNet, and the UNet itself is conditioned on the ref image as well, so it may already achieve the reference function to some extent. I also feel that data quality and quantity are part of the reason. I am not sure; perhaps the experimental results are what matter.
If you make any new progress, please share it so I can learn from the advanced research.
By the way, I found that the CLIP embedding of the ref image does not work well for SVD. I replaced that embedding with a face-ID embedding, as in IP-Adapter, and used a face-mask loss, hoping to get better results for human faces.
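One way such a face-mask loss could be implemented is as a weighted MSE that upweights pixels inside a detected face mask. A toy sketch on flat lists for clarity (the real version would be a tensor operation on diffusion noise predictions; the function name and 5x default weight are assumptions):

```python
def face_masked_mse(pred, target, face_mask, face_weight=5.0):
    """Per-element squared error, multiplied by face_weight wherever
    face_mask == 1, so training spends more capacity on facial
    detail. Plain MSE is recovered when the mask is all zeros."""
    total, n = 0.0, len(pred)
    for p, t, m in zip(pred, target, face_mask):
        w = 1.0 + (face_weight - 1.0) * m  # mask entries in {0, 1}
        total += w * (p - t) ** 2
    return total / n
```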
Hi, I want to know how you handle long-range frame generation based on SVD. Do you use the first frame as the ref image, or just randomly select one from the whole video?