I2V fine-tuning questions

Hi! I am currently fine-tuning the released I2V model using the code in the sat folder, and I would appreciate some advice on fine-tuning it.

https://github.com/THUDM/CogVideo/blob/a9a55462f363a9f7ef2ba0364ba08a87a7c439bc/sat/diffusion_video.py#L59
The noised-image concat option is false, so the noised image is concatenated only to the first frame, and the other frames are zero-padded.
How does this affect the I2V model's results?
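
To make the question concrete, here is a minimal sketch of how I understand the two behaviors. It is a toy version that uses plain Python lists instead of tensors and leaves out the noise that is added to the image. The names `build_image_condition`, `concat_channels`, and `all_concat` are my own for illustration, not the repo's code.

```python
def build_image_condition(frames, all_concat=False):
    """Build the image condition for a clip.

    frames: list of T frames, each a flat list of floats
            (a toy stand-in for the latent video).
    """
    first = frames[0]
    if all_concat:
        # Repeat the first frame at every time step.
        return [list(first) for _ in frames]
    # Keep the first frame and fill the remaining time steps with zeros.
    return [list(first)] + [[0.0] * len(f) for f in frames[1:]]


def concat_channels(frames, cond):
    """Append the condition to each frame (toy channel-wise concat)."""
    return [f + c for f, c in zip(frames, cond)]


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
cond = build_image_condition(frames, all_concat=False)
model_input = concat_channels(frames, cond)
print(model_input)
# [[1.0, 2.0, 1.0, 2.0], [3.0, 4.0, 0.0, 0.0], [5.0, 6.0, 0.0, 0.0]]
```

With `all_concat=True` every frame would carry a copy of the first frame instead of zeros, which is the alternative I am asking about.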

The released pretrained I2V model uses a [480, 720] resolution, but I am trying to train at a [512, 512] resolution.
https://github.com/THUDM/CogVideo/blob/a9a55462f363a9f7ef2ba0364ba08a87a7c439bc/sat/dit_video_concat.py#L297
So when I load the pretrained model, I use a zero-initialized learnable positional embedding instead of the pretrained learnable positional embedding weights. Is this the right approach?
- Or is there a way to use the pretrained learnable positional embedding weights while fine-tuning at a different resolution? (The sequence length is different.)
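
The sequence-length mismatch can be sketched as follows. This is only rough arithmetic under my assumptions: 8x spatial downsampling in the VAE, a patch size of 2, 4x temporal compression, and 49 input frames. None of these values are taken from the linked line, and the function names are hypothetical.

```python
def tokens_per_latent_frame(height, width, vae_downsample=8, patch_size=2):
    """Return (rows, cols, tokens) of the patch grid for one latent frame."""
    rows = height // vae_downsample // patch_size
    cols = width // vae_downsample // patch_size
    return rows, cols, rows * cols


def latent_frames(num_frames, temporal_compression=4):
    """Number of latent frames, assuming the first frame is kept as-is."""
    return (num_frames - 1) // temporal_compression + 1


t = latent_frames(49)  # 13 under the assumptions above
for height, width in [(480, 720), (512, 512)]:
    rows, cols, per_frame = tokens_per_latent_frame(height, width)
    print(f"{height}x{width}: grid {rows}x{cols}, "
          f"{per_frame} tokens per frame, {t * per_frame} video tokens")
# 480x720: grid 30x45, 1350 tokens per frame, 17550 video tokens
# 512x512: grid 32x32, 1024 tokens per frame, 13312 video tokens
```

Under these assumptions the two patch grids differ in both shape and size, so the pretrained embedding table cannot be copied over directly.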

I am using a dataset of 100k videos to fine-tune the I2V model.
The learning rate is set to 1e-5, and the learning rate scheduler is AnnealingLR as defined in the SAT package.
What learning rate and scheduler are appropriate for fine-tuning the I2V model, and how many iterations would be enough? (I have set 100k iterations for now.)
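
For reference, this is the general kind of schedule I have in mind: linear warmup followed by cosine decay. It is a generic sketch and not SAT's AnnealingLR implementation. The peak learning rate (1e-5) and total iterations (100k) are my current settings, while `warmup_iters` and `min_lr` are hypothetical values chosen only for illustration.

```python
import math


def lr_at(step, peak_lr=1e-5, total_iters=100_000, warmup_iters=1_000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_iters:
        # Ramp linearly from peak_lr / warmup_iters up to peak_lr.
        return peak_lr * (step + 1) / warmup_iters
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_iters) / max(1, total_iters - warmup_iters))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 999, 1_000, 50_500, 100_000):
    print(step, lr_at(step))
```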

Thanks!
