THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0
8.1k stars, 762 forks

I2V fine-tuning questions #384

Open yjhong89 opened 2 weeks ago

yjhong89 commented 2 weeks ago

Hi! I am currently fine-tuning the released I2V model with the code in the sat folder and would like some advice on the following points:

Noised image concat

Learnable positional embedding of RoPE

Learning rate and schedule

Thanks!

yzy-thu commented 1 day ago

Noised image concat: Performance is similar when the images are concatenated in different ways.

Learnable positional embedding: We observed that removing the learnable positional embedding does not significantly affect the results, so you can remove it directly and reinitialize a new one.
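
For readers who want to try the "remove it and reinitialize" suggestion, here is a minimal sketch of one way to do it. It is not the maintainers' code. It assumes the checkpoint is a flat mapping from parameter names to values and that the learnable positional embedding can be recognized by a substring in its key. The marker `pos_embedding`, the helper `drop_positional_embedding`, and the example key names are all hypothetical, so check the real key names in your checkpoint first. Plain lists stand in for tensors to keep the sketch dependency-free.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical marker: the real parameter name in the checkpoint may differ.
POS_EMB_MARKER = "pos_embedding"


def drop_positional_embedding(
    state_dict: Dict[str, Any], marker: str = POS_EMB_MARKER
) -> Tuple[Dict[str, Any], List[str]]:
    """Return a copy of state_dict without entries whose key contains
    `marker`, together with the list of keys that were removed."""
    removed = [key for key in state_dict if marker in key]
    kept = {key: value for key, value in state_dict.items() if marker not in key}
    return kept, removed


# Toy checkpoint with made-up key names; lists stand in for tensors.
checkpoint = {
    "transformer.layers.0.attention.weight": [0.1, 0.2],
    "mixins.pos_embed.pos_embedding": [0.0, 0.0, 0.0],
    "transformer.final_layernorm.weight": [1.0],
}

filtered, removed = drop_positional_embedding(checkpoint)
print("removed keys:", removed)
print("kept keys:", sorted(filtered))
```

If the filtered dictionary is then loaded non-strictly (in PyTorch, `load_state_dict(filtered, strict=False)`), the model keeps its freshly initialized positional embedding and takes every other weight from the checkpoint.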