THUDM / CogVideo

Text-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

Great work! When are you planning to release image-to-video models? #88

Open gxd1994 opened 1 month ago

zRzRzRzRzRzRzR commented 1 month ago

Thank you for your support. However, this might take some time as we currently do not have any related plans in the near future. Thank you for your understanding.

StarCycle commented 1 month ago

Hi @zRzRzRzRzRzRzR

But as you mentioned in your paper, you already have an image-to-video version of CogVideoX

[image]

tengjiayan20 commented 1 month ago

> Hi @zRzRzRzRzRzRzR
>
> But as you mentioned in your paper, you already have an image-to-video version of CogVideoX
>
> [image]

Yes, the reply above means that we do not plan to open-source the image-to-video model in the near future. Please stay tuned.

matbee-eth commented 1 month ago

> Hi @zRzRzRzRzRzRzR But as you mentioned in your paper, you already have an image-to-video version of CogVideoX [image]

> Yes, the reply above means that we do not plan to open-source the image-to-video model in the near future. Please stay tuned.

Would absolutely appreciate the release of the img+text to video :)

StarCycle commented 1 month ago

Hi @tengjiayan20,

Thank you for the response!

Is it difficult to fine-tune an image-to-video model myself on the WebVid10M dataset? How many samples and training steps would that take?

Do you apply a fixed noise level on the image condition in the diffusion process?

Sorry, but I really need an image-to-video model for my application.

Best wishes, StarCycle

tengjiayan20 commented 1 month ago

> Hi @tengjiayan20,
>
> Thank you for the response!
>
> Is it difficult to fine-tune an image-to-video model myself on the WebVid10M dataset? How many samples and training steps would that take?
>
> Do you apply a fixed noise level on the image condition in the diffusion process?
>
> Sorry, but I really need an image-to-video model for my application.
>
> Best wishes, StarCycle

  1. I think it is OK. After all, many image-to-video works have shown that the WebVid dataset can serve this task; the key point is that they lack a better base text-to-video model.
  2. Augmentation during training is beneficial.

eugenelyj commented 1 month ago

@tengjiayan20 Dear author, do you apply the same noise level for all timesteps during training, or do you add timestep-dependent noise when training the image-to-video model?

tengjiayan20 commented 1 month ago

> @tengjiayan20 Dear author, do you apply the same noise level for all timesteps during training, or do you add timestep-dependent noise when training the image-to-video model?

Usually the strength of the augmentation is random and dynamic. Since the augmentation is added to the image condition, and the image condition is constant during sampling and does not change with the timestep, I don't think the augmentation strength needs to change with the timestep. It is there only to improve robustness and close the gap between the conditions seen during training and inference. But of course you can try it; it might work better in practice.
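To make the idea concrete, here is a minimal sketch of this kind of condition augmentation: a noise level is drawn at random per sample, independent of the diffusion timestep. The function name and the `max_sigma` range are illustrative assumptions, not CogVideoX's actual settings.

```python
import torch

def augment_image_condition(image_latent: torch.Tensor,
                            max_sigma: float = 0.7) -> torch.Tensor:
    """Add noise of random strength to the image condition.

    The strength is sampled once per batch element and does not depend
    on the diffusion timestep: the condition stays fixed during
    sampling, so the augmentation only needs to bridge the gap between
    clean training conditions and imperfect inference-time inputs.
    """
    # One random noise level per example, broadcast over C/H/W.
    sigma = torch.rand(image_latent.shape[0], 1, 1, 1) * max_sigma
    return image_latent + sigma * torch.randn_like(image_latent)
```

At inference time one would either skip the augmentation or apply a small fixed sigma (and, in SVD-style models, feed that sigma to the network as an extra conditioning signal).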

jinhuaca commented 1 month ago

Do you do conditioning in i2v models the same way as in t2v models? For example, do you concatenate the image embeddings (instead of text embeddings) with the video tokens as conditioning? Or do you instead replace the first frame of the video latent with the image latent?
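One common answer in the literature (e.g. Stable Video Diffusion) is neither of these exactly: the conditioning-frame latent is repeated across time and concatenated with the noisy video latents along the channel dimension, which is what doubles the model's input channels. A sketch under that assumption (this is not confirmed to be CogVideoX's mechanism):

```python
import torch

def concat_image_condition(noisy_video: torch.Tensor,
                           image_latent: torch.Tensor) -> torch.Tensor:
    """SVD-style channel-concat conditioning (sketch).

    noisy_video:  (B, T, C, H, W) noisy video latents
    image_latent: (B, C, H, W) latent of the conditioning frame
    Returns (B, T, 2C, H, W): the image latent is repeated across the
    time axis and concatenated with the video latents along channels.
    """
    b, t, c, h, w = noisy_video.shape
    cond = image_latent.unsqueeze(1).expand(b, t, c, h, w)
    return torch.cat([noisy_video, cond], dim=2)
```

The backbone's first layer then has to accept `2C` input channels, which is exactly the doubling discussed below.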

eugenelyj commented 3 weeks ago

@tengjiayan20 Dear author, one more question... Do you jointly train text-to-image when you fine-tune the image-to-video model? Because for image-to-video the latent channels are doubled (concatenated with the first frame), I am confused about how to do it.

wangqiang9 commented 1 week ago

> @tengjiayan20 Dear author, one more question... Do you jointly train text-to-image when you fine-tune the image-to-video model? Because for image-to-video the latent channels are doubled (concatenated with the first frame), I am confused about how to do it.

SVD uses a conv to take in the doubled channels...
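A common way to do this (used by SD-inpainting-style fine-tunes and similar i2v adaptations) is to widen the first conv's input channels and zero-initialize the new weights, so the widened model initially behaves exactly like the pretrained base model and only gradually learns to use the concatenated condition. A sketch, with an illustrative helper name:

```python
import torch
import torch.nn as nn

def expand_conv_in(conv: nn.Conv2d, extra_in_channels: int) -> nn.Conv2d:
    """Widen a conv's input channels, zero-initializing the new weights.

    The pretrained weights are copied into the first `conv.in_channels`
    slots; the weights for the extra (condition) channels start at zero,
    so the expanded conv initially ignores the new input.
    """
    new_conv = nn.Conv2d(conv.in_channels + extra_in_channels,
                         conv.out_channels,
                         kernel_size=conv.kernel_size,
                         stride=conv.stride,
                         padding=conv.padding,
                         bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, :conv.in_channels] = conv.weight
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv
```

Because the extra weights are zero, the widened conv produces the same output as the original on the old channels regardless of what is concatenated, which makes mixed training (with or without an image condition) straightforward.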