Great work! when are you planning to release image-to-video models?

THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)

Apache License 2.0

9.32k stars 878 forks source link

Great work! when are you planning to release image-to-video models? #88

Closed gxd1994 closed 2 months ago

zRzRzRzRzRzRzR commented 3 months ago

Thank you for your support. However, this might take some time as we currently do not have any related plans in the near future. Thank you for your understanding.

StarCycle commented 3 months ago

Hi @zRzRzRzRzRzRzR

But as you mentioned in your paper, you already have an image-to-video version of CogVideoX

tengjiayan20 commented 3 months ago

Hi @zRzRzRzRzRzRzR

But as you mentioned in your paper, you already have an image-to-video version of CogVideoX

Yes, the above reply means that we do not plan to open source the image-generated video model in the near future. Please pay attention and look forward to it.

matbee-eth commented 3 months ago

Hi @zRzRzRzRzRzRzR But as you mentioned in your paper, you already have an image-to-video version of CogVideoX

Yes, the above reply means that we do not plan to open source the image-generated video model in the near future. Please pay attention and look forward to it.

Would absolutely appreciate the release of the img+text to video :)

StarCycle commented 3 months ago

Hi @tengjiayan20,

Thank you for the response!

Is it difficult to finetune an image-to-video model by myself on the WebVid10M dataset? How many samples and trainning steps do you need to do that?

Do you apply a fixed noise level on the image condition in the diffusion process?

Sorry but I really need a image2video model in my application.

Best wishes, StarCycle

tengjiayan20 commented 3 months ago

Hi @tengjiayan20,

Thank you for the response!

Is it difficult to finetune an image-to-video model by myself on the WebVid10M dataset? How many samples and trainning steps do you need to do that?

Do you apply a fixed noise level on the image condition in the diffusion process?

Sorry but I really need a image2video model in my application.

Best wishes, StarCycle

I think it is ok. After all, many image-to-video works have verified that webvid dataset can satisfy this task. The key is that they don't have a better base text-to-video model.
Augmentation in training is beneficial

eugenelyj commented 3 months ago

@tengjiayan20 Dear author, Do you apply the same noise level for all the timestep training, or do you apply timestep-dependent noise adding when training the image-to-video model?

tengjiayan20 commented 3 months ago

@tengjiayan20 Dear author, Do you apply the same noise level for all the timestep training, or do you apply timestep-dependent noise adding when training the image-to-video model?

Usually the strength of augmentation is random and dynamic. Since the augmentation is added to the image condition, and the image condition is constant during sampling and does not change with the timestep, I think the strength of augmentation does not need to change with the timestep. It is just to enhance robustness to fill the gap between the conditions during training and inference. But of course, you can try it, maybe it would be better in practice.

jinhuaca commented 3 months ago

Do you do conditioning similarly in i2v models as compared to t2v models? For example, do you concatenate the image embeddings (instead of text embeddings) with the video tokens as conditioning? Or instead, do you replace the first frame of the latent model with the image?

eugenelyj commented 3 months ago

@tengjiayan20 Dear author, one more question....Do you joint training the text-to-image when you fine-tune the image-to-video? Because for image-to-video, the latent channel have been doubled (concat with first frame), I am confused how to do it.

wangqiang9 commented 2 months ago

@tengjiayan20 Dear author, one more question....Do you joint training the text-to-image when you fine-tune the image-to-video? Because for image-to-video, the latent channel have been doubled (concat with first frame), I am confused how to do it.

SVD use a conv to upscale the doubled channels...

eugenelyj commented 2 months ago

@zRzRzRzRzRzRzR Dear author, can you share the implementation of add_noise_to_first_frame, in particular the detailed parameters (distribution) of the added noise.

zRzRzRzRzRzRzR commented 2 months ago

yes,check our sat code now, is in pr

zRzRzRzRzRzRzR commented 2 months ago

And I2V model will opensource in next 24 hours, this issue close