Hi @zRzRzRzRzRzRzR
But as you mentioned in your paper, you already have an image-to-video version of CogVideoX
Yes, the reply above means that we do not plan to open-source the image-to-video model in the near future. Please stay tuned and look forward to it.
Would absolutely appreciate the release of the img+text to video :)
Hi @tengjiayan20,
Thank you for the response!
Is it difficult to fine-tune an image-to-video model myself on the WebVid10M dataset? How many samples and training steps would that take?
Do you apply a fixed noise level to the image condition in the diffusion process?
Sorry, but I really need an image-to-video model for my application.
Best wishes, StarCycle
@tengjiayan20 Dear author, do you apply the same noise level across all training timesteps, or do you add timestep-dependent noise when training the image-to-video model?
Usually the strength of the augmentation is random and dynamic. Since the augmentation is applied to the image condition, which stays constant during sampling and does not change with the timestep, I don't think the augmentation strength needs to vary with the timestep. It simply improves robustness and narrows the gap between the conditions seen during training and inference. But of course, you can try it; it might work better in practice.
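For anyone implementing this, here is a minimal PyTorch sketch of such random-strength noise augmentation. Note the function name, the uniform distribution, and `sigma_max` are my assumptions for illustration; the thread does not reveal the actual parameters of CogVideoX's `add_noise_to_first_frame`.

```python
import torch

def add_noise_to_image_condition(image_latent: torch.Tensor,
                                 sigma_max: float = 0.1) -> torch.Tensor:
    """Noise-augment the image condition with a per-sample random strength.

    The distribution and sigma_max are illustrative assumptions, not the
    authors' actual implementation.
    """
    b = image_latent.shape[0]
    # One random strength per example, broadcast over the remaining dims.
    sigma = torch.rand(b, device=image_latent.device) * sigma_max
    sigma = sigma.view(b, *([1] * (image_latent.dim() - 1)))
    # The strength is independent of the diffusion timestep: the condition
    # stays fixed during sampling, so the noise only serves to close the
    # train/inference gap, as described above.
    return image_latent + sigma * torch.randn_like(image_latent)
```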
Do you do conditioning in i2v models similarly to t2v models? For example, do you concatenate the image embeddings (instead of text embeddings) with the video tokens as conditioning? Or do you instead replace the first frame of the video latent with the image?
@tengjiayan20 Dear author, one more question: do you jointly train text-to-video when you fine-tune the image-to-video model? Because for image-to-video the latent channels are doubled (concatenated with the first frame), I am confused about how to do this.
SVD handles the doubled channels by widening the input conv...
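For context, a minimal PyTorch sketch of that widening trick, assuming the model's input layer is an `nn.Conv3d` (CogVideoX's actual patch-embedding layer may differ). The new input channels are zero-initialized, so the expanded model initially ignores the image condition and behaves exactly like the pretrained text-to-video model:

```python
import torch
import torch.nn as nn

def expand_conv_in(conv_in: nn.Conv3d) -> nn.Conv3d:
    """Double the input channels of the first conv so the noisy video
    latent can be channel-concatenated with the image-condition latent."""
    old_c = conv_in.in_channels
    new_conv = nn.Conv3d(
        old_c * 2, conv_in.out_channels,
        kernel_size=conv_in.kernel_size,
        stride=conv_in.stride,
        padding=conv_in.padding,
        bias=conv_in.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()                   # new channels start at zero
        new_conv.weight[:, :old_c] = conv_in.weight  # copy pretrained weights
        if conv_in.bias is not None:
            new_conv.bias.copy_(conv_in.bias)
    return new_conv
```

Zero-initializing the extra channels means fine-tuning starts from the pretrained text-to-video behavior and gradually learns to use the image condition.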
@zRzRzRzRzRzRzR Dear author, can you share the implementation of add_noise_to_first_frame, in particular the detailed parameters (distribution) of the added noise?
Yes, check our sat code now; it's in a PR.
And the I2V model will be open-sourced within the next 24 hours. Closing this issue.
Thank you for your support. However, this might take some time, as we currently have no related plans for the near future. Thank you for your understanding.