Weifeng-Chen / control-a-video

Official Implementation of "Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models"
GNU General Public License v3.0

Condition on first frame #2

julkaztwittera closed this issue 1 year ago

julkaztwittera commented 1 year ago

Hi, great work! You write that you condition the model on the first frame, but I don't understand how it is done. The only thing I can see in the code is that you set the first latent to the proper first-frame latent, and that's all. How is the model going to learn from that? Where is the auto-regressive part, at least? Shouldn't this conditioning be done via cross-attention? Thank you in advance.

Weifeng-Chen commented 1 year ago


Hi, I just updated the readme and the inference script; I hope that helps. And yes, there is no cross-attention: we simply replace the noise of the first frame with its clean latent during training, and the same replacement does the job at inference.
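For anyone reading along, here is a minimal sketch of what that latent replacement could look like in a training step. This is not the repo's actual code; `unet`, `scheduler`, `latents`, and `prompt_emb` are illustrative names, and the only point is that frame 0 keeps its clean latent while the remaining frames are noised and supervised.

```python
import torch
import torch.nn.functional as F

def train_step(unet, scheduler, latents, prompt_emb):
    # latents: (batch, channels, num_frames, H, W) pre-encoded video latents
    noise = torch.randn_like(latents)
    t = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # First-frame conditioning: overwrite frame 0 with its clean latent, so the
    # model learns to denoise the remaining frames consistently with it.
    noisy_latents[:, :, 0] = latents[:, :, 0]

    pred = unet(noisy_latents, t, encoder_hidden_states=prompt_emb).sample
    # Supervise only the frames that were actually noised.
    return F.mse_loss(pred[:, :, 1:], noise[:, :, 1:])
```

At inference the idea is the same: at every denoising step the slot for frame 0 is overwritten with the encoded first-frame latent, so the sampler only ever denoises frames 1..N while staying consistent with the given first frame.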