Question about the One-shot tuning Text-to-Video algorithm

G-U-N / Gen-L-Video

The official implementation for "Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising".

https://arxiv.org/abs/2305.18264

Apache License 2.0


Question about the One-shot tuning Text-to-Video algorithm #12

Closed: fradino closed this issue 1 year ago

fradino commented 1 year ago

Hello, I have a question about the pipeline of the One-shot tuning Text-to-Video algorithm. I am confused about how the algorithm below is reflected in the One-shot tuning code. The paper says 'The total number of frames of the video is S ∗ N + M'. In the code, the for loop is 'for i in range(0,video_length-clip_length+1,clip_length):'. Does this mean S == M in this code? Thank you very much!

G-U-N commented 1 year ago

When decoding with the VAE, or when run_isolated is set to true, S = M. When using ddim_inversion_long or gen_long, S = Stride and M = clip_length.

I also just found a typo there: the total number of frames should be S*(N-1)+M. Thanks for raising this issue; I will correct it when uploading a newer version.
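
For readers following the thread, here is a minimal sketch of how the clip windows and the corrected frame count S*(N-1)+M relate. It is not code from the repository: the helper names `clip_starts` and `covered_frames` are hypothetical, and the sketch assumes the video length is chosen so that the clips tile it exactly.

```python
def clip_starts(video_length, clip_length, stride):
    """Start index of every full clip of clip_length frames, spaced by stride.

    Hypothetical helper; it mirrors the loop quoted in the question, with the
    step generalised from clip_length to an arbitrary stride.
    """
    return list(range(0, video_length - clip_length + 1, stride))


def covered_frames(num_clips, clip_length, stride):
    """Frames spanned by num_clips clips: S*(N-1)+M with S=stride, M=clip_length."""
    return stride * (num_clips - 1) + clip_length


# Non-overlapping case (S == M), as in the loop quoted in the question.
isolated = clip_starts(video_length=24, clip_length=8, stride=8)
print(isolated)                             # [0, 8, 16]
print(covered_frames(len(isolated), 8, 8))  # 24

# Overlapping case (S < M): stride 4, clip length 8.
overlapping = clip_starts(video_length=24, clip_length=8, stride=4)
print(overlapping)                             # [0, 4, 8, 12, 16]
print(covered_frames(len(overlapping), 8, 4))  # 24
```

In both cases the last clip starts at S*(N-1) and runs for M frames, which is why the total is S*(N-1)+M rather than S*N+M.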