In the diagram, it should be VLDM instead of LDM, right?
In the base stage, how does the LDM generate a video from the input image? An LDM generally uses a 2D U-Net, which can only generate images, right? If it is instead a VLDM that uses a 3D U-Net, then the input should be multiple frames of noise, right?
In the refinement stage, are we applying the diffusion and denoising process to each frame separately? Here we are also using an LDM, which again uses 2D convolutions, but for temporal coherence we need 3D convolutions, right?
I think I am missing something, can you please help me here.
Thanks a lot in advance.
In the base stage, we take the input image (extracting its CLIP features and its latent representation separately) and combine it with noise as the input to the 3D U-Net, which outputs the video.
In practice, we treat the video as a whole at the input and apply the diffusion and denoising process to it. For the temporal encoding, you can refer to the design of our 3D U-Net in Fig. 3.
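To make the shapes concrete, here is a minimal sketch of how an image latent could be combined with multi-frame noise before it goes into a 3D U-Net. It is not taken from the repository. The helper names (`gaussian_noise`, `repeat_over_frames`, `build_unet_input`), the tensor sizes, and the choice of channel-wise concatenation are all assumptions for illustration; the reply above only says the image is "combined with noise". The CLIP features are left out here.

```python
import random

def gaussian_noise(channels, frames, height, width, seed=0):
    """Return Gaussian noise as nested lists shaped (channels, frames, height, width)."""
    rng = random.Random(seed)
    return [
        [
            [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
            for _ in range(frames)
        ]
        for _ in range(channels)
    ]

def repeat_over_frames(image_latent, frames):
    """Copy a (channels, height, width) image latent onto every frame."""
    return [
        [[row[:] for row in channel] for _ in range(frames)]
        for channel in image_latent
    ]

def shape(x):
    """Return the shape of a regular nested list as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

def build_unet_input(image_latent, frames, seed=0):
    """Assumed combination: concatenate noise and the repeated image latent
    along the channel axis, giving (2 * channels, frames, height, width)."""
    channels = len(image_latent)
    height = len(image_latent[0])
    width = len(image_latent[0][0])
    noise = gaussian_noise(channels, frames, height, width, seed)
    condition = repeat_over_frames(image_latent, frames)
    return noise + condition  # list concatenation acts on the channel axis

# Hypothetical sizes: a 4-channel 8x8 image latent and a 16-frame video.
image_latent = [[[0.5] * 8 for _ in range(8)] for _ in range(4)]
x = build_unet_input(image_latent, frames=16)
print(shape(x))  # (8, 16, 8, 8)
```

The point of the sketch is that the noise already has a frame axis, so the whole video is denoised jointly rather than one frame at a time.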
Thank you.
So the input to the pre-trained LDM in the refinement stage is the resized output of the base-stage LDM? An LDM normally takes noise as input and denoises it to generate a video. If it instead takes a resized (noise-free) video as input, how does it denoise when there is no noise?
Great work, team. I have a few questions.