SamitHuang opened this issue 7 months ago
During downsampling, we retain the first frame because we consider the first frame of a video to be an image, consistent with CausalConv. Therefore, during training, the number of input video frames should be odd, e.g., 17. For upsampling, we avoid convolutions in order to maximize reuse of the weights from the image VAE, and we have improved the upsampling process in subsequent training.
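To make the odd-frame constraint concrete, here is a minimal frame-count sketch. It assumes a kernel-3/stride-2 causal average pool on the encoder side and first-frame-preserving 2x interpolation on the decoder side, as in typical CausalVideoVAE implementations; the function names are illustrative, not from the repo:

```python
def causal_downsample_len(t: int, kernel: int = 3, stride: int = 2) -> int:
    # The first frame is replicated (kernel - 1) times in front, so the
    # pooled length is floor((t + kernel - 1 - kernel) / stride) + 1,
    # i.e. the kernel size cancels out of the formula.
    return (t - 1) // stride + 1

def causal_upsample_len(t: int) -> int:
    # The first frame is kept as-is; the remaining t - 1 frames are doubled.
    return 1 + (t - 1) * 2

t = 17                                                 # odd input, as recommended
z = causal_downsample_len(causal_downsample_len(t))    # 17 -> 9 -> 5
out = causal_upsample_len(causal_upsample_len(z))      # 5 -> 9 -> 17
print(z, out)  # 5 17  -- odd inputs round-trip exactly
```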
"we consider the first frame of a video to be an image..." I see, the first frame is always encoded from the repeated k-1 1st frames. But for upsampling, the interpolation between the first frame and the second frame doesn't conflict this padding logic.
Let's say the input of Upsample is z of shape (b, c, t, h, w) = (1, 512, 4, 64, 64). If we directly upsample the whole temporal sequence from 4 frames to 8, the first output frame will still correspond to the first image.
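Concretely, the suggestion is something like the following (a sketch using the shapes above; the interpolation mode is an assumption):

```python
import torch
import torch.nn.functional as F

z = torch.randn(1, 512, 4, 64, 64)  # (b, c, t, h, w)
out = F.interpolate(z, scale_factor=(2, 1, 1), mode="trilinear")
print(out.shape)  # torch.Size([1, 512, 8, 64, 64])
# With align_corners=False (the default), output frame 0 is sampled at
# source coordinate t = -0.25, which clamps to the first latent frame,
# so the first output frame still corresponds to the first image.
```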
In the encoder's downsampling, the first frame remains unchanged, so the same consistency should be maintained in the decoder. Moreover, if the first frame were duplicated in the Upsample layer, it would cause problems in the subsequent CausalConv3D layers. This is just my understanding and may not be correct.
Because the first frame is excluded from the interpolation in TimeUpsample2x (see the sketch below), the decoder's output frame count does not equal the encoder's input frame count, which looks wrong for reconstruction tasks; e.g., with 16 input frames, the output is 13 frames at runtime.
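For context, the interpolation in question looks roughly like this (a sketch paraphrasing the TimeUpsample2x pattern, not guaranteed to match the repo verbatim), along with the 16 -> 13 frame count it produces:

```python
import torch
import torch.nn.functional as F

def time_upsample2x(x):
    # x: (b, c, t, h, w); keep frame 0 untouched, double only frames 1..t-1
    if x.size(2) > 1:
        first, rest = x[:, :, :1], x[:, :, 1:]
        rest = F.interpolate(rest, scale_factor=(2, 1, 1), mode="trilinear")
        x = torch.cat([first, rest], dim=2)
    return x

# With 16 input frames the encoder gives 16 -> 8 -> 4 latent frames,
# but two of these upsamples give 4 -> 7 -> 13, not 16:
z = torch.randn(1, 4, 4, 8, 8)  # toy latent with t = 4
print(time_upsample2x(time_upsample2x(z)).shape[2])  # 13
```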
Why not directly upsample the whole temporal sequence?
By the way, would it be better to use a convolution for the upsampling?