Problem of TimeUpsample2x, decoder output frames != encoder input frames

PKU-YuanGroup / Open-Sora-Plan

This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to it.

MIT License


Problem of TimeUpsample2x, decoder output frames != encoder input frames #174

Open SamitHuang opened 7 months ago

SamitHuang commented 7 months ago

Because the first frame is excluded from interpolation in TimeUpsample2x, as in the following code,

x, x_ = x[:, :, :1], x[:, :, 1:]
x_ = F.interpolate(x_, scale_factor=(2, 1, 1), mode='trilinear')
x = torch.concat([x, x_], dim=2)

the number of decoder output frames differs from the number of encoder input frames, which looks wrong for a reconstruction task. For example, with 16 input frames I get 13 output frames at run time.

Why not upsample the whole temporal sequence directly?

x = F.interpolate(x, scale_factor=(2, 1, 1), mode='trilinear')

Btw, would it be better to use a convolution for the upsampling?

qqingzheng commented 7 months ago

During downsampling, we retain the first frame because we consider the first frame of a video to be an image, consistent with CausalConv. Therefore, during training, the number of input video frames should be odd, such as 17. For upsampling, we avoid convolutions in order to reuse as much of the image VAE's weights as possible, and we have improved the upsampling process in subsequent training.
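The frame counts discussed above can be checked with a little arithmetic. The sketch below is not the project's code. It assumes two temporal downsampling stages in the encoder, each with kernel size 3 and stride 2, padded in front with k-1 copies of the first frame, and two upsampling stages in the decoder that follow the snippet quoted earlier in the thread (first frame kept, the remaining frames doubled). The function names are hypothetical. Under these assumptions the reported 16 -> 13 mismatch is reproduced, and an odd count such as 17 round-trips unchanged.

```python
def time_downsample_2x(t: int, kernel_size: int = 3) -> int:
    """Frame count after one causal temporal downsample.

    Assumption: the input is padded in front with (kernel_size - 1) copies
    of the first frame, then pooled with the given kernel and stride 2.
    """
    padded = t + (kernel_size - 1)
    return (padded - kernel_size) // 2 + 1


def time_upsample_2x(t: int) -> int:
    """Frame count after the quoted upsample: frame 0 kept, the other t-1 doubled."""
    return 1 + 2 * (t - 1)


def round_trip(t: int, stages: int = 2) -> int:
    """Encoder followed by decoder, with an assumed number of temporal stages."""
    for _ in range(stages):
        t = time_downsample_2x(t)
    for _ in range(stages):
        t = time_upsample_2x(t)
    return t


for frames in (16, 17):
    print(frames, "->", round_trip(frames))
# 16 -> 13
# 17 -> 17
```

In general, under these assumptions, an input of 4n+1 frames is reconstructed with the same frame count, while other lengths come back shorter.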

SamitHuang commented 7 months ago

"we consider the first frame of a video to be an image..." I see, the first frame is always encoded from the repeated k-1 1st frames. But for upsampling, the interpolation between the first frame and the second frame doesn't conflict this padding logic.

Let's say the input of Upsample is z with shape (b c t h w) = (1 512 4 64 64). If we upsample the whole temporal sequence directly from 4 to 8 frames, the first output frame will still correspond to the first image.
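To make the comparison concrete, here is a minimal sketch of the two variants side by side. It uses a smaller tensor than the (1 512 4 64 64) example to keep it light, and the helper names `upsample_keep_first` and `upsample_whole` are made up for illustration. It relies on the default `align_corners=False` behaviour of `F.interpolate`, under which the first output frame of the whole-sequence variant should equal the first input frame.

```python
import torch
import torch.nn.functional as F


def upsample_keep_first(x: torch.Tensor) -> torch.Tensor:
    """Variant quoted in the issue: keep frame 0, interpolate the rest."""
    first, rest = x[:, :, :1], x[:, :, 1:]
    rest = F.interpolate(rest, scale_factor=(2, 1, 1), mode="trilinear")
    return torch.cat([first, rest], dim=2)


def upsample_whole(x: torch.Tensor) -> torch.Tensor:
    """Proposed variant: interpolate the whole temporal axis."""
    return F.interpolate(x, scale_factor=(2, 1, 1), mode="trilinear")


torch.manual_seed(0)
z = torch.randn(1, 8, 4, 16, 16)  # (b c t h w), reduced from (1 512 4 64 64)

kept = upsample_keep_first(z)
whole = upsample_whole(z)

print(kept.shape[2], whole.shape[2])  # 7 8
# Does frame 0 of the whole-sequence output match frame 0 of the input?
print(torch.allclose(whole[:, :, 0], z[:, :, 0]))
```

The first variant maps t frames to 2t-1, the second to 2t. Whether the second is compatible with the causal padding in the rest of the decoder is the open question in this thread; the sketch only compares shapes and the first frame.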

qqingzheng commented 6 months ago

In the encoder's downsampling, the first frame remains unchanged, so the decoder should stay consistent with that. Moreover, if the first frame is repeated in the Upsample layer, it will cause problems in the subsequent CausalConv3D layers. This is just my understanding and may not be correct.