PKU-YuanGroup / Open-Sora-Plan

This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to this project.
MIT License

How to understand the training stage of videoVAE? #299

Open GFENGG opened 4 months ago

GFENGG commented 4 months ago

Hello, I found that there are three stages in the training of the video VAE in report v1.1.0:

> Similar to v1.0.0, we initialized from the Latent Diffusion's VAE and used tail initialization. For CausalVideoVAE, we trained for 100k steps in the first stage with a video shape of 9×256×256. Subsequently, we increased the frame count from 9 to 25 and found that this significantly improved the model's performance. It is important to clarify that we enabled the mixed factor during both the first and second stages, with a value of a (sigmoid(mixed factor)) reaching 0.88 at the end of training, indicating the model's tendency to retain low-frequency information. In the third stage, we reinitialized the mixed factor to 0.5 (sigmoid(0.5) = 0.6225), which further enhanced the model's capabilities.

So what do these three stages mean? Can you describe them in detail? Thank you!
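For reference, the mixing described in the report can be sketched as below. This is only an illustrative assumption of how a scalar logit (the mixed factor) could blend two temporal paths via a sigmoid; the function names and the list-based blend are hypothetical and are not taken from the repository's actual module:

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function mapping a raw logit to a (0, 1) blend weight."""
    return 1.0 / (1.0 + math.exp(-x))

def blend(low_freq, high_freq, mixed_factor):
    """Blend two hypothetical temporal feature paths with a = sigmoid(mixed_factor).

    `low_freq` and `high_freq` stand in for the two temporal-sampling
    branches; the real tensor shapes are not modeled here.
    """
    a = sigmoid(mixed_factor)
    return [a * l + (1.0 - a) * h for l, h in zip(low_freq, high_freq)]

# Reinitializing the mixed factor to 0.5 gives a blend weight of ~0.6225,
# the value quoted in the report.
print(round(sigmoid(0.5), 4))          # 0.6225
# A final blend weight of 0.88 corresponds to a logit of about 1.99.
print(round(math.log(0.88 / 0.12), 2))  # 1.99
```

Under this reading, "reinitializing the mixed factor to 0.5" resets the raw logit, not the blend weight itself, which is why the report states sigmoid(0.5) = 0.6225 as the resulting weight.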

LinB203 commented 3 months ago

In fact, based on our current experience, there is no need to start training from frame 9. Training at 25 or 33 frames is a good start.

GFENGG commented 3 months ago

> In fact, based on our current experience, there is no need to start training from frame 9. Training at 25 or 33 frames is a good start.

Thanks for your reply. I have another question: what is the initial value of mix_factor (used in temporal down- and up-sampling) during training?