Vchitect / Latte

Latte: Latent Diffusion Transformer for Video Generation.

question on t2v model training #75

diffusion-lover commented 2 months ago

Thanks for sharing this great work!

I found two folders for the VAE in the t2v pretrained model: https://huggingface.co/maxin-cn/Latte/tree/main/t2v_required_models

  1. vae
  2. vae_temporal_decoder

It seems that the t2v model uses the "vae_temporal_decoder" pretrained model to decode latents. Is the "vae" pretrained model used to encode frames when training the transformer network?

maxin-cn commented 2 months ago

> It seems that the t2v model uses the "vae_temporal_decoder" pretrained model to decode latents. Is the "vae" pretrained model used to encode frames when training the transformer network?

Yes, that is right.
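
For readers who want to see how the two folders fit into the pipeline, here is a minimal sketch, assuming the folders load as diffusers' `AutoencoderKL` and `AutoencoderKLTemporalDecoder` respectively. The repo id, subfolder layout, and tensor shapes below are illustrative placeholders; defer to the Latte training and sampling scripts for the authoritative usage.

```python
import torch
from diffusers.models import AutoencoderKL, AutoencoderKLTemporalDecoder

# Hypothetical checkpoint id; the subfolders mirror the HF repo layout above.
repo = "maxin-cn/Latte"

# "vae": a plain image VAE, used to encode frames into latents
# when training the transformer.
vae = AutoencoderKL.from_pretrained(repo, subfolder="t2v_required_models/vae")

# "vae_temporal_decoder": shares the image encoder but decodes with
# temporal layers, used to turn denoised latents back into frames.
vae_td = AutoencoderKLTemporalDecoder.from_pretrained(
    repo, subfolder="t2v_required_models/vae_temporal_decoder"
)

# --- Training side: encode frames independently (time folded into batch) ---
frames = torch.randn(1, 16, 3, 256, 256)  # (B, T, C, H, W), dummy video
b, t = frames.shape[:2]
with torch.no_grad():
    latents = vae.encode(frames.flatten(0, 1)).latent_dist.sample()
    latents = latents * vae.config.scaling_factor  # (B*T, 4, 32, 32)

# --- Sampling side: decode with the temporal decoder ---
# AutoencoderKLTemporalDecoder.decode takes num_frames so its decoder
# can mix information across time.
with torch.no_grad():
    decoded = vae_td.decode(
        latents / vae_td.config.scaling_factor, num_frames=t
    ).sample
video = decoded.unflatten(0, (b, t))  # back to (B, T, C, H, W)
```

The takeaway, as confirmed above, is that only the encoder half of "vae" is exercised during transformer training (the diffusion loss lives in latent space), while "vae_temporal_decoder" is only needed at sampling time to map denoised latents back to video frames.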