Vchitect / Latte

Latte: Latent Diffusion Transformer for Video Generation.
Apache License 2.0

What is the difference between Latte and ViViT? #59

Open Leeeshuz opened 3 months ago

Leeeshuz commented 3 months ago

The architecture of Latte seems to share the same idea as ViViT. What is the difference between Latte and ViViT?

maxin-cn commented 3 months ago

Hi, thanks for your interest. The main model architecture differences are:

  1. We removed the CLS token. We believe that relying only on the CLS token would lose the vast majority of the information needed for the video generation task.
  2. We added an additional Variant 1 to the model design of Latte. We find that Variant 1 performs better than the other three variants under the same parameters.
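
To make the first point concrete, here is a minimal Python sketch. It is not taken from the Latte or ViViT code; the sizes and the helper names `make_tokens`, `cls_readout`, and `dense_readout` are hypothetical and chosen only for illustration. It contrasts a classifier-style readout, which keeps a single CLS vector, with the readout a generator needs, where every spatio-temporal token is kept so that each patch can be reconstructed. The transformer blocks themselves are omitted.

```python
# Hypothetical sizes, chosen only for illustration (not Latte's real config).
FRAMES, PATCHES_PER_FRAME, DIM = 4, 6, 8


def make_tokens(frames, patches, dim):
    """Return a video token sequence as a list of frames * patches vectors."""
    return [[float(i)] * dim for i in range(frames * patches)]


def cls_readout(tokens, cls_vector):
    """Classifier-style readout: prepend a CLS token and keep only that one."""
    sequence = [cls_vector] + tokens
    # A real model would run transformer blocks over `sequence` here.
    return sequence[0]


def dense_readout(tokens):
    """Generation-style readout: no CLS token, every token is kept."""
    # A real model would run transformer blocks over `tokens` here.
    return tokens


tokens = make_tokens(FRAMES, PATCHES_PER_FRAME, DIM)
cls_out = cls_readout(tokens, [0.0] * DIM)
dense_out = dense_readout(tokens)

# The CLS readout is a single vector of DIM values.
print(len(cls_out))
# The dense readout keeps one DIM-sized vector per patch per frame.
print(len(dense_out), len(dense_out[0]))
```

With these toy sizes the CLS readout carries one vector, while the dense readout carries `FRAMES * PATCHES_PER_FRAME` vectors, which is the per-patch output a generative model has to produce.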

Leeeshuz commented 3 months ago

I see. Some details are quite different. Thanks for your explanation!