THUDM / CogVideo

Text-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

Questions about implementation #161

Closed fatbao55 closed 2 weeks ago

fatbao55 commented 3 weeks ago

Hi authors,

Thanks for the great work! I have a few questions regarding the implementation:

  1. In the "Expert Transformer Block" section, you mention that the Vision Expert Adaptive Layernorm (Vision Expert AdaLN) and the Text Expert Adaptive Layernorm (Text Expert AdaLN) apply the modulation mechanism to the vision hidden states and the text hidden states, respectively. However, I cannot find separate adaptive layer norms for the text and vision hidden states as described in the paper. Instead, the code appears to use a single shared layer norm (CogVideoXLayerNormZero) for both streams, and it looks like a plain layer norm rather than an adaptive one. I might be mistaken, but could you point me to the part of the code where the text and vision expert adaptive layer norms are implemented? (A rough sketch of what I expected from the paper is included after this list.)

  2. For the image-to-video implementation, the number of latent channels is doubled. Do you also double the channel dimension of the transformer blocks to accommodate this? And does that mean the text-to-video transformer blocks have to be retrained?
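
For context on question 1, here is a rough sketch of the structure I expected from the paper's description of the two expert AdaLNs: two independent modulation branches conditioned on the timestep embedding, one applied to the vision tokens and one to the text tokens. This is only my own illustration; the module and parameter names are mine, not from the codebase.

    import torch.nn as nn
    import torch.nn.functional as F

    class ExpertAdaLNSketch(nn.Module):
        # Hypothetical illustration of separate "Vision Expert AdaLN" / "Text Expert AdaLN":
        # each stream gets its own shift/scale predicted from the timestep embedding.
        def __init__(self, time_embed_dim: int, hidden_dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
            self.vision_mod = nn.Linear(time_embed_dim, 2 * hidden_dim)  # shift, scale for vision tokens
            self.text_mod = nn.Linear(time_embed_dim, 2 * hidden_dim)    # shift, scale for text tokens

        def forward(self, vision_states, text_states, temb):
            v_shift, v_scale = self.vision_mod(F.silu(temb)).chunk(2, dim=-1)
            t_shift, t_scale = self.text_mod(F.silu(temb)).chunk(2, dim=-1)
            vision_states = self.norm(vision_states) * (1 + v_scale[:, None]) + v_shift[:, None]
            text_states = self.norm(text_states) * (1 + t_scale[:, None]) + t_shift[:, None]
            return vision_states, text_states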

Hope to get your advice on these. Thanks!

zRzRzRzRzRzRzR commented 2 weeks ago
  1. You can check:

    self.norm_out = AdaLayerNorm(
        embedding_dim=time_embed_dim,
        output_dim=2 * inner_dim,
        norm_elementwise_affine=norm_elementwise_affine,
        norm_eps=norm_eps,
        chunk_dim=1,
    )

    In the diffusers model, see AdaLayerNorm; in SAT, see class AdaLNMixin(BaseMixin).

  2. You only need to change the input dimension of the first linear layer to accommodate the extra channels; retraining is not necessary.
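
To spell out point 1 a little: the per-block norm is adaptive as well, it is just shared between the two streams. Conceptually, a single zero-initialized head predicts separate shift/scale/gate parameters for the vision tokens and the text tokens from the timestep embedding, which is what plays the role of the two expert AdaLNs in the paper. A minimal sketch of that idea (not the exact diffusers code; names are illustrative):

    import torch.nn as nn
    import torch.nn.functional as F

    class LayerNormZeroSketch(nn.Module):
        # One LayerNorm module, but the conditioning head predicts six chunks --
        # (shift, scale, gate) for the vision stream and (shift, scale, gate) for
        # the text stream -- so each stream is modulated by its own parameters.
        def __init__(self, conditioning_dim: int, embedding_dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(embedding_dim)
            self.linear = nn.Linear(conditioning_dim, 6 * embedding_dim)

        def forward(self, hidden_states, encoder_hidden_states, temb):
            shift, scale, gate, enc_shift, enc_scale, enc_gate = self.linear(F.silu(temb)).chunk(6, dim=1)
            hidden_states = self.norm(hidden_states) * (1 + scale)[:, None, :] + shift[:, None, :]
            encoder_hidden_states = self.norm(encoder_hidden_states) * (1 + enc_scale)[:, None, :] + enc_shift[:, None, :]
            return hidden_states, encoder_hidden_states, gate[:, None, :], enc_gate[:, None, :]

So even though the LayerNorm module itself is shared, the modulation applied to the vision tokens and the text tokens comes from separate parameters, which is the "expert" part.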
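
For point 2, a hedged sketch of what "change the input dimension of the first linear layer" can look like in practice: widen the input projection that consumes the patchified latents so it accepts the doubled channel count, copy the pretrained weights into the original slice, and zero-initialize the new slice so the rest of the transformer is untouched. Function names, variable names, and dimensions below are illustrative only:

    import torch
    import torch.nn as nn

    def widen_first_proj(old_proj: nn.Linear, extra_in_features: int) -> nn.Linear:
        # Accept the extra (concatenated image-latent) channels without touching
        # any other weights: keep the pretrained columns, zero-init the new ones,
        # so the widened model initially reproduces the text-to-video behaviour.
        new_proj = nn.Linear(
            old_proj.in_features + extra_in_features,
            old_proj.out_features,
            bias=old_proj.bias is not None,
        )
        with torch.no_grad():
            new_proj.weight.zero_()
            new_proj.weight[:, : old_proj.in_features].copy_(old_proj.weight)
            if old_proj.bias is not None:
                new_proj.bias.copy_(old_proj.bias)
        return new_proj

    # Hypothetical usage: a projection that took 16 latent channels (2x2 patches)
    # now accepts 32 channels.
    proj = widen_first_proj(nn.Linear(16 * 2 * 2, 3072), extra_in_features=16 * 2 * 2)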

😊

fatbao55 commented 2 weeks ago

Thanks for the clarification!