Questions about implementation

fatbao55 commented 3 weeks ago

Hi authors,

Thanks for the great work! I have a few questions regarding the implementation:

In the "Expert Transformer Block" section, you mentioned that Vision Expert Adaptive Layernorm (Vision Expert AdaLN) and Text Expert Adaptive Layernorm (Text Expert AdaLN) apply the modulation mechanism to the vision hidden states and text hidden states, respectively. However, it does not seem like there are separate adaptive layer norms for text and vision hidden states as described in the paper ("Vision Expert AdaLN" and "Text Expert AdaLN"). Instead, the code seems to use a unified layer norm (CogVideoXLayerNormZero) for both. Additionally, it appears to be a normal layer norm rather than an adaptive layer norm. I might be mistaken, but could you please point me to the part of the code where the text and vision adaptive layer norms are implemented?
For the image-to-video implementation, the latent channel would have been doubled. Do you also double the number of channels in the transformer block to accommodate this? And does this mean the text-to-video transformer blocks have to be retrained?

Hope to get your advice on these. Thanks!

zRzRzRzRzRzRzR commented 2 weeks ago

😊

fatbao55 commented 2 weeks ago

Thanks for the clarification!

THUDM / CogVideo