Closed — fatbao55 closed this 2 weeks ago
```python
self.norm_out = AdaLayerNorm(
    embedding_dim=time_embed_dim,
    output_dim=2 * inner_dim,
    norm_elementwise_affine=norm_elementwise_affine,
    norm_eps=norm_eps,
    chunk_dim=1,
)
```
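For context, the modulation such a norm performs can be sketched as follows. This is a simplified illustration, not the exact diffusers `AdaLayerNorm`; the dimension names and shapes are assumptions:

```python
import torch
import torch.nn as nn

class AdaLayerNormSketch(nn.Module):
    """Minimal sketch of an adaptive LayerNorm: a linear projection of the
    time embedding produces per-sample shift and scale (output_dim = 2 *
    inner_dim), which modulate the normalized hidden states."""
    def __init__(self, embedding_dim, inner_dim):
        super().__init__()
        self.linear = nn.Linear(embedding_dim, 2 * inner_dim)
        self.norm = nn.LayerNorm(inner_dim, elementwise_affine=False, eps=1e-5)

    def forward(self, x, temb):
        # chunk along dim 1, matching chunk_dim=1 in the snippet above
        shift, scale = self.linear(temb).chunk(2, dim=1)
        return self.norm(x) * (1 + scale[:, None, :]) + shift[:, None, :]

x = torch.randn(2, 8, 64)    # (batch, tokens, inner_dim)
temb = torch.randn(2, 128)   # (batch, time_embed_dim)
out = AdaLayerNormSketch(128, 64)(x, temb)
print(out.shape)             # torch.Size([2, 8, 64])
```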
You can check `AdaLayerNorm` in the diffusers implementation; for SAT, the corresponding module is `class AdaLNMixin(BaseMixin)`.
2. You only need to change the input dimension of the first linear layer to achieve this functionality; retraining is not necessary.
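One common way to widen a layer's input dimension while reusing pretrained weights is to copy the old weights into the matching columns and zero-initialize the new ones, so the widened layer initially behaves exactly like the original. A minimal sketch (the dimensions here are illustrative, not CogVideoX's actual channel counts):

```python
import torch
import torch.nn as nn

# Sketch: widen the first projection's input channels (e.g. 16 -> 32 when a
# conditioning latent is concatenated for image-to-video). Pretrained weights
# go into the first half; the new half is zero-initialized.
old_proj = nn.Linear(16, 64)   # stands in for the pretrained first layer
new_proj = nn.Linear(32, 64)
with torch.no_grad():
    new_proj.weight.zero_()
    new_proj.weight[:, :16] = old_proj.weight  # reuse pretrained weights
    new_proj.bias.copy_(old_proj.bias)

x_old = torch.randn(2, 16)
x_new = torch.cat([x_old, torch.randn(2, 16)], dim=1)
# With zero-initialized extra columns, the widened layer reproduces the
# original layer's output on the old channels.
print(torch.allclose(new_proj(x_new), old_proj(x_old)))  # True
```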
😊
Thanks for the clarification!
Hi authors,
Thanks for the great work! I have a few questions regarding the implementation:
In the "Expert Transformer Block" section, you mentioned that Vision Expert Adaptive Layernorm (Vision Expert AdaLN) and Text Expert Adaptive Layernorm (Text Expert AdaLN) apply the modulation mechanism to the vision hidden states and text hidden states, respectively. However, it does not seem like there are separate adaptive layer norms for text and vision hidden states as described in the paper ("Vision Expert AdaLN" and "Text Expert AdaLN"). Instead, the code seems to use a unified layer norm (CogVideoXLayerNormZero) for both. Additionally, it appears to be a normal layer norm rather than an adaptive layer norm. I might be mistaken, but could you please point me to the part of the code where the text and vision adaptive layer norms are implemented?
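For reference, a "LayerNorm-Zero"-style conditioned norm of the kind described can produce separate shift/scale/gate parameters for the text and vision streams from one conditioning embedding, so a single module can still be adaptive for both. A hedged sketch (names and shapes are assumptions, not the actual `CogVideoXLayerNormZero` code):

```python
import torch
import torch.nn as nn

class LayerNormZeroSketch(nn.Module):
    """Sketch of a LayerNorm-Zero-style block: one SiLU + Linear on the
    conditioning embedding yields shift/scale/gate for the vision hidden
    states and a separate shift/scale/gate for the text hidden states,
    making the norm adaptive even though one module handles both streams."""
    def __init__(self, cond_dim, dim):
        super().__init__()
        self.silu = nn.SiLU()
        self.linear = nn.Linear(cond_dim, 6 * dim)  # 3 params x 2 streams
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)

    def forward(self, hidden, enc_hidden, temb):
        shift, scale, gate, enc_shift, enc_scale, enc_gate = (
            self.silu(self.linear(temb)).chunk(6, dim=1)
        )
        hidden = self.norm(hidden) * (1 + scale[:, None]) + shift[:, None]
        enc_hidden = self.norm(enc_hidden) * (1 + enc_scale[:, None]) + enc_shift[:, None]
        return hidden, enc_hidden, gate[:, None], enc_gate[:, None]

norm = LayerNormZeroSketch(128, 64)
h, enc_h, gate, enc_gate = norm(
    torch.randn(2, 10, 64),  # vision hidden states
    torch.randn(2, 5, 64),   # text hidden states
    torch.randn(2, 128),     # conditioning embedding
)
print(h.shape, enc_h.shape)  # torch.Size([2, 10, 64]) torch.Size([2, 5, 64])
```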
For the image-to-video implementation, the latent channel would have been doubled. Do you also double the number of channels in the transformer block to accommodate this? And does this mean the text-to-video transformer blocks have to be retrained?
Hope to get your advice on these. Thanks!