nemonameless opened 8 months ago
Large-DiT-3B follows the naming practice of LLaMA-3B. As a diffusion model, it adds AdaLN-Zero, which dynamically predicts the scale/shift/gate (bias/norm) values used to modulate the diffusion backbone. This increases the actual parameter count to 4.2 billion.
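For context, a minimal sketch of where those extra parameters come from: each DiT-style block carries a small MLP that maps the timestep/condition embedding to six modulation vectors, and its output layer is zero-initialized (the "Zero" in AdaLN-Zero). The dimensions below are illustrative assumptions, not the repository's exact config.

```python
import torch
import torch.nn as nn

class AdaLNZeroModulation(nn.Module):
    """Minimal AdaLN-Zero sketch: a per-block MLP predicts shift/scale/gate
    for the attention and MLP sub-layers (6 * hidden_size values), so each
    block adds roughly 6 * hidden_size^2 parameters on top of the backbone."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.modulation = nn.Sequential(
            nn.SiLU(),
            nn.Linear(hidden_size, 6 * hidden_size, bias=True),
        )
        # Zero-init so each block starts as identity (the "Zero" in AdaLN-Zero).
        nn.init.zeros_(self.modulation[1].weight)
        nn.init.zeros_(self.modulation[1].bias)

    def forward(self, cond: torch.Tensor):
        # cond: (batch, hidden_size) timestep/class embedding
        return self.modulation(cond).chunk(6, dim=-1)

# Back-of-envelope parameter accounting (illustrative dims, not the repo's config):
hidden_size, num_layers = 2560, 32
extra = num_layers * (6 * hidden_size * hidden_size + 6 * hidden_size)
print(f"extra AdaLN-Zero params: {extra / 1e9:.2f}B")  # ~1.26B for these dims
```

With dims in that ballpark, the modulation MLPs alone account for roughly the gap between a ~3B backbone and the reported 4.2B total.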
Best Wishes
Then why did the 7B model only increase to 7.2B?
And is there a standard LLaMA configuration for 3B?
@gaopengpjlab
We will adjust our naming practice to reflect the true parameter counts of our models in the future. Thanks for your suggestion.
And LargeDiT-T2I 3B actually prints 5B parameters...
@nemonameless The key-query weights of the zero-initialized attention module contribute an extra ~1B parameters. We will clarify this in the future. Thanks for your timely feedback.
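A rough sketch of how those extra key-query weights add up, assuming additional query/key/value projections per layer for a zero-initialized (gated) cross-attention over the text condition; the class name, dims, and layer count below are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

class ZeroInitCrossAttention(nn.Module):
    """Illustrative zero-init attention: extra query/key/value projections
    for the conditioning tokens, with a zero-initialized gate so training
    starts from the unconditioned backbone. The added q/k weights are the
    main source of extra parameters."""
    def __init__(self, hidden_size: int, cond_size: int):
        super().__init__()
        self.to_q = nn.Linear(hidden_size, hidden_size, bias=False)  # extra weights
        self.to_k = nn.Linear(cond_size, hidden_size, bias=False)    # extra weights
        self.to_v = nn.Linear(cond_size, hidden_size, bias=False)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init gate

    def forward(self, x: torch.Tensor, cond: torch.Tensor):
        q, k, v = self.to_q(x), self.to_k(cond), self.to_v(cond)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return x + torch.tanh(self.gate) * (attn @ v)

# Back-of-envelope: with these illustrative dims, the per-layer q/k weights
# alone sum to roughly 1B parameters across the stack.
hidden_size = cond_size = 4096
num_layers = 32
qk_params = num_layers * 2 * hidden_size * hidden_size
print(f"extra q/k params: {qk_params / 1e9:.2f}B")  # ~1.07B for these dims
```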
And the 7B model actually prints 7.2B parameters? @ChrisLiu6