nemonameless opened 8 months ago
Large-DiT-3B follows the naming practice of LLaMA-3B. As a diffusion model, it adds AdaLN-Zero, which dynamically predicts the scale/shift/gate (bias/norm) values used to modulate the diffusion backbone. This increases the actual parameter count to 4.2 billion.
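For context, a minimal sketch of where those extra parameters come from: each DiT-style block carries a small MLP that maps the timestep/condition embedding to six modulation vectors, and its output layer is zero-initialized (the "Zero" in AdaLN-Zero). The dimensions below are illustrative assumptions, not the repository's exact config.

```python
import torch
import torch.nn as nn

class AdaLNZeroModulation(nn.Module):
    """Minimal AdaLN-Zero sketch: a per-block MLP predicts shift/scale/gate
    for the attention and MLP sub-layers (6 * hidden_size values), so each
    block adds roughly 6 * hidden_size^2 parameters on top of the backbone."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.modulation = nn.Sequential(
            nn.SiLU(),
            nn.Linear(hidden_size, 6 * hidden_size, bias=True),
        )
        # Zero-init so each block starts as identity (the "Zero" in AdaLN-Zero).
        nn.init.zeros_(self.modulation[1].weight)
        nn.init.zeros_(self.modulation[1].bias)

    def forward(self, cond: torch.Tensor):
        # cond: (batch, hidden_size) timestep/class embedding
        return self.modulation(cond).chunk(6, dim=-1)

# Back-of-envelope parameter accounting (illustrative dims, not the repo's config):
hidden_size, num_layers = 2560, 32
extra = num_layers * (6 * hidden_size * hidden_size + 6 * hidden_size)
print(f"extra AdaLN-Zero params: {extra / 1e9:.2f}B")  # ~1.26B for these dims
```

With dims in that ballpark, the modulation MLPs alone account for roughly the gap between a ~3B backbone and the reported 4.2B total.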
Best Wishes
Then why did the 7B model only increase to 7.2B?
And is there a standard LLaMA configuration for 3B?
@gaopengpjlab
We will adjust our naming practice to reflect the true parameter counts of our models in the future. Thanks for your suggestion.
And LargeDiT-T2I 3B actually prints 5B parameters...
@nemonameless The key-query weights of the zero-initialized attention module contribute an extra ~1B parameters. We will clarify this in the future. Thanks for your timely feedback.
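A rough sketch of how those extra key-query weights add up, assuming additional query/key/value projections per layer for a zero-initialized (gated) cross-attention over the text condition; the class name, dims, and layer count below are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

class ZeroInitCrossAttention(nn.Module):
    """Illustrative zero-init attention: extra query/key/value projections
    for the conditioning tokens, with a zero-initialized gate so training
    starts from the unconditioned backbone. The added q/k weights are the
    main source of extra parameters."""
    def __init__(self, hidden_size: int, cond_size: int):
        super().__init__()
        self.to_q = nn.Linear(hidden_size, hidden_size, bias=False)  # extra weights
        self.to_k = nn.Linear(cond_size, hidden_size, bias=False)    # extra weights
        self.to_v = nn.Linear(cond_size, hidden_size, bias=False)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init gate

    def forward(self, x: torch.Tensor, cond: torch.Tensor):
        q, k, v = self.to_q(x), self.to_k(cond), self.to_v(cond)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return x + torch.tanh(self.gate) * (attn @ v)

# Back-of-envelope: with these illustrative dims, the per-layer q/k weights
# alone sum to roughly 1B parameters across the stack.
hidden_size = cond_size = 4096
num_layers = 32
qk_params = num_layers * 2 * hidden_size * hidden_size
print(f"extra q/k params: {qk_params / 1e9:.2f}B")  # ~1.07B for these dims
```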
And the 7B model actually prints 7.2B parameters? @ChrisLiu6