databricks / dbrx

Code examples and resources for DBRX, a large language model developed by Databricks
https://www.databricks.com/

SiLU or GLU activation? #14

Closed jcao-ai closed 3 months ago

jcao-ai commented 3 months ago

According to the model card on Hugging Face:

DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA).

However, when I run

from transformers import AutoConfig

config = AutoConfig.from_pretrained('/models/dbrx-instruct/')
print(config.ffn_config)

It shows:

DbrxFFNConfig {
  "ffn_act_fn": {
    "name": "silu"
  },
  "ffn_hidden_size": 10752,
  "moe_jitter_eps": 0,
  "moe_loss_weight": 0.05,
  "moe_normalize_expert_weights": 1,
  "moe_num_experts": 16,
  "moe_top_k": 4,
  "transformers_version": "4.38.1",
  "uniform_expert_assignment": false
}

This is somewhat misleading and confusing.

megha95 commented 3 months ago

Hi @jcao-ai, SiLU is the activation function used inside the GLU. GLU (Gated Linear Units) describes the FFN structure. You can read more about this in the paper. Hope this helps.
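
For illustration, here is a minimal sketch of a GLU-style FFN block that uses SiLU as the gate activation (often called SwiGLU). The module and parameter names are made up for this example and hidden_size=6144 is an assumption; only ffn_hidden_size=10752 is taken from the config printed above. This is not DBRX's actual implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUFeedForward(nn.Module):
    """Gated linear unit FFN: the gate branch is activated with SiLU (ffn_act_fn = "silu")."""

    def __init__(self, hidden_size: int, ffn_hidden_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, ffn_hidden_size, bias=False)  # gate branch
        self.up_proj = nn.Linear(hidden_size, ffn_hidden_size, bias=False)    # value branch
        self.down_proj = nn.Linear(ffn_hidden_size, hidden_size, bias=False)  # back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GLU structure: elementwise product of a SiLU-activated gate and a linear value,
        # then projected back down to the model dimension
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Usage: hidden_size is hypothetical; ffn_hidden_size matches the config above
ffn = GLUFeedForward(hidden_size=6144, ffn_hidden_size=10752)
out = ffn(torch.randn(1, 8, 6144))

So "silu" in ffn_act_fn names the activation applied to the gate, while the gated two-branch structure itself is what the model card calls GLU.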