Our `mlp_hidden_size` is multiplied by a constant that depends on the activation function: our SwiGLU activation uses 0.5, so the hidden size is halved. As a result, the MLP ratio is 22016 / 4096 / 2 = 2.6875 ≈ 8/3.
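For concreteness, here is a minimal sketch of the arithmetic above (not OLMo source code; the 0.5 factor is the SwiGLU halving described in this thread):

```python
# Worked example: recovering the reported ~8/3 MLP ratio from the
# OLMo-7B.yaml values, given that SwiGLU halves the effective hidden size.
d_model = 4096
mlp_hidden_size = 22016  # from OLMo-7B.yaml

swiglu_factor = 0.5  # SwiGLU splits the projection into gate and value halves
effective_hidden = mlp_hidden_size * swiglu_factor  # 11008.0
mlp_ratio = effective_hidden / d_model

print(mlp_ratio)  # 2.6875, i.e. roughly 8/3 (~2.667)
```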
Thanks for the clarification!
Hi,

In the paper, an `mlp_ratio` of ~8/3 is reported for the OLMo 7B model. However, in the configuration file, `d_model` is listed as 4096 and `mlp_hidden_size` as 22016. This yields an MLP ratio of 22016 / 4096 = 5.375, which differs significantly from the reported 8/3 (approximately 2.67).

Here is the relevant section of the configuration file: https://github.com/allenai/OLMo/blob/ddc884712e991608b69f7f6c04f464d5304f19d3/configs/official/OLMo-7B.yaml#L10-L14
Additionally, it is mentioned here that `mlp_hidden_size = mlp_ratio * d_model`: https://github.com/allenai/OLMo/blob/ddc884712e991608b69f7f6c04f464d5304f19d3/olmo/config.py#L262-L271
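For illustration, a hedged sketch of that fallback rule as I read the linked code (the function name and signature here are hypothetical, not OLMo's actual API):

```python
from typing import Optional

def resolve_mlp_hidden_size(d_model: int,
                            mlp_ratio: int,
                            mlp_hidden_size: Optional[int] = None) -> int:
    """Hypothetical helper mirroring the rule linked above: use
    mlp_hidden_size when it is set, else derive it as mlp_ratio * d_model."""
    if mlp_hidden_size is not None:
        return mlp_hidden_size
    return mlp_ratio * d_model

# OLMo-7B.yaml sets mlp_hidden_size explicitly, so mlp_ratio is not used:
print(resolve_mlp_hidden_size(d_model=4096, mlp_ratio=4,
                              mlp_hidden_size=22016))  # 22016
```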