NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

Fix FLOPs calculation #995

Closed · janEbert closed 3 months ago

janEbert commented 3 months ago

The denominator of the embedding/logit FLOP term was wrong: it works out to 12 instead of the 16 in the original formula.

For reference, expanding the calculation for a model with simple MHA, no GLU, a factor-4 MLP expansion, and no MoE:

12 * gbs * seq_length * num_layers * hidden_size^2
* (
    (1 + 1 + (seq_length / hidden_size)) * 1
    + (4 * 1 * 1)
    + (vocab_size / (2 * num_layers * hidden_size))
)

= 12 * gbs * seq_length * num_layers * hidden_size^2
* (
    (2 + (seq_length / hidden_size))
    + 4
    + (vocab_size / (2 * num_layers * hidden_size))
)

= 12 * gbs * seq_length * num_layers * hidden_size^2
* (
    6
    + (seq_length / hidden_size)
    + (vocab_size / (2 * num_layers * hidden_size))
)

= 24 * gbs * seq_length * num_layers * hidden_size^2
* (
    3
    + (seq_length / (2 * hidden_size))
    + (vocab_size / (4 * num_layers * hidden_size))
)

= 72 * gbs * seq_length * num_layers * hidden_size^2
* (
    1
    + (seq_length / (6 * hidden_size))
    + (vocab_size / (12 * num_layers * hidden_size))
)
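
For reference, a quick numerical check in Python that the starting form (pre-factor 12) and the fully factored form (pre-factor 72) above are the same expression; the dimensions are arbitrary placeholders, not values from any particular config:

import math

# Arbitrary placeholder dimensions, only used to exercise the algebra.
gbs, seq_length, num_layers, hidden_size, vocab_size = 8, 2048, 24, 2048, 50257

# Starting form: pre-factor 12 (simple MHA, no GLU, factor-4 MLP, no MoE).
original = (
    12 * gbs * seq_length * num_layers * hidden_size**2
    * (
        (1 + 1 + seq_length / hidden_size) * 1            # attention term
        + 4 * 1 * 1                                        # MLP term
        + vocab_size / (2 * num_layers * hidden_size)      # embedding/logit term
    )
)

# Fully factored form: pre-factor 72.
factored = (
    72 * gbs * seq_length * num_layers * hidden_size**2
    * (
        1
        + seq_length / (6 * hidden_size)
        + vocab_size / (12 * num_layers * hidden_size)
    )
)

assert math.isclose(original, factored)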

However, the last formula in the expansion should be

72 * gbs * seq_length * num_layers * hidden_size^2
* (
    1
    + (seq_length / (6 * hidden_size))
    + (vocab_size / (16 * num_layers * hidden_size))
)

(i.e., with 16 instead of 12 in the final term) to be consistent with the formula in the Megatron-LM PTD-P paper. This commit achieves that by adding 2/3 to the denominator coefficient of the embedding/logit term, turning the 2 above into 8/3, which factors out to 16.
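
For scale, a small Python sketch (placeholder dimensions again) of what the proposed change does to the embedding/logit term: moving from 12 to 16 in the denominator under the 72 pre-factor scales that term by 12/16 = 3/4.

gbs, seq_length, num_layers, hidden_size, vocab_size = 8, 2048, 24, 2048, 50257

common = 72 * gbs * seq_length * num_layers * hidden_size**2

logit_denom_12 = common * vocab_size / (12 * num_layers * hidden_size)  # as derived above
logit_denom_16 = common * vocab_size / (16 * num_layers * hidden_size)  # as proposed here

print(logit_denom_16 / logit_denom_12)  # 0.75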

janEbert commented 3 months ago

I'm wrong; the denominator is correct because of the pre-factor 72.
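
For completeness, the arithmetic behind that: the PTD-P paper's formula (as I read it) uses a 96 pre-factor with vocab_size / (16 * num_layers * hidden_size), and 96 / 16 = 72 / 12 = 6, so both versions yield the same embedding/logit FLOPs. A quick Python check with placeholder dimensions:

gbs, seq_length, num_layers, hidden_size, vocab_size = 8, 2048, 24, 2048, 50257

# Embedding/logit term as derived above: pre-factor 72, denominator 12.
logit_here = (
    72 * gbs * seq_length * num_layers * hidden_size**2
    * vocab_size / (12 * num_layers * hidden_size)
)

# Embedding/logit term as written in the PTD-P paper (as I read it):
# pre-factor 96, denominator 16.
logit_paper = (
    96 * gbs * seq_length * num_layers * hidden_size**2
    * vocab_size / (16 * num_layers * hidden_size)
)

# Both collapse to 6 * gbs * seq_length * hidden_size * vocab_size.
assert logit_here == logit_paper == 6 * gbs * seq_length * hidden_size * vocab_size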