NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

Fix FLOPs calculation #995

Closed · janEbert closed 3 months ago

janEbert commented 3 months ago

The denominator of the embedding/logit FLOP term was wrong: it works out to 12 instead of the 16 in the original formula.

For reference, expanding the calculation for a model with simple MHA, no GLU, a factor-4 MLP expansion, and no MoE:

12 * gbs * seq_length * num_layers * hidden_size^2
* (
    (1 + 1 + (seq_length / hidden_size)) * 1
    + (4 * 1 * 1)
    + (vocab_size / (2 * num_layers * hidden_size))
)

= 12 * gbs * seq_length * num_layers * hidden_size^2
* (
    (2 + (seq_length / hidden_size))
    + 4
    + (vocab_size / (2 * num_layers * hidden_size))
)

= 12 * gbs * seq_length * num_layers * hidden_size^2
* (
    6
    + (seq_length / hidden_size)
    + (vocab_size / (2 * num_layers * hidden_size))
)

= 24 * gbs * seq_length * num_layers * hidden_size^2
* (
    3
    + (seq_length / (2 * hidden_size))
    + (vocab_size / (4 * num_layers * hidden_size))
)

= 72 * gbs * seq_length * num_layers * hidden_size^2
* (
    1
    + (seq_length / (6 * hidden_size))
    + (vocab_size / (12 * num_layers * hidden_size))
)
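
For reference, a quick numerical check in Python that the starting form (pre-factor 12) and the fully factored form (pre-factor 72) above are the same expression; the dimensions are arbitrary placeholders, not values from any particular config:

import math

# Arbitrary placeholder dimensions, only used to exercise the algebra.
gbs, seq_length, num_layers, hidden_size, vocab_size = 8, 2048, 24, 2048, 50257

# Starting form: pre-factor 12 (simple MHA, no GLU, factor-4 MLP, no MoE).
original = (
    12 * gbs * seq_length * num_layers * hidden_size**2
    * (
        (1 + 1 + seq_length / hidden_size) * 1            # attention term
        + 4 * 1 * 1                                        # MLP term
        + vocab_size / (2 * num_layers * hidden_size)      # embedding/logit term
    )
)

# Fully factored form: pre-factor 72.
factored = (
    72 * gbs * seq_length * num_layers * hidden_size**2
    * (
        1
        + seq_length / (6 * hidden_size)
        + vocab_size / (12 * num_layers * hidden_size)
    )
)

assert math.isclose(original, factored)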

However, the last formula in the expansion should be

72 * gbs * seq_length * num_layers * hidden_size^2
* (
    1
    + (seq_length / (6 * hidden_size))
    + (vocab_size / (16 * num_layers * hidden_size))
)

(i.e., with 16 instead of 12 in the final term) to be consistent with the formula in the Megatron-LM PTD-P paper. This commit achieves that by adding 2/3 to the denominator coefficient of the embedding/logit term, turning the 2 above into 8/3, which factors out to 16.
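
For scale, a small Python sketch (placeholder dimensions again) of what the proposed change does to the embedding/logit term: moving from 12 to 16 in the denominator under the 72 pre-factor scales that term by 12/16 = 3/4.

gbs, seq_length, num_layers, hidden_size, vocab_size = 8, 2048, 24, 2048, 50257

common = 72 * gbs * seq_length * num_layers * hidden_size**2

logit_denom_12 = common * vocab_size / (12 * num_layers * hidden_size)  # as derived above
logit_denom_16 = common * vocab_size / (16 * num_layers * hidden_size)  # as proposed here

print(logit_denom_16 / logit_denom_12)  # 0.75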

janEbert commented 3 months ago

I'm wrong; the denominator is correct because of the pre-factor 72.
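
For completeness, the arithmetic behind that: the PTD-P paper's formula (as I read it) uses a 96 pre-factor with vocab_size / (16 * num_layers * hidden_size), and 96 / 16 = 72 / 12 = 6, so both versions yield the same embedding/logit FLOPs. A quick Python check with placeholder dimensions:

gbs, seq_length, num_layers, hidden_size, vocab_size = 8, 2048, 24, 2048, 50257

# Embedding/logit term as derived above: pre-factor 72, denominator 12.
logit_here = (
    72 * gbs * seq_length * num_layers * hidden_size**2
    * vocab_size / (12 * num_layers * hidden_size)
)

# Embedding/logit term as written in the PTD-P paper (as I read it):
# pre-factor 96, denominator 16.
logit_paper = (
    96 * gbs * seq_length * num_layers * hidden_size**2
    * vocab_size / (16 * num_layers * hidden_size)
)

# Both collapse to 6 * gbs * seq_length * hidden_size * vocab_size.
assert logit_here == logit_paper == 6 * gbs * seq_length * hidden_size * vocab_size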