Closed · janEbert closed this 3 months ago
The denominator of the embedding/logit FLOP calculation was wrong, coming out as 12 instead of the 16 in the original formula.
For reference, expanding the calculation for a model with simple MHA, no GLU, a factor-4 MLP expansion, and no MoE:
12 * gbs * seq_len * num_layers * hidden_size^2 * ( (1 + 1 + (seq_len / hidden_size)) * 1 + (4 * 1 * 1) + (vocab_size / (2 * num_layers * hidden_size)) )
= 12 * gbs * seq_len * num_layers * hidden_size^2 * ( (2 + (seq_len / hidden_size)) + 4 + (vocab_size / (2 * num_layers * hidden_size)) )
= 12 * gbs * seq_len * num_layers * hidden_size^2 * ( 6 + (seq_len / hidden_size) + (vocab_size / (2 * num_layers * hidden_size)) )
= 24 * gbs * seq_len * num_layers * hidden_size^2 * ( 3 + (seq_len / (2 * hidden_size)) + (vocab_size / (4 * num_layers * hidden_size)) )
= 72 * gbs * seq_len * num_layers * hidden_size^2 * ( 1 + (seq_len / (6 * hidden_size)) + (vocab_size / (12 * num_layers * hidden_size)) )
However, the last formula should be
72 * gbs * seq_len * num_layers * hidden_size^2 * ( 1 + (seq_len / (6 * hidden_size)) + (vocab_size / (16 * num_layers * hidden_size)) )
(i.e., with 16 instead of 12 in the final line) to be consistent with the formula in the Megatron-LM PTD-P paper. This commit achieves that by adding 2/3 to the embedding/logit calculation's denominator, turning the factor 2 into 8/3 so that the fully factored form ends up with 16 instead of 12.
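As a quick numeric sanity check of the expansion above (a minimal, standalone sketch, not the repository's actual FLOP-counting code; the model sizes are arbitrary illustrative values):

```python
# Check the step-by-step factoring above with concrete (arbitrary) sizes.
gbs, seq_len, num_layers, hidden_size, vocab_size = 8, 4096, 32, 4096, 32000

base = gbs * seq_len * num_layers * hidden_size**2

# Original form with pre-factor 12 (simple MHA, no GLU, 4x MLP, no MoE).
original = 12 * base * (
    (1 + 1 + seq_len / hidden_size) * 1               # attention
    + 4 * 1 * 1                                       # MLP
    + vocab_size / (2 * num_layers * hidden_size)     # embedding/logit
)

# Fully factored form with pre-factor 72: the embedding/logit denominator
# that falls out of the algebra is 12.
factored_12 = 72 * base * (
    1
    + seq_len / (6 * hidden_size)
    + vocab_size / (12 * num_layers * hidden_size)
)

# Same factored form, but with the denominator 16 proposed above.
factored_16 = 72 * base * (
    1
    + seq_len / (6 * hidden_size)
    + vocab_size / (16 * num_layers * hidden_size)
)

assert abs(original - factored_12) / original < 1e-12  # algebra is consistent
assert factored_16 != original                         # 16 changes the total here
```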
I'm wrong; the denominator is correct because of the pre-factor 72. The paper's formula carries a pre-factor of 96 (it includes activation recomputation), and 96 / 16 = 72 / 12 = 6, so both forms assign the same 6 * gbs * seq_len * hidden_size * vocab_size to the embedding/logit term.
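A minimal numeric check of that resolution (the paper's pre-factor of 96 and the sizes below are illustrative assumptions on my part, not values taken from the repository or quoted from the paper):

```python
# Both formula variants assign 6 * gbs * seq_len * hidden_size * vocab_size
# to the embedding/logit matmuls; only the pre-factor/denominator pairing differs.
gbs, seq_len, num_layers, hidden_size, vocab_size = 8, 4096, 32, 4096, 32000

base = gbs * seq_len * num_layers * hidden_size**2

# This repository's form: pre-factor 72, embedding/logit denominator 12.
emb_here = 72 * base * vocab_size / (12 * num_layers * hidden_size)

# PTD-P-style form: pre-factor 96 (includes activation recomputation),
# embedding/logit denominator 16.
emb_paper = 96 * base * vocab_size / (16 * num_layers * hidden_size)

expected = 6 * gbs * seq_len * hidden_size * vocab_size
assert emb_here == emb_paper == expected
```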