🧐 Problem Description
OLMoE disables normalization of the top-k routing probabilities, with no clear motivation or ablation for this choice. DeepSeekMoE also disables top-k normalization, while Mixtral-8x7B-v0.1 normalizes them.
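To illustrate the difference, here is a minimal, self-contained PyTorch example (not Fast-LLM or OLMoE code; the numbers are arbitrary) comparing the two conventions on the same router logits:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # router logits for 4 experts
k = 2

# Mixtral-style: select the top-k logits, then softmax over the selected ones,
# which is equivalent to renormalizing the top-k probabilities to sum to 1.
top_logits, top_idx = torch.topk(logits, k)
normalized = torch.softmax(top_logits, dim=-1)
print(normalized, normalized.sum())  # weights sum to 1.0

# OLMoE/DeepSeekMoE-style: softmax over all experts, then take the top-k
# probabilities as-is, without renormalization.
probs = torch.softmax(logits, dim=-1)
unnormalized, top_idx = torch.topk(probs, k)
print(unnormalized, unnormalized.sum())  # weights sum to less than 1.0
```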
💡 Proposed Solution
Apply softmax over all expert logits before the `torch.topk` call in https://github.com/ServiceNow/Fast-LLM/blob/51d57158d625883da189bcce3af3c8908e527824/fast_llm/layers/transformer/mixture_of_experts.py#L167 (see the sketch under Additional Context below).
🔄 Alternatives Considered
Keep normalizing the top-k scores as usual, since there is no clear motivation for disabling it. Notably, this behavior is config-driven in the HF implementation of OLMoE; a sketch of a similar config-driven switch follows.
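A minimal sketch of what a config-driven switch could look like. The class, field, and function names here are hypothetical, not existing Fast-LLM code; they only illustrate gating the renormalization on a flag, analogous to the config option in the HF OLMoE implementation:

```python
from dataclasses import dataclass

import torch


@dataclass
class RouterConfig:
    # Hypothetical names: neither this class nor the field below exists in
    # Fast-LLM; they only illustrate a config-driven switch.
    num_experts_per_token: int = 2
    normalize_topk_probs: bool = True


def route(logits: torch.Tensor, config: RouterConfig) -> tuple[torch.Tensor, torch.Tensor]:
    # Softmax over all expert logits first, then pick the top-k probabilities.
    probs = torch.softmax(logits, dim=-1)
    weights, selected_experts = torch.topk(probs, config.num_experts_per_token, dim=-1)
    if config.normalize_topk_probs:
        # Optional renormalization so the selected weights sum to 1 per token;
        # skipping it gives OLMoE-style unnormalized weights.
        weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, selected_experts
```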
📈 Potential Benefits
No clear benefits; if anything, training could slow down slightly, since the softmax would now be applied to the logits of all experts rather than only the selected top-k.
📝 Additional Context
See the OLMoE implementation for reference.
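As referenced in the Proposed Solution above, a rough before/after sketch of the reordering. This is not the actual Fast-LLM code at the linked line; it assumes the current code selects top-k on the raw logits and then applies softmax to the selected scores, and the tensor names and shapes are illustrative:

```python
import torch

logits = torch.randn(8, 64)  # (tokens, num_experts); illustrative shapes only
k = 8

# Assumed current behavior: top-k on the raw logits, then softmax over the
# selected scores, which normalizes the k routing weights to sum to 1.
scores, top_experts = torch.topk(logits, k, dim=-1)
weights = torch.softmax(scores, dim=-1)

# Proposed (OLMoE-style) behavior: softmax over all expert logits first, then
# top-k on the probabilities; the selected weights are used as-is and sum to
# less than 1 per token.
probs = torch.softmax(logits, dim=-1)
weights, top_experts = torch.topk(probs, k, dim=-1)
```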