NVIDIA / Megatron-LM


[BUG] MoE Router TopK algorithm is different from the Hugging Face implementation #955

Closed. Au3C2 closed this issue 3 months ago.

Au3C2 commented 3 months ago

Describe the bug

In Megatron's topk_softmax_with_capacity():

    scores, top_indices = torch.topk(logits, k=topk, dim=1)     # top-k first
    probs = torch.softmax(scores, dim=-1, dtype=torch.float32).type_as(logits)   # then softmax over only the k selected scores

In transformers' MixtralSparseMoeBlock:

    routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)   # softmax over all experts first
    routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)   # then top-k

These two algorithms produce different inference results.

Is there a reason why Megatron computes probs this way?
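
For illustration, here is a minimal, self-contained sketch (not taken from either codebase; the token/expert counts, seed, and variable names like megatron_probs and hf_weights are arbitrary) that runs both orderings on the same router logits. Because softmax is monotonic, both orderings select the same experts, but the routing weights differ: softmax over the top-k scores renormalizes within the selected experts, while softmax over all logits followed by top-k (as quoted above, without further normalization) keeps probability mass assigned to the unselected experts.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    logits = torch.randn(1, 8)   # 1 token, 8 experts (illustrative sizes)
    topk = 2

    # Megatron-style ordering: top-k first, then softmax over the k selected scores.
    scores, top_indices = torch.topk(logits, k=topk, dim=1)
    megatron_probs = torch.softmax(scores, dim=-1, dtype=torch.float32)

    # Mixtral-style ordering (as quoted): softmax over all logits, then top-k.
    routing_weights = F.softmax(logits, dim=1, dtype=torch.float)
    hf_weights, hf_experts = torch.topk(routing_weights, topk, dim=-1)

    print(top_indices, hf_experts)      # same experts are selected
    print(megatron_probs, hf_weights)   # but the weights fed to the experts differ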


kiddyboots216 commented 3 months ago

See https://github.com/NVIDIA/Megatron-LM/issues/894#issuecomment-2212765002