NVIDIA / Megatron-LM


[BUG] MoE Router TopK algorithm is different from the Hugging Face implementation #955

Closed. Au3C2 closed this issue 3 months ago.

Au3C2 commented 3 months ago

Describe the bug

In Megatron's topk_softmax_with_capacity():

    scores, top_indices = torch.topk(logits, k=topk, dim=1)     # top-k first
    probs = torch.softmax(scores, dim=-1, dtype=torch.float32).type_as(logits)   # then softmax over only the k selected scores

In transformers' MixtralSparseMoeBlock:

    routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)   # softmax over all experts first
    routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)   # then top-k

These two algorithms produce different inference results.

Is there a reason why Megatron computes probs this way?
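
For illustration, here is a minimal, self-contained sketch (not taken from either codebase; the token/expert counts, seed, and variable names like megatron_probs and hf_weights are arbitrary) that runs both orderings on the same router logits. Because softmax is monotonic, both orderings select the same experts, but the routing weights differ: softmax over the top-k scores renormalizes within the selected experts, while softmax over all logits followed by top-k (as quoted above, without further normalization) keeps probability mass assigned to the unselected experts.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    logits = torch.randn(1, 8)   # 1 token, 8 experts (illustrative sizes)
    topk = 2

    # Megatron-style ordering: top-k first, then softmax over the k selected scores.
    scores, top_indices = torch.topk(logits, k=topk, dim=1)
    megatron_probs = torch.softmax(scores, dim=-1, dtype=torch.float32)

    # Mixtral-style ordering (as quoted): softmax over all logits, then top-k.
    routing_weights = F.softmax(logits, dim=1, dtype=torch.float)
    hf_weights, hf_experts = torch.topk(routing_weights, topk, dim=-1)

    print(top_indices, hf_experts)      # same experts are selected
    print(megatron_probs, hf_weights)   # but the weights fed to the experts differ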


kiddyboots216 commented 3 months ago

See https://github.com/NVIDIA/Megatron-LM/issues/894#issuecomment-2212765002