**Describe the bug**
In Megatron's `topk_softmax_with_capacity()`:

```python
scores, top_indices = torch.topk(logits, k=topk, dim=1)  # topk first
probs = torch.softmax(scores, dim=-1, dtype=torch.float32).type_as(logits)
```

In transformers' Mixtral `MixtralSparseMoeBlock`:

```python
routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
```
These two orderings yield different routing weights: Megatron normalizes over only the top-k selected logits, while Mixtral computes the softmax over all experts before selecting the top-k, so the two implementations produce different inference results. Is there a reason why Megatron computes the probs this way?
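For reference, a minimal sketch illustrating how the two orderings diverge on the same router output. The logits values, `top_k`, and variable names here are made up for illustration, not taken from either codebase:

```python
import torch
import torch.nn.functional as F

# Hypothetical router output: one token, four experts (values made up).
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])
top_k = 2

# Megatron order: select the top-k logits first, then softmax over just those.
scores, top_indices = torch.topk(logits, k=top_k, dim=1)
probs_megatron = torch.softmax(scores, dim=-1, dtype=torch.float32)

# Mixtral order: softmax over all experts first, then select the top-k.
routing_weights = F.softmax(logits, dim=1, dtype=torch.float)
probs_mixtral, selected_experts = torch.topk(routing_weights, top_k, dim=-1)

print(probs_megatron)  # tensor([[0.7311, 0.2689]]) -- normalized over the 2 selected experts
print(probs_mixtral)   # tensor([[0.6095, 0.2242]]) -- mass also went to the unselected experts
```

Both orderings pick the same experts here, but the weights differ, which changes the weighted combination of expert outputs downstream.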
See https://github.com/NVIDIA/Megatron-LM/issues/894#issuecomment-2212765002