Closed · jingjie01ai closed this 3 days ago
Please refer to https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/models/llama/convert.py#L862 and https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/layers/moe.py#L344
We concatenate the weights in w3, w1 order rather than w1, w3, which is why we implement swiglu as below:
def swiglu(input: Tensor):
    x, gate = chunk(input, 2, dim=-1)  # x stands for w3, gate stands for w1
    return silu(gate) * x
You may also refer to the Mixtral modeling definition.
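To see why the two orderings produce identical results once the concatenation order matches, here is a minimal sketch (assuming torch; the sizes and weight names are illustrative, not the converter's actual variables):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden, inter = 8, 16                      # illustrative sizes
x = torch.randn(1, hidden)
w1 = torch.randn(inter, hidden)            # gate projection
w3 = torch.randn(inter, hidden)            # up projection

# Reference SwiGLU: silu(w1 x) * (w3 x)
ref = F.silu(x @ w1.T) * (x @ w3.T)

# TensorRT-LLM convention: fuse the projections in w3, w1 order,
# then take gate from the second chunk, as in the swiglu above.
fused = torch.cat([w3, w1], dim=0)
up, gate = torch.chunk(x @ fused.T, 2, dim=-1)
out = F.silu(gate) * up

assert torch.allclose(ref, out, atol=1e-6)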
OK, got it.
System Info
Hopper GPU inference
Who can help?
@byshiue
https://github.com/NVIDIA/TensorRT-LLM/blob/9691e12bce7ae1c126c435a049eb516eb119486c/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu#L998
https://github.com/vllm-project/vllm/blob/main/csrc/activation_kernels.cu#L75
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/moe/experts.py#L48
The SwiGLU implementation in TensorRT-LLM differs from the mainstream one. In frameworks such as Megatron and Hugging Face for training, and vLLM for inference, the activation pattern is SiLU(x) * y, whereas in TensorRT-LLM it is x * SiLU(y). This difference is non-trivial and can lead to abnormal results that are difficult to diagnose.
What the functions look like:

# Megatron MoE:
def glu(x):
    x = torch.chunk(x, 2, dim=-1)
    return F.silu(x[0]) * x[1]

# TensorRT-LLM:
def glu(x):
    x = torch.chunk(x, 2, dim=-1)
    return F.silu(x[1]) * x[0]
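A quick self-contained check of the mismatch (assuming torch; an illustration, not either framework's actual code): with the same input layout the two conventions disagree, and swapping the two halves of the fused tensor, which is what the w3/w1 concatenation order does at checkpoint conversion, restores agreement.

import torch
import torch.nn.functional as F

def glu_megatron(x):
    a, b = torch.chunk(x, 2, dim=-1)
    return F.silu(a) * b

def glu_trtllm(x):
    a, b = torch.chunk(x, 2, dim=-1)
    return F.silu(b) * a

x = torch.randn(4, 32)

# Same layout, different conventions: results disagree.
print(torch.allclose(glu_megatron(x), glu_trtllm(x)))  # False

# Swapping the halves (what the w3/w1 concat order does at
# checkpoint conversion) makes the two agree.
a, b = torch.chunk(x, 2, dim=-1)
print(torch.allclose(glu_megatron(x), glu_trtllm(torch.cat([b, a], dim=-1))))  # True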
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
See the links in the description above.
Expected behavior
Keep the same implementation as the training framework.
Actual behavior
Inference results are not aligned with those of the training framework.
Additional notes
N/A