NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

swiglu impl of moe plugin #1842

Closed jingjie01ai closed 3 days ago

jingjie01ai commented 3 days ago

System Info

Hopper GPU inference

Who can help?

@byshiue

https://github.com/NVIDIA/TensorRT-LLM/blob/9691e12bce7ae1c126c435a049eb516eb119486c/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu#L998
https://github.com/vllm-project/vllm/blob/main/csrc/activation_kernels.cu#L75
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/moe/experts.py#L48

The implementation of SwiGLU in TensorRT-LLM differs from mainstream implementations. In frameworks such as Megatron and Hugging Face for training, and vLLM for inference, the activation pattern is `SiLU(x) * y`, whereas in TensorRT-LLM it is `x * SiLU(y)`. This difference is non-trivial and can lead to abnormal results that are difficult to diagnose.

What the functions look like:

megatron moe:

    def glu(x):
        x = torch.chunk(x, 2, dim=-1)
        return F.silu(x[0]) * x[1]

trt-llm:

    def glu(x):
        x = torch.chunk(x, 2, dim=-1)
        return F.silu(x[1]) * x[0]
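
A minimal sketch (standard PyTorch assumed; not from the original report) showing that the two chunk orders disagree on the same packed tensor, which is why a mismatch between the checkpoint packing and the kernel's convention silently corrupts results:

    import torch
    import torch.nn.functional as F

    def glu_megatron(x):
        # Megatron/HF/vLLM convention: SiLU on the first half
        a, b = torch.chunk(x, 2, dim=-1)
        return F.silu(a) * b

    def glu_trtllm_style(x):
        # TensorRT-LLM convention: SiLU on the second half
        a, b = torch.chunk(x, 2, dim=-1)
        return F.silu(b) * a

    x = torch.randn(4, 8)
    print(torch.allclose(glu_megatron(x), glu_trtllm_style(x)))  # False in general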

Information

Tasks

Reproduction

https://github.com/NVIDIA/TensorRT-LLM/blob/9691e12bce7ae1c126c435a049eb516eb119486c/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu#L998
https://github.com/vllm-project/vllm/blob/main/csrc/activation_kernels.cu#L75
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/moe/experts.py#L48

Expected behavior

Keep the same implementation as the training framework.

actual behavior

Inference results are not aligned with the training framework.

additional notes

NA

nv-guomingz commented 3 days ago

Please refer to https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/models/llama/convert.py#L862 and https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/layers/moe.py#L344

We concatenate the weights in w3/w1 order rather than w1/w3, which is why we implement swiglu as shown below.

    def swiglu(input: Tensor):
        x, gate = chunk(input, 2, dim=-1)  # x comes from w3, gate comes from w1
        return silu(gate) * x

You may also refer to the Mixtral modeling definition.
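
For context, a minimal sketch (hypothetical shapes and names, standard PyTorch assumed) checking that the two conventions agree once the fused weight is concatenated in w3/w1 order:

    import torch
    import torch.nn.functional as F

    hidden, inter = 16, 32
    h = torch.randn(4, hidden)
    w1 = torch.randn(inter, hidden)   # gate projection
    w3 = torch.randn(inter, hidden)   # up projection

    # Reference (Megatron/HF/vLLM style): SiLU(w1 h) * (w3 h)
    ref = F.silu(h @ w1.T) * (h @ w3.T)

    # TensorRT-LLM style: fuse the weights as [w3; w1], then compute SiLU(gate) * x
    fused = torch.cat([w3, w1], dim=0)
    x, gate = torch.chunk(h @ fused.T, 2, dim=-1)
    out = F.silu(gate) * x

    print(torch.allclose(ref, out, atol=1e-5))  # True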

jingjie01ai commented 3 days ago

OK, got it.