NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

swiglu impl of moe plugin #1842

Closed jingjie01ai closed 3 days ago

jingjie01ai commented 3 days ago

System Info

Hopper GPU inference

Who can help?

@byshiue

https://github.com/NVIDIA/TensorRT-LLM/blob/9691e12bce7ae1c126c435a049eb516eb119486c/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu#L998
https://github.com/vllm-project/vllm/blob/main/csrc/activation_kernels.cu#L75
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/moe/experts.py#L48

The implementation of SwiGLU in TensorRT-LLM differs from mainstream implementations. In frameworks such as Megatron and Hugging Face for training, and vLLM for inference, the activation pattern is `SiLU(x) * y`, whereas in TensorRT-LLM it is `x * SiLU(y)`. This difference is non-trivial and can lead to abnormal results that are difficult to diagnose.

What the functions look like:

megatron moe:

    def glu(x):
        x = torch.chunk(x, 2, dim=-1)
        return F.silu(x[0]) * x[1]

trt-llm:

    def glu(x):
        x = torch.chunk(x, 2, dim=-1)
        return F.silu(x[1]) * x[0]
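
A minimal sketch (standard PyTorch assumed; not from the original report) showing that the two chunk orders disagree on the same packed tensor, which is why a mismatch between the checkpoint packing and the kernel's convention silently corrupts results:

    import torch
    import torch.nn.functional as F

    def glu_megatron(x):
        # Megatron/HF/vLLM convention: SiLU on the first half
        a, b = torch.chunk(x, 2, dim=-1)
        return F.silu(a) * b

    def glu_trtllm_style(x):
        # TensorRT-LLM convention: SiLU on the second half
        a, b = torch.chunk(x, 2, dim=-1)
        return F.silu(b) * a

    x = torch.randn(4, 8)
    print(torch.allclose(glu_megatron(x), glu_trtllm_style(x)))  # False in general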

Information

Tasks

Reproduction

https://github.com/NVIDIA/TensorRT-LLM/blob/9691e12bce7ae1c126c435a049eb516eb119486c/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu#L998
https://github.com/vllm-project/vllm/blob/main/csrc/activation_kernels.cu#L75
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/moe/experts.py#L48

Expected behavior

Keep the same implementation as the training framework.

actual behavior

Inference results are not aligned with the training framework.

additional notes

NA

nv-guomingz commented 3 days ago

Please refer to https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/models/llama/convert.py#L862 and https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/layers/moe.py#L344

We concatenate the weights in w3/w1 order rather than w1/w3, which is why we implement swiglu as shown below.

    def swiglu(input: Tensor):
        x, gate = chunk(input, 2, dim=-1)  # x comes from w3, gate comes from w1
        return silu(gate) * x

You may also refer to the Mixtral modeling definition.
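
For context, a minimal sketch (hypothetical shapes and names, standard PyTorch assumed) checking that the two conventions agree once the fused weight is concatenated in w3/w1 order:

    import torch
    import torch.nn.functional as F

    hidden, inter = 16, 32
    h = torch.randn(4, hidden)
    w1 = torch.randn(inter, hidden)   # gate projection
    w3 = torch.randn(inter, hidden)   # up projection

    # Reference (Megatron/HF/vLLM style): SiLU(w1 h) * (w3 h)
    ref = F.silu(h @ w1.T) * (h @ w3.T)

    # TensorRT-LLM style: fuse the weights as [w3; w1], then compute SiLU(gate) * x
    fused = torch.cat([w3, w1], dim=0)
    x, gate = torch.chunk(h @ fused.T, 2, dim=-1)
    out = F.silu(gate) * x

    print(torch.allclose(ref, out, atol=1e-5))  # True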

jingjie01ai commented 3 days ago

OK, got it.