NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

moe kernel Assertion failed when running qwen2-moe-57B-A14B with TP enabled #1925

Open handoku opened 2 months ago

handoku commented 2 months ago

I am using trtllm 0.8.0 (I added MoE support following llama's implementation). We serve models with trtllm_backend (Docker image triton-trtllm-24.02).

qwen2-moe-57B-A14B runs well on a single NVIDIA A800. However, if we run it with tp=2 (two A800s or two L40s), here is what we get:

I0709 12:38:32.372394 271 grpc_server.cc:2519] Started GRPCInferenceService at 0.0.0.0:8101
I0709 12:38:32.372581 271 http_server.cc:4685] Started HTTPService at 0.0.0.0:8100
I0709 12:38:32.424085 271 http_server.cc:320] Started Metrics Service at 0.0.0.0:8102
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: GPU lacks the shared memory resources to run GroupedGEMM kernel (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_kernels_template.h:122)
1       0x7fa66c2614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fa66c5f4273 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0xb0b273) [0x7fa66c5f4273]
3       0x7fa66c61561f void tensorrt_llm::MoeGemmRunner<__half, unsigned char>::runGemm<tensorrt_llm::cutlass_extensions::EpilogueOpDefault>(__half const*, unsigned char const*, __half const*, __half const*, __half*, long*, long, long, long, int, CUstream_st*) + 591
4       0x7fa66dce0f67 tensorrt_llm::kernels::CutlassMoeFCRunner<__half, unsigned char, void>::runMoe(void const*, float const*, void const*, void const*, void const*, tensorrt_llm::ActivationType, void const*, void const*, void const*, int, int, int, int, int, char*, void*, void*, bool const*, int, void*, int*, int*, tensorrt_llm::kernels::MOEParallelismConfig, tensorrt_llm::kernels::MOEExpertScaleNormalizationMode, CUstream_st*) + 1751
5       0x7fa750f1ad9a tensorrt_llm::plugins::MixtureOfExpertsPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 954
6       0x7fa627706ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7fa627706ba9]
7       0x7fa6276dc6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7fa6276dc6af]
8       0x7fa6276de320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7fa6276de320]
9       0x7fa66e145a7b tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 59
10      0x7fa66e147714 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1188
11      0x7fa66e14d724 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3716
12      0x7fa66e11da68 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 56
13      0x7fa66e1227c7 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 247
14      0x7fa76e4b0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fa76e4b0253]
15      0x7fa76e158ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fa76e158ac3]
16      0x7fa76e1ea850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fa76e1ea850]

This error occurs when sending a large number of requests to tritonserver.

Here is my build config for two A800s:

python build.py --hf_model_dir /data/cbs/models/Qwen2-57B-A14B-Instruct \
    --dtype float16 \
    --use_inflight_batching \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --max_batch_size 128 \
    --max_input_len 6144 \
    --max_output_len 1024 \
    --max_num_tokens 204800 \
    --use_weight_only \
    --weight_only_precision int8 \
    --tp_size 2 \
    --world_size 2 \
    --tp_mode 2 \
    --output_dir /data/cbs/engines/Qwen2-57B-A14B-Instruct

Looking for help; it may be a bug in the MoE kernel.
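
For context, the assertion in moe_gemm_kernels_template.h means the CUTLASS grouped-GEMM configuration chosen for the MoE layers needs more dynamic shared memory than the GPU exposes. A quick way to compare the per-rank limits on the GPUs involved is a plain CUDA attribute query; the minimal sketch below uses the cuda-python runtime bindings (their availability in the container is an assumption) and is a generic diagnostic, not TensorRT-LLM API:

# Hedged diagnostic sketch: print the shared-memory limits that the
# grouped-GEMM kernel has to fit into on each GPU. Assumes cuda-python.
from cuda import cudart

def print_smem_limits(device_id: int = 0):
    attrs = {
        "max shared memory per block (opt-in)":
            cudart.cudaDeviceAttr.cudaDevAttrMaxSharedMemoryPerBlockOptin,
        "max shared memory per multiprocessor":
            cudart.cudaDeviceAttr.cudaDevAttrMaxSharedMemoryPerMultiprocessor,
    }
    for name, attr in attrs.items():
        err, value = cudart.cudaDeviceGetAttribute(attr, device_id)
        assert err == cudart.cudaError_t.cudaSuccess
        print(f"GPU {device_id}: {name} = {value // 1024} KiB")

print_smem_limits(0)

An A800 (sm_80) and an L40 (sm_89) report noticeably different limits here, which is one reason the same engine configuration can trip this assertion on one setup and not another.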

QiJune commented 1 month ago

Hi @handoku, could you please try trtllm 0.11.0 with triton-trtllm-24.07 to see if this issue still exists?

handoku commented 1 month ago

@QiJune It seems that 0.11.0 has some minor bugs for qwen2-moe int8 weight-only quantization. First, I got AttributeError: 'PretrainedConfig' object has no attribute 'moe':

[08/04/2024-21:33:19] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 470.161.03. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
0.11.0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:06<00:00,  1.30it/s]
Traceback (most recent call last):
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 340, in <module>
    main()
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 332, in main
    convert_and_save_hf(args)
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 288, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 295, in execute
    f(args, rank)
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 275, in convert_and_save_rank
    qwen = from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1265, in from_hugging_face
    qwen = cls.from_config(pretrained_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 400, in from_config
    return cls(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 361, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 220, in __init__
    transformer = QWenModel(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 167, in __init__
    self.layers = DecoderLayerList(QWenDecoderLayer, config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 289, in __init__
    super().__init__([cls(config, idx) for idx in self.layer_list])
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 289, in <listcomp>
    super().__init__([cls(config, idx) for idx in self.layer_list])
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 68, in __init__
    "moe_config": config.moe,
AttributeError: 'PretrainedConfig' object has no attribute 'moe'
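
One hedged stop-gap for this first error (the fix actually used below was borrowing QWenConfig from the main branch) is to attach a default moe block to the config before the model object is built. The sketch below is an illustration only; the MoeConfig import path and the Hugging Face Qwen2-MoE field names are assumptions about the 0.11.0 layout:

# Hedged sketch (not the fix applied in this thread): give the TRT-LLM
# config a 'moe' attribute before QWenForCausalLM is constructed.
# Import path and HF config field names are assumptions.
from tensorrt_llm.layers import MoeConfig

def ensure_moe_config(trtllm_config, hf_config):
    """Attach a MoeConfig built from the HF Qwen2-MoE config if it is missing."""
    if not hasattr(trtllm_config, 'moe'):
        trtllm_config.moe = MoeConfig(
            num_experts=hf_config.num_experts,        # values come from the HF config
            top_k=hf_config.num_experts_per_tok,
        )
    return trtllm_config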

After borrowing QwenConfig from the main branch, I got this:

root@host:~/workspace/TensorRT-LLM-0.11.0/examples/qwen# python convert_checkpoint.py \
    --model_dir /root/workspace/data_dir/Qwen1.5-MoE-A2.7B-Chat \
    --output_dir /root/workspace/data_dir/agent/trtllm_ckpt/Qwen1.5-MoE-A2.7B-Chat \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --load_model_on_cpu
[08/04/2024-22:00:52] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 470.161.03. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
0.11.0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  5.19it/s]
Weights loaded. Total time: 00:02:52
Traceback (most recent call last):
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 340, in <module>
    main()
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 332, in main
    convert_and_save_hf(args)
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 288, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 295, in execute
    f(args, rank)
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 275, in convert_and_save_rank
    qwen = from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1278, in from_hugging_face
    qwen.load(weights)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 439, in load
    raise RuntimeError(
RuntimeError: Required but not provided tensors:{'transformer.layers.16.shared_expert_gate.per_channel_scale', 'transformer.layers.10.mlp.router.per_channel_scale', 'transformer.layers.0.shared_expert_gate.per_channel_scale', 'transformer.layers.6.shared_expert_gate.per_channel_scale', 'transformer.layers.2.shared_expert_gate.per_channel_scale', 'transformer.layers.3.shared_expert_gate.per_channel_scale', 'transformer.layers.21.shared_expert_gate.per_channel_scale', 'transformer.layers.8.mlp.router.per_channel_scale', 'transformer.layers.14.mlp.router.per_channel_scale', 'transformer.layers.20.mlp.router.per_channel_scale', 'transformer.layers.18.shared_expert_gate.per_channel_scale', 'transformer.layers.7.shared_expert_gate.per_channel_scale', 'transformer.layers.19.mlp.router.per_channel_scale', 'transformer.layers.4.shared_expert_gate.per_channel_scale', 'transformer.layers.9.mlp.router.per_channel_scale', 'transformer.layers.21.mlp.router.per_channel_scale', 'transformer.vocab_embedding.per_token_scale', 'transformer.layers.2.mlp.router.per_channel_scale', 'transformer.layers.17.shared_expert_gate.per_channel_scale', 'transformer.layers.20.shared_expert_gate.per_channel_scale', 'transformer.layers.16.mlp.router.per_channel_scale', 'transformer.layers.5.shared_expert_gate.per_channel_scale', 'transformer.layers.13.shared_expert_gate.per_channel_scale', 'transformer.layers.8.shared_expert_gate.per_channel_scale', 'lm_head.per_channel_scale', 'transformer.layers.5.mlp.router.per_channel_scale', 'transformer.layers.12.mlp.router.per_channel_scale', 'transformer.layers.6.mlp.router.per_channel_scale', 'transformer.layers.23.mlp.router.per_channel_scale', 'transformer.layers.15.shared_expert_gate.per_channel_scale', 'transformer.layers.3.mlp.router.per_channel_scale', 'transformer.layers.1.mlp.router.per_channel_scale', 'transformer.layers.9.shared_expert_gate.per_channel_scale', 'transformer.layers.1.shared_expert_gate.per_channel_scale', 'transformer.layers.12.shared_expert_gate.per_channel_scale', 'transformer.layers.7.mlp.router.per_channel_scale', 'transformer.layers.0.mlp.router.per_channel_scale', 'transformer.layers.4.mlp.router.per_channel_scale', 'transformer.layers.19.shared_expert_gate.per_channel_scale', 'transformer.layers.23.shared_expert_gate.per_channel_scale', 'transformer.layers.17.mlp.router.per_channel_scale', 'transformer.layers.10.shared_expert_gate.per_channel_scale', 'transformer.layers.22.shared_expert_gate.per_channel_scale', 'transformer.layers.18.mlp.router.per_channel_scale', 'transformer.layers.15.mlp.router.per_channel_scale', 'transformer.layers.22.mlp.router.per_channel_scale', 'transformer.layers.13.mlp.router.per_channel_scale', 'transformer.layers.11.shared_expert_gate.per_channel_scale', 'transformer.layers.14.shared_expert_gate.per_channel_scale', 'transformer.layers.11.mlp.router.per_channel_scale'}
Exception ignored in: <function PretrainedModel.__del__ at 0x7f8f229e5000>

However, router and shared_expert_gate should already be excluded by the existing weight-only exclusion list in the conversion code.

I am still working on this; I would appreciate it if anyone has quick-fix advice. Thanks in advance.

Update: after setting exclude_module = ['*lm_head', '*router', '*vocab_embedding', '*shared_expert_gate'], convert_checkpoint and trtllm-build succeeded.
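
For anyone hitting the same RuntimeError, here is a minimal sketch of where such an exclusion list plugs in, assuming the 0.11.0 QuantConfig dataclass (quant_algo / exclude_modules fields) and import paths; the exact place inside examples/qwen/convert_checkpoint.py where the config is built is not shown here:

# Hedged sketch: an int8 weight-only QuantConfig whose exclusion list mirrors
# the one reported above, so router, shared_expert_gate, lm_head and the
# vocab embedding are left unquantized. Import paths and field names are
# assumptions about the 0.11.0 API.
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization import QuantAlgo

quant_config = QuantConfig(
    quant_algo=QuantAlgo.W8A16,  # int8 weight-only
    exclude_modules=[
        '*lm_head',
        '*router',
        '*vocab_embedding',
        '*shared_expert_gate',
    ],
)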