handoku opened this issue 2 months ago
Hi @handoku, could you please try trtllm 0.11.0 with triton-trtllm-24.07 to see if this issue still exists?
@QiJune It seems that 0.11.0 has some minor bugs with qwen2-moe int8 weight-only quantization.
First, I got an AttributeError ('PretrainedConfig' object has no attribute 'moe'):
[08/04/2024-21:33:19] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 470.161.03. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
0.11.0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:06<00:00, 1.30it/s]
Traceback (most recent call last):
File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 340, in <module>
main()
File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 332, in main
convert_and_save_hf(args)
File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 288, in convert_and_save_hf
execute(args.workers, [convert_and_save_rank] * world_size, args)
File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 295, in execute
f(args, rank)
File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 275, in convert_and_save_rank
qwen = from_hugging_face(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1265, in from_hugging_face
qwen = cls.from_config(pretrained_config)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 400, in from_config
return cls(config)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 361, in __call__
obj = type.__call__(cls, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 220, in __init__
transformer = QWenModel(config)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 167, in __init__
self.layers = DecoderLayerList(QWenDecoderLayer, config)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 289, in __init__
super().__init__([cls(config, idx) for idx in self.layer_list])
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 289, in <listcomp>
super().__init__([cls(config, idx) for idx in self.layer_list])
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 68, in __init__
"moe_config": config.moe,
AttributeError: 'PretrainedConfig' object has no attribute 'moe'
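In 0.11.0 the qwen PretrainedConfig apparently carries no MoE section, while QWenDecoderLayer reads config.moe. A rough sketch of patching it by hand is below; attach_moe_section is a hypothetical helper of mine, and the field names are assumptions taken from the main-branch QWenConfig and the Hugging Face Qwen2MoeConfig, not verified against the 0.11.0 sources:

```python
from tensorrt_llm.layers import MoeConfig


def attach_moe_section(pretrained_config, hf_config):
    """Hypothetical helper: copy the MoE fields QWenDecoderLayer expects
    from the Hugging Face config onto the TRT-LLM pretrained config."""
    # For Qwen1.5-MoE-A2.7B the HF config.json has 60 routed experts and
    # 4 experts activated per token (num_experts / num_experts_per_tok).
    pretrained_config.moe = MoeConfig(
        num_experts=hf_config.num_experts,
        top_k=hf_config.num_experts_per_tok,
    )
    return pretrained_config
```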
After borrowing QwenConfig from the main branch, I got this:
root@host:~/workspace/TensorRT-LLM-0.11.0/examples/qwen# python convert_checkpoint.py --model_dir /root/workspace/data_dir/Qwen1.5-MoE-A2.7B-Chat --output_dir /root/workspace/data_dir/agent/trtllm_ckpt/Qwen1.5-MoE-A2.7B-Chat --dtype float16 --use_weight_only --weight_only_precision int8 --load_model_on_cpu
[08/04/2024-22:00:52] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 470.161.03. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
0.11.0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 5.19it/s]
Weights loaded. Total time: 00:02:52
Traceback (most recent call last):
File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 340, in <module>
main()
File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 332, in main
convert_and_save_hf(args)
File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 288, in convert_and_save_hf
execute(args.workers, [convert_and_save_rank] * world_size, args)
File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 295, in execute
f(args, rank)
File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 275, in convert_and_save_rank
qwen = from_hugging_face(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1278, in from_hugging_face
qwen.load(weights)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 439, in load
raise RuntimeError(
RuntimeError: Required but not provided tensors:{'transformer.layers.16.shared_expert_gate.per_channel_scale', 'transformer.layers.10.mlp.router.per_channel_scale', 'transformer.layers.0.shared_expert_gate.per_channel_scale', 'transformer.layers.6.shared_expert_gate.per_channel_scale', 'transformer.layers.2.shared_expert_gate.per_channel_scale', 'transformer.layers.3.shared_expert_gate.per_channel_scale', 'transformer.layers.21.shared_expert_gate.per_channel_scale', 'transformer.layers.8.mlp.router.per_channel_scale', 'transformer.layers.14.mlp.router.per_channel_scale', 'transformer.layers.20.mlp.router.per_channel_scale', 'transformer.layers.18.shared_expert_gate.per_channel_scale', 'transformer.layers.7.shared_expert_gate.per_channel_scale', 'transformer.layers.19.mlp.router.per_channel_scale', 'transformer.layers.4.shared_expert_gate.per_channel_scale', 'transformer.layers.9.mlp.router.per_channel_scale', 'transformer.layers.21.mlp.router.per_channel_scale', 'transformer.vocab_embedding.per_token_scale', 'transformer.layers.2.mlp.router.per_channel_scale', 'transformer.layers.17.shared_expert_gate.per_channel_scale', 'transformer.layers.20.shared_expert_gate.per_channel_scale', 'transformer.layers.16.mlp.router.per_channel_scale', 'transformer.layers.5.shared_expert_gate.per_channel_scale', 'transformer.layers.13.shared_expert_gate.per_channel_scale', 'transformer.layers.8.shared_expert_gate.per_channel_scale', 'lm_head.per_channel_scale', 'transformer.layers.5.mlp.router.per_channel_scale', 'transformer.layers.12.mlp.router.per_channel_scale', 'transformer.layers.6.mlp.router.per_channel_scale', 'transformer.layers.23.mlp.router.per_channel_scale', 'transformer.layers.15.shared_expert_gate.per_channel_scale', 'transformer.layers.3.mlp.router.per_channel_scale', 'transformer.layers.1.mlp.router.per_channel_scale', 'transformer.layers.9.shared_expert_gate.per_channel_scale', 'transformer.layers.1.shared_expert_gate.per_channel_scale', 'transformer.layers.12.shared_expert_gate.per_channel_scale', 'transformer.layers.7.mlp.router.per_channel_scale', 'transformer.layers.0.mlp.router.per_channel_scale', 'transformer.layers.4.mlp.router.per_channel_scale', 'transformer.layers.19.shared_expert_gate.per_channel_scale', 'transformer.layers.23.shared_expert_gate.per_channel_scale', 'transformer.layers.17.mlp.router.per_channel_scale', 'transformer.layers.10.shared_expert_gate.per_channel_scale', 'transformer.layers.22.shared_expert_gate.per_channel_scale', 'transformer.layers.18.mlp.router.per_channel_scale', 'transformer.layers.15.mlp.router.per_channel_scale', 'transformer.layers.22.mlp.router.per_channel_scale', 'transformer.layers.13.mlp.router.per_channel_scale', 'transformer.layers.11.shared_expert_gate.per_channel_scale', 'transformer.layers.14.shared_expert_gate.per_channel_scale', 'transformer.layers.11.mlp.router.per_channel_scale'}
Exception ignored in: <function PretrainedModel.__del__ at 0x7f8f229e5000>
However, router and shared_expert_gate should already be excluded by this.
I am still working on this; I would appreciate it if anyone has quick-fix advice. Thanks in advance.
Update: after setting exclude_module = ['*lm_head', '*router', '*vocab_embedding', '*shared_expert_gate'], convert_checkpoint and trtllm-build succeeded.
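For reference, the exclusion list corresponds roughly to the quantization config below. This is only a sketch of the shape of the fix, assuming the QuantConfig class and its exclude_modules field; it is not the exact patch I applied:

```python
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization import QuantAlgo

# Int8 weight-only quantization, with the MoE router, the Qwen2-MoE
# shared_expert_gate, lm_head and the embeddings kept in fp16. These modules
# are never quantized, so they have no per_channel_scale tensors to provide,
# which is why they must be excluded rather than expected by the loader.
quant_config = QuantConfig(
    quant_algo=QuantAlgo.W8A16,
    exclude_modules=[
        '*lm_head',
        '*router',
        '*vocab_embedding',
        '*shared_expert_gate',
    ],
)
```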
I am using trtllm 0.8.0 (with MoE support added following llama's implementation). We serve models with trtllm_backend (Docker image triton-trtllm-24.02).
qwen2-moe-57B-A14B runs well on a single NVIDIA A800. But if we run it with tp=2 (two A800s or L40s), here is what we got: the error occurs when sending a large number of requests to tritonserver.
Here is my build config for two A800s:
Looking for help; it may be a bug in the MoE kernel.