NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

unable to build qwen awq model with multi gpus #776

Open · tbup opened this issue 8 months ago

tbup commented 8 months ago

python quantize.py --model_dir /qwen-14b-chat --dtype float16 --qformat int4_awq --export_path ./qwen_14b_4bit_gs128_awq.pt --calib_size 32

python build.py --hf_model_dir=/qwen-14b-chat/ --quant_ckpt_path ./qwen_14b_4bit_gs128_awq.pt --output_dir ./tmp/ --dtype float16 --use_inflight_batching --paged_kv_cache --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --remove_input_padding --max_batch_size 16 --enable_context_fmha --use_weight_only --weight_only_precision int4_awq --per_group --world_size 2 --tp_size 2
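For reference, the tensor shapes inside the exported AWQ checkpoint can be dumped with a few lines of Python before running build.py. This is only a debugging sketch and assumes the .pt export is a flat name→tensor dictionary; the actual layout produced by quantize.py may differ:

```python
import torch

# Debugging sketch: print the MLP-related tensor shapes in the AWQ export.
# Assumes the checkpoint is a flat {name: tensor} dict; adjust if it is nested.
ckpt = torch.load("./qwen_14b_4bit_gs128_awq.pt", map_location="cpu")
if isinstance(ckpt, dict):
    for name, value in ckpt.items():
        if torch.is_tensor(value) and "mlp" in name:
            print(name, tuple(value.shape), value.dtype)
```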

[12/29/2023-11:05:28] [TRT-LLM] [I] Serially build TensorRT engines.
[12/29/2023-11:05:28] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 113, GPU 263 (MiB)
[12/29/2023-11:05:32] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +312, now: CPU 2048, GPU 575 (MiB)
[12/29/2023-11:05:32] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[12/29/2023-11:05:32] [TRT-LLM] [I] Loading weights from groupwise AWQ Qwen safetensors...
[12/29/2023-11:05:45] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
Loading weights...: 0%| | 0/40 [00:00<?, ?it/s]
[12/29/2023-11:05:45] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:45] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.INT8 but set to int8
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.INT8 but set to int8
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.INT8 but set to int8
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
Loading weights...: 0%| | 0/40 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/qwen/build.py", line 642, in build(0, args)
  File "/app/tensorrt_llm/examples/qwen/build.py", line 612, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/app/tensorrt_llm/examples/qwen/build.py", line 457, in build_rank_engine
    load_func(tensorrt_llm_qwen=tensorrt_llm_qwen,
  File "/app/tensorrt_llm/examples/qwen/weight.py", line 897, in load_from_awq_qwen
    process_and_assign_weight(model_params, mPrefix, mOp, 0)
  File "/app/tensorrt_llm/examples/qwen/weight.py", line 830, in process_and_assign_weight
    mOp.qweight.value = AWQ_quantize_pack_preprocess(weight, scale)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 112, in value
    assert v.shape == self._shape, \
AssertionError: The value updated is not the same shape as the original. Updated: (2560, 6848), original: (5120, 3424)
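The assertion already points at the tensor-parallel split: the engine-side parameter for rank 0 expects the first dimension kept whole (5120) and the second dimension halved (3424), while the loader produced a tensor with the first dimension halved (2560) and the second dimension left whole (6848). In other words, the checkpoint tensor appears to be sharded along the wrong axis for tp_size 2. A toy numpy illustration of the two splits; the shapes simply mirror the assertion message, and the real packing details in weight.py may differ:

```python
import numpy as np

# Toy illustration only: a (5120, 6848) packed weight split for tp_size=2.
full = np.zeros((5120, 6848), dtype=np.int8)

wrong = np.split(full, 2, axis=0)[0]  # (2560, 6848): what the loader produced
right = np.split(full, 2, axis=1)[0]  # (5120, 3424): what the parameter expects

print(wrong.shape, right.shape)
```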

juney-nvidia commented 8 months ago

@tbup Thanks for reporting this. I will discuss it with the engineer who is adding the Qwen INT4 AWQ support and help investigate.

June

juney-nvidia commented 8 months ago

@tbup Our engineer has already started the investigation and is working on a fix.

Thanks,
June

nanmi commented 8 months ago

> @tbup Our engineer has already started the investigation and is working on a fix.
>
> Thanks,
> June

Maybe the MLP gate and up_proj weights need the same split along dim=1 (ColumnLinear should split the output channels), while down_proj is split along dim=0 (RowLinear should split the input channels).
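If that is the cause, the fix on the loading side would be to shard each MLP weight along the axis that matches its linear type before assigning it to the engine parameter. A minimal sketch of that rule, illustrative only and not the actual weight.py code, assuming an (in_features, out_features) weight layout:

```python
import numpy as np

def tp_shard(weight, tp_size, rank, linear_type):
    # ColumnLinear (gate/up_proj, QKV): shard the output channels -> axis 1.
    # RowLinear (down_proj, attention output): shard the input channels -> axis 0.
    axis = 1 if linear_type == "column" else 0
    return np.split(weight, tp_size, axis=axis)[rank]

w_up = np.zeros((5120, 6848), dtype=np.float16)    # gate/up_proj full weight
w_down = np.zeros((6848, 5120), dtype=np.float16)  # down_proj full weight

print(tp_shard(w_up, 2, 0, "column").shape)  # (5120, 3424)
print(tp_shard(w_down, 2, 0, "row").shape)   # (3424, 5120)
```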