NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

chatglm2-6b smoothquant multi-tp build failed on 0.9.0 branch #1490

Open NaNAGISaSA opened 5 months ago

NaNAGISaSA commented 5 months ago

System Info

- CPU architecture: x86_64
- GPU properties
  - GPU name: NVIDIA A100
  - GPU memory size: 40G
- Libraries
  - TensorRT-LLM branch or tag: v0.9.0
  - TensorRT-LLM commit: 250d9c293d
  - Container used: yes, `make -C docker release_build` on v0.9.0 tag
- NVIDIA driver version: 515.105.01
- OS: Ubuntu 22.04

Who can help?

@Tracin

Reproduction

tp_size=2

python3 examples/chatglm/convert_checkpoint.py --model_dir ${hf_model_dir} \
    --tp_size ${tp_size} \
    --workers ${tp_size} \
    --dtype float16 \
    --smoothquant 0.5 \
    --output_dir ${convert_model_dir}/sq/${tp_size}-gpu/

trtllm-build --checkpoint_dir ${convert_model_dir}/sq/${tp_size}-gpu/ \
    --output_dir ${trt_engine_dir}/sq/${tp_size}-gpu/ \
    --use_fused_mlp \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --context_fmha_fp32_acc enable \
    --remove_input_padding enable \
    --multi_block_mode enable \
    --workers ${tp_size} \
    --max_batch_size 128 \
    --max_input_len 2048 \
    --max_output_len 2048

Expected behavior

The engine build succeeds.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.9.0
0.9.0
Inferring chatglm version from path...
Chatglm version: chatglm2
Loading checkpoint shards: 100%|██████████| 7/7 [00:04<00:00, 1.56it/s]
Calibration: 100%|██████████| 64/64 [00:05<00:00, 11.92it/s]
Smoothing module: transformer.encoder.layers.0.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.0.self_attention.dense
Smoothing module: transformer.encoder.layers.0.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.0.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.1.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.1.self_attention.dense
Smoothing module: transformer.encoder.layers.1.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.1.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.2.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.2.self_attention.dense
Smoothing module: transformer.encoder.layers.2.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.2.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.3.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.3.self_attention.dense
Smoothing module: transformer.encoder.layers.3.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.3.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.4.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.4.self_attention.dense
Smoothing module: transformer.encoder.layers.4.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.4.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.5.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.5.self_attention.dense
Smoothing module: transformer.encoder.layers.5.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.5.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.6.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.6.self_attention.dense
Smoothing module: transformer.encoder.layers.6.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.6.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.7.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.7.self_attention.dense
Smoothing module: transformer.encoder.layers.7.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.7.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.8.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.8.self_attention.dense
Smoothing module: transformer.encoder.layers.8.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.8.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.9.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.9.self_attention.dense
Smoothing module: transformer.encoder.layers.9.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.9.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.10.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.10.self_attention.dense
Smoothing module: transformer.encoder.layers.10.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.10.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.11.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.11.self_attention.dense
Smoothing module: transformer.encoder.layers.11.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.11.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.12.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.12.self_attention.dense
Smoothing module: transformer.encoder.layers.12.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.12.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.13.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.13.self_attention.dense
Smoothing module: transformer.encoder.layers.13.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.13.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.14.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.14.self_attention.dense
Smoothing module: transformer.encoder.layers.14.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.14.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.15.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.15.self_attention.dense
Smoothing module: transformer.encoder.layers.15.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.15.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.16.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.16.self_attention.dense
Smoothing module: transformer.encoder.layers.16.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.16.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.17.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.17.self_attention.dense
Smoothing module: transformer.encoder.layers.17.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.17.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.18.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.18.self_attention.dense
Smoothing module: transformer.encoder.layers.18.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.18.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.19.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.19.self_attention.dense
Smoothing module: transformer.encoder.layers.19.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.19.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.20.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.20.self_attention.dense
Smoothing module: transformer.encoder.layers.20.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.20.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.21.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.21.self_attention.dense
Smoothing module: transformer.encoder.layers.21.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.21.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.22.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.22.self_attention.dense
Smoothing module: transformer.encoder.layers.22.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.22.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.23.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.23.self_attention.dense
Smoothing module: transformer.encoder.layers.23.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.23.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.24.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.24.self_attention.dense
Smoothing module: transformer.encoder.layers.24.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.24.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.25.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.25.self_attention.dense
Smoothing module: transformer.encoder.layers.25.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.25.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.26.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.26.self_attention.dense
Smoothing module: transformer.encoder.layers.26.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.26.mlp.dense_4h_to_h
Smoothing module: transformer.encoder.layers.27.self_attention.query_key_value
Smoothing module: transformer.encoder.layers.27.self_attention.dense
Smoothing module: transformer.encoder.layers.27.mlp.dense_h_to_4h
Smoothing module: transformer.encoder.layers.27.mlp.dense_4h_to_h
Weights loaded. Total time: 00:05:59
Weights loaded. Total time: 00:06:01
Total time of converting checkpoints: 00:06:49
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[04/23/2024-06:41:08] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set lookup_plugin to None.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set lora_plugin to None.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set moe_plugin to float16.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set context_fmha to True.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set context_fmha_fp32_acc to True.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set remove_input_padding to True.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set multi_block_mode to True.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set enable_xqa to True.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set multiple_profiles to False.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set paged_state to True.
[04/23/2024-06:41:08] [TRT-LLM] [I] Set streamingllm to False.
[04/23/2024-06:41:08] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/23/2024-06:41:08] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[04/23/2024-06:41:14] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 401, in load
    param.value = weights[name]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 120, in value
    assert v.shape == self._shape, \
AssertionError: The value updated is not the same shape as the original. Updated: (4608, 4096), original: (2304, 4096)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 291, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 268, in build_model
    model = load_model(rank_config, ckpt_dir, model_cls)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1041, in load_model
    model.load(weights)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 403, in load
    raise RuntimeError(
RuntimeError: Encounter error 'The value updated is not the same shape as the original. Updated: (4608, 4096), original: (2304, 4096)' for parameter 'transformer.layers.0.attention.qkv.weight'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 347, in parallel_build
    future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
RuntimeError: Encounter error 'The value updated is not the same shape as the original. Updated: (4608, 4096), original: (2304, 4096)' for parameter 'transformer.layers.0.attention.qkv.weight'
[04/23/2024-06:41:15] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 401, in load
    param.value = weights[name]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 120, in value
    assert v.shape == self._shape, \
AssertionError: The value updated is not the same shape as the original. Updated: (4608, 4096), original: (2304, 4096)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 291, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 268, in build_model
    model = load_model(rank_config, ckpt_dir, model_cls)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1041, in load_model
    model.load(weights)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 403, in load
    raise RuntimeError(
RuntimeError: Encounter error 'The value updated is not the same shape as the original. Updated: (4608, 4096), original: (2304, 4096)' for parameter 'transformer.layers.0.attention.qkv.weight'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 347, in parallel_build
    future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
RuntimeError: Encounter error 'The value updated is not the same shape as the original. Updated: (4608, 4096), original: (2304, 4096)' for parameter 'transformer.layers.0.attention.qkv.weight'
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 440, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 351, in parallel_build
    assert len(exceptions
AssertionError: Engine building failed, please check error log.
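For anyone reading the two shapes in the error, here is a rough sketch of where 4608 and 2304 could come from. It assumes the publicly documented chatglm2-6b configuration (hidden size 4096, 32 attention heads, 2 multi-query KV groups), which is not stated in this thread, and the helper name `qkv_out_features` is made up for illustration; it is not a TensorRT-LLM function. Under those assumptions, 4608 matches the full fused QKV weight and 2304 matches a single tensor-parallel shard for tp_size=2, which would suggest the checkpoint holds an unsplit weight. That reading is only a guess from the numbers, not a confirmed root cause.

```python
def qkv_out_features(hidden_size: int, num_heads: int, num_kv_heads: int,
                     tp_size: int = 1) -> int:
    """Rows of a fused QKV weight held by one tensor-parallel rank.

    Hypothetical helper for illustration only. It assumes query heads and
    KV heads are both split evenly across ranks.
    """
    if num_heads % tp_size != 0 or num_kv_heads % tp_size != 0:
        raise ValueError("heads must be divisible by tp_size in this sketch")
    head_dim = hidden_size // num_heads
    q_rows = (num_heads // tp_size) * head_dim
    kv_rows = (num_kv_heads // tp_size) * head_dim
    # One Q block plus one K block and one V block.
    return q_rows + 2 * kv_rows


# Assumed chatglm2-6b values: hidden_size=4096, 32 heads, 2 KV groups.
full_rows = qkv_out_features(4096, 32, 2, tp_size=1)
shard_rows = qkv_out_features(4096, 32, 2, tp_size=2)
print(full_rows, shard_rows)
```

If the assumed configuration is right, the first value corresponds to the "Updated" shape in the assertion and the second to the "original" shape that the tp_size=2 model expects per rank.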

Additional notes

none

Tracin commented 5 months ago

If you are not intentionally running SmoothQuant (SQ) in per-tensor mode, please try adding --per_token --per_channel.

NaNAGISaSA commented 5 months ago

@Tracin hi, I just want to run SQ in per-tensor mode. Will there be any fixes in the near future?

Tracin commented 5 months ago

> @Tracin hi, I just want to run SQ in per-tensor mode. Will there be any fixes in the near future?

Sure, we are working on that and will come back soon.

pfk-beta commented 4 months ago

I had the same problem with 0.9.0 (without TP). Are there any updates or fixes in 0.10 or 0.11?