NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Error Building Engine for Phi-3 Model: Shape Mismatch in TensorRT-LLM v0.10.0 #1916

Closed. BugsBuggy closed this issue 2 months ago.

BugsBuggy commented 3 months ago

Reproduction

I'm using the following image: 24.06-trtllm-python-py3

To convert the model I used this command:

python /data/TensorRT-LLM/examples/phi/convert_checkpoint.py \
    --model_dir /data/Phi-3-medium-4k-instruct \
    --output_dir /data/phi-converted \
    --dtype float16
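
For what it's worth, one way to see what the conversion produced is to inspect the config.json that convert_checkpoint.py writes next to the weights. This is only a diagnostic sketch; the exact key names (hidden_size, num_attention_heads, num_key_value_heads) are an assumption about the converted checkpoint format and may differ between TensorRT-LLM versions:

import json

# Path written by the convert_checkpoint.py command above.
with open("/data/phi-converted/config.json") as f:
    cfg = json.load(f)

# Key names are assumed; adjust if your converted config uses different fields.
for key in ("hidden_size", "num_attention_heads", "num_key_value_heads"):
    print(key, cfg.get(key))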

When building the engine for the Phi-3-medium-4k-instruct model following the example README (https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/phi/README.md), I get the error below (I removed --max_seq_len because it was not recognized as an argument):

trtllm-build \
    --checkpoint_dir /data/phi-converted \
    --output_dir /data/phi-engine \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --tp_size 1 \
    --pp_size 1

[TensorRT-LLM] TensorRT-LLM version: 0.10.0
[07/08/2024-14:53:33] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set gemm_plugin to float16.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set nccl_plugin to float16.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set lookup_plugin to None.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set lora_plugin to None.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set moe_plugin to float16.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set context_fmha to True.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set paged_kv_cache to True.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set remove_input_padding to True.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set multi_block_mode to False.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set enable_xqa to True.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set tokens_per_block to 64.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set multiple_profiles to False.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set paged_state to True.
[07/08/2024-14:53:33] [TRT-LLM] [I] Set streamingllm to False.
[07/08/2024-14:53:33] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[07/08/2024-14:53:33] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 452, in load
    param.value = weights[name]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 126, in value
    assert v.shape == self.shape, \
AssertionError: The value updated is not the same shape as the original. Updated: (7680, 5120), original: (15360, 5120)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 496, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 377, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 336, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 308, in build_model
    model = load_model(rank_config, ckpt_dir, model_cls)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 1160, in load_model
    model.load(weights, from_pruned=is_checkpoint_pruned)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 454, in load
    raise RuntimeError(
RuntimeError: Encounter error 'The value updated is not the same shape as the original. Updated: (7680, 5120), original: (15360, 5120)' for parameter 'transformer.layers.0.attention.qkv.weight'

Expected behavior

I would expect the engine to be built without any problems because I'm using the same commands as in the example.

Actual behavior

An AssertionError is raised because the shape of the converted QKV weight does not match the shape the model definition expects.

Additional notes

I also get a shape mismatch error when using the microsoft/Phi-3-medium-128k-instruct model; it is essentially the same problem.
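
For reference, the two shapes in the error look consistent with a fused QKV weight laid out for grouped-query attention versus one laid out for full multi-head attention. Assuming the usual Phi-3-medium configuration (hidden_size 5120, 40 attention heads, 10 KV heads, head_dim 128), a quick sanity check:

# Assumed Phi-3-medium configuration values (from the public model card).
hidden_size = 5120
num_heads = 40
num_kv_heads = 10                      # grouped-query attention
head_dim = hidden_size // num_heads    # 128

# Fused QKV rows with GQA: Q covers all heads, K and V only the KV heads.
gqa_rows = hidden_size + 2 * num_kv_heads * head_dim   # 5120 + 2560 = 7680
# Fused QKV rows if K and V were full-size (MHA layout).
mha_rows = 3 * hidden_size                              # 15360

print(gqa_rows, mha_rows)   # 7680 and 15360, matching 'Updated' and 'original' above

If that reading is right, the converted weights use the GQA layout while the built model expects MHA, which would point at a configuration or dependency-version mismatch in the conversion step rather than a corrupt checkpoint.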

QiJune commented 2 months ago

Hi @BugsBuggy, did you install the requirements.txt of the phi example? https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/phi/requirements.txt
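
If it helps, the example dependencies can be installed directly from the repo checkout; the path below follows the earlier commands and should be adjusted to your setup:

pip install -r /data/TensorRT-LLM/examples/phi/requirements.txt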

BugsBuggy commented 2 months ago

I installed the wrong requirements, thank you!