NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
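As a quick illustration of that Python API, here is a minimal sketch using the high-level LLM class available in recent TensorRT-LLM releases (note: the 0.7.x workflow discussed in this issue uses per-model convert/build scripts instead, and the model path below is a placeholder):

    # Minimal sketch of the high-level API in recent TensorRT-LLM releases.
    # The model path is a placeholder; point it at a local or HF checkpoint.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="./models/yi")      # builds/loads a TensorRT engine under the hood
    params = SamplingParams(max_tokens=64)

    for output in llm.generate(["Hello, TensorRT-LLM!"], params):
        print(output.outputs[0].text)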

Tensor shape mismatch when doing smoothquant for Yi-34B #899

Closed · xikaluo closed this issue 7 months ago

xikaluo commented 7 months ago

System Info

- GPU: 4 × RTX 3090 (24 GB)
- TensorRT-LLM version: 0.7.1, built from the source released last week
- TensorRT version: 9.2.0.post12.dev5
- NVIDIA driver version: 535.54.03
- CUDA version: 12.2
- OS: Ubuntu 20.04

Who can help?

No response

Reproduction

  1. Download the code from 'TensorRT-LLM/blob/main/examples/llama'
  2. Run the following script to apply SmoothQuant:
    python3 hf_llama_convert.py \
    --in-file ./models/yi \
    --out-dir ./engines/yi/sq05 \
    --smoothquant 0.5 \
    --tensor-parallelism 1 \
    --storage-type fp16 \
    --processes 1 
  3. Run this script to build the engine:
    python build.py \
    --bin_model_dir ./engines/yi/sq05/1-gpu \
    --output_dir ./engines/yi/fp16/4-gpu \
    --max_input_len 8192 \
    --max_output_len 512 \
    --max_batch_size 1 \
    --max_beam_width 1 \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --remove_input_padding \
    --enable_context_fmha_fp32 \
    --use_gemm_plugin float16 \
    --world_size 4 \
    --pp_size 4 \
    --use_smooth_quant \
    --per_channel \
    --per_token

Expected behavior

The engine should build successfully.

Actual behavior

Got this error message:

Traceback (most recent call last):
  File "/data/projects/tensorRT-LLM-test/build_engine/llama/build.py", line 983, in <module>
    build(0, args)
  File "/data/projects/tensorRT-LLM-test/build_engine/llama/build.py", line 927, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/data/projects/tensorRT-LLM-test/build_engine/llama/build.py", line 777, in build_rank_engine
    load_from_binary(tensorrt_llm_llama,
  File "/data/miniconda3/envs/TensorRT-LLM-cu122-cpp/lib/python3.10/site-packages/tensorrt_llm/models/llama/weight.py", line 1019, in load_from_binary
    t = fromfile(
  File "/data/miniconda3/envs/TensorRT-LLM-cu122-cpp/lib/python3.10/site-packages/tensorrt_llm/models/llama/weight.py", line 901, in fromfile
    t = t.reshape(shape)
ValueError: cannot reshape array of size 66060288 into shape (7168,8960)

Additional notes

I found that this error occurred while loading the file ./engines/yi/sq05/1-gpu/model.layers.0.attention.query_key_value.weight.int8.col.0.bin

In addition, running SmoothQuant with either TP=4 or TP=1 produces this error.
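For reference, the element count in the error matches Yi-34B's grouped-query attention layout, assuming its published config (hidden_size 7168, 56 query heads, 8 KV heads, so head_dim 128); the loader appears to compute a smaller fused QKV width than what the converter wrote:

    # Sanity check of the mismatch, assuming Yi-34B's config:
    # hidden_size 7168, 56 query heads, 8 KV heads -> head_dim 128.
    n_embd, n_head, n_kv_head = 7168, 56, 8
    head_dim = n_embd // n_head                     # 128

    qkv_cols = n_embd + 2 * n_kv_head * head_dim    # fused Q + K + V width = 9216
    print(n_embd * qkv_cols)                        # 66060288, the size of the array on disk
    print(n_embd * 8960)                            # 64225280, what the loader expected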

littletomatodonkey commented 7 months ago

I hit the same issue; you might need to change this line:

https://github.com/NVIDIA/TensorRT-LLM/blob/c89653021e66ca78c55f02b366f404455bc12e8d/tensorrt_llm/models/llama/weight.py#L1011

to

    (n_embd // n_groups) // mapping.tp_size * 2)
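As a quick numeric check of that change (the names n_embd, n_groups, and mapping.tp_size follow the linked weight.py; the values below assume Yi-34B with TP=1, with n_groups taken to be the number of query heads per KV head):

    # Hypothetical check that the proposed expression recovers the KV width
    # actually written by the converter; values assume Yi-34B with TP=1.
    n_embd, tp_size = 7168, 1
    n_groups = 56 // 8                               # 7 query heads per KV head
    kv_width = (n_embd // n_groups) // tp_size * 2   # 2048 = 2 * 8 KV heads * 128

    assert n_embd * (n_embd + kv_width) == 66060288  # matches the .bin element count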
Hukongtao commented 7 months ago


Thank you, this solved my problem.