NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
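As a quick illustration of that Python API, here is a minimal sketch using the high-level LLM class available in recent TensorRT-LLM releases (note: the 0.7.x workflow discussed in this issue uses per-model convert/build scripts instead, and the model path below is a placeholder):

    # Minimal sketch of the high-level API in recent TensorRT-LLM releases.
    # The model path is a placeholder; point it at a local or HF checkpoint.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="./models/yi")      # builds/loads a TensorRT engine under the hood
    params = SamplingParams(max_tokens=64)

    for output in llm.generate(["Hello, TensorRT-LLM!"], params):
        print(output.outputs[0].text)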

Tensor shape mismatch when doing smoothquant for Yi-34B #899

Closed · xikaluo closed this issue 7 months ago

xikaluo commented 7 months ago

System Info

- GPU: 4 × RTX 3090 (24 GB)
- TensorRT-LLM version: 0.7.1, built from the source released last week
- TensorRT version: 9.2.0.post12.dev5
- NVIDIA driver version: 535.54.03
- CUDA version: 12.2
- OS: Ubuntu 20.04

Who can help?

No response

Reproduction

  1. Download the code from 'TensorRT-LLM/blob/main/examples/llama'
  2. Run the following script to apply SmoothQuant:
    python3 hf_llama_convert.py \
    --in-file ./models/yi \
    --out-dir ./engines/yi/sq05 \
    --smoothquant 0.5 \
    --tensor-parallelism 1 \
    --storage-type fp16 \
    --processes 1 
  3. Run this script to build the engine:
    python build.py \
    --bin_model_dir ./engines/yi/sq05/1-gpu \
    --output_dir ./engines/yi/fp16/4-gpu \
    --max_input_len 8192 \
    --max_output_len 512 \
    --max_batch_size 1 \
    --max_beam_width 1 \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --remove_input_padding \
    --enable_context_fmha_fp32 \
    --use_gemm_plugin float16 \
    --world_size 4 \
    --pp_size 4 \
    --use_smooth_quant \
    --per_channel \
    --per_token

Expected behavior

The engine should build successfully.

Actual behavior

Got this error message:

Traceback (most recent call last):
  File "/data/projects/tensorRT-LLM-test/build_engine/llama/build.py", line 983, in <module>
    build(0, args)
  File "/data/projects/tensorRT-LLM-test/build_engine/llama/build.py", line 927, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/data/projects/tensorRT-LLM-test/build_engine/llama/build.py", line 777, in build_rank_engine
    load_from_binary(tensorrt_llm_llama,
  File "/data/miniconda3/envs/TensorRT-LLM-cu122-cpp/lib/python3.10/site-packages/tensorrt_llm/models/llama/weight.py", line 1019, in load_from_binary
    t = fromfile(
  File "/data/miniconda3/envs/TensorRT-LLM-cu122-cpp/lib/python3.10/site-packages/tensorrt_llm/models/llama/weight.py", line 901, in fromfile
    t = t.reshape(shape)
ValueError: cannot reshape array of size 66060288 into shape (7168,8960)

Additional notes

I found that this error occurred while loading the file ./engines/yi/sq05/1-gpu/model.layers.0.attention.query_key_value.weight.int8.col.0.bin

In addition, running SmoothQuant with either TP=4 or TP=1 produces this error.
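For reference, the element count in the error matches Yi-34B's grouped-query attention layout, assuming its published config (hidden_size 7168, 56 query heads, 8 KV heads, so head_dim 128); the loader appears to compute a smaller fused QKV width than what the converter wrote:

    # Sanity check of the mismatch, assuming Yi-34B's config:
    # hidden_size 7168, 56 query heads, 8 KV heads -> head_dim 128.
    n_embd, n_head, n_kv_head = 7168, 56, 8
    head_dim = n_embd // n_head                     # 128

    qkv_cols = n_embd + 2 * n_kv_head * head_dim    # fused Q + K + V width = 9216
    print(n_embd * qkv_cols)                        # 66060288, the size of the array on disk
    print(n_embd * 8960)                            # 64225280, what the loader expected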

littletomatodonkey commented 7 months ago

I hit the same issue; you might need to change this line:

https://github.com/NVIDIA/TensorRT-LLM/blob/c89653021e66ca78c55f02b366f404455bc12e8d/tensorrt_llm/models/llama/weight.py#L1011

to

    (n_embd // n_groups) // mapping.tp_size * 2)
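As a quick numeric check of that change (the names n_embd, n_groups, and mapping.tp_size follow the linked weight.py; the values below assume Yi-34B with TP=1, with n_groups taken to be the number of query heads per KV head):

    # Hypothetical check that the proposed expression recovers the KV width
    # actually written by the converter; values assume Yi-34B with TP=1.
    n_embd, tp_size = 7168, 1
    n_groups = 56 // 8                               # 7 query heads per KV head
    kv_width = (n_embd // n_groups) // tp_size * 2   # 2048 = 2 * 8 KV heads * 128

    assert n_embd * (n_embd + kv_width) == 66060288  # matches the .bin element count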
Hukongtao commented 7 months ago


Thank you, this solved my problem.