NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

KeyError: 'model.layers.0.self_attn.q_proj.qweight' #1528

Open · LIUKAI0815 opened this issue 4 months ago

LIUKAI0815 commented 4 months ago

python3 convert_checkpoint.py --model_dir /workspace/lk/model/Qwen/14B --output_dir ./tllm_checkpoint_1gpu_gptq --dtype float16 --use_weight_only --weight_only_precision int4_gptq --per_group

[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024042300
0.10.0.dev2024042300
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.45it/s]
[04/30/2024-10:16:11] Some parameters are on the meta device device because they were offloaded to the cpu.
loading weight in each layer...: 0%| | 0/40 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/lk/model/tensorRT/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 365, in <module>
    main()
  File "/workspace/lk/model/tensorRT/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 357, in main
    convert_and_save_hf(args)
  File "/workspace/lk/model/tensorRT/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 319, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/workspace/lk/model/tensorRT/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 325, in execute
    f(args, rank)
  File "/workspace/lk/model/tensorRT/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 305, in convert_and_save_rank
    qwen = from_hugging_face(
  File "/opt/conda/envs/tensorRT/lib/python3.10/site-packages/tensorrt_llm/models/qwen/convert.py", line 1081, in from_hugging_face
    weights = load_from_gptq_qwen(
  File "/opt/conda/envs/tensorRT/lib/python3.10/site-packages/tensorrt_llm/models/qwen/weight.py", line 158, in load_from_gptq_qwen
    comp_part = model_params[prefix + key_list[0] + comp + suf]
KeyError: 'model.layers.0.self_attn.q_proj.qweight'
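The failing lookup in load_from_gptq_qwen indexes the GPTQ-packed tensor names (qweight, plus qzeros and scales), which only exist in a checkpoint that was actually quantized with GPTQ; a plain fp16 checkpoint stores model.layers.0.self_attn.q_proj.weight instead. A quick way to see which layout a checkpoint uses is to list the keys of one shard, along these lines (a sketch; the shard filename is illustrative):

```python
# Sketch: list the attention tensors in one checkpoint shard to see whether
# the GPTQ names (qweight/qzeros/scales) or the plain fp16 name (weight)
# are present. Assumes safetensors shards; the path below is illustrative.
from safetensors import safe_open

with safe_open("model-00001-of-00008.safetensors", framework="pt") as f:
    for name in f.keys():
        if "layers.0.self_attn.q_proj" in name:
            print(name)

# A GPTQ checkpoint prints e.g.:
#   model.layers.0.self_attn.q_proj.qweight
#   model.layers.0.self_attn.q_proj.qzeros
#   model.layers.0.self_attn.q_proj.scales
# An unquantized fp16 checkpoint prints only:
#   model.layers.0.self_attn.q_proj.weight
```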

jershi425 commented 4 months ago

@LIUKAI0815 Thanks for the feedback. Could you kindly tell me which model you are using? This path requires the official GPTQ-quantized checkpoints from HF.
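A related sanity check, sketched under the assumption that the checkpoint was produced with AutoGPTQ: such checkpoints carry a quantization_config block in config.json whose quant_method is "gptq", and if that block is absent the int4_gptq conversion path has no qweight tensors to read.

```python
# Sketch: verify the checkpoint advertises GPTQ quantization before running
# convert_checkpoint.py with --weight_only_precision int4_gptq.
# The path below is the one from the report above.
import json

with open("/workspace/lk/model/Qwen/14B/config.json") as f:
    cfg = json.load(f)

qcfg = cfg.get("quantization_config")
if qcfg is None:
    print("no quantization_config: this is not a quantized checkpoint")
else:
    print("quant_method:", qcfg.get("quant_method"))  # expect 'gptq' here
```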

RoslinAdama commented 4 months ago

I have the same issue using a quantized Mistral model: TheBloke/Mistral-7B-v0.1-AWQ.

LIUKAI0815 commented 4 months ago

@jershi425 I'm using Qwen1.5-14B-Chat.
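For context: the plain Qwen1.5-14B-Chat release is fp16 and has no qweight tensors, which matches the missing key in the traceback above. The int4_gptq path expects a GPTQ-quantized release, for example Qwen/Qwen1.5-14B-Chat-GPTQ-Int4 on Hugging Face (assuming that is the intended variant), downloaded locally and passed as --model_dir.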

Mary-Sam commented 3 months ago

Has this problem been solved? I get the same error when using a quantized Mixtral model.

nv-guomingz commented 3 months ago

> Has this problem been solved? I get the same error when using a quantized Mixtral model.

Hi @Mary-Sam, could you please share more details and the full log for your issue so we can look into it?

Mary-Sam commented 3 months ago

Hi @nv-guomingz
I run the following command for the quantized model:

python3 /tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir /model --output_dir /engine --load_model_on_cpu

I am using the latest version, tensorrt_llm==0.9.0.

My model has the following quantization configuration:

{
  "bits": 4,
  "group_size": 128,
  "modules_to_not_convert": [
    "gate"
  ],
  "quant_method": "awq",
  "version": "gemm",
  "zero_point": true
}
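That config is the crux: quant_method "awq" with version "gemm" means each linear layer is stored as packed tensors (qweight, qzeros, scales) rather than a plain fp16 weight, while the llama conversion path in the traceback below (convert_hf_llama -> get_weight) looks up prefix + '.weight' directly. A minimal illustration of the mismatch (a sketch of the key layout, not the actual loader):

```python
# Sketch: the AWQ (gemm) checkpoint layout vs. the key the llama converter
# builds. The lookup on the last line raises exactly the KeyError in the log.
awq_params = {
    "model.layers.0.self_attn.q_proj.qweight": ...,  # int32-packed 4-bit weights
    "model.layers.0.self_attn.q_proj.qzeros": ...,   # packed zero points
    "model.layers.0.self_attn.q_proj.scales": ...,   # per-group fp16 scales
}

awq_params["model.layers.0.self_attn.q_proj.weight"]  # KeyError, as in the log
```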

And I am getting the following error:

2024-06-03 12:56:17,367 utils.common INFO:[TensorRT-LLM] TensorRT-LLM version: 0.9.0
2024-06-03 12:56:17,367 utils.common INFO:We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.
2024-06-03 12:56:17,367 utils.common INFO:0.9.0
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.42it/s]
2024-06-03 12:56:17,367 utils.common INFO:Traceback (most recent call last):
2024-06-03 12:56:17,367 utils.common INFO:  File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 446, in <module>
2024-06-03 12:56:17,367 utils.common INFO:    main()
2024-06-03 12:56:17,367 utils.common INFO:  File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 438, in main
2024-06-03 12:56:17,367 utils.common INFO:    convert_and_save_hf(args)
2024-06-03 12:56:17,367 utils.common INFO:  File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 375, in convert_and_save_hf
2024-06-03 12:56:17,367 utils.common INFO:    execute(args.workers, [convert_and_save_rank] * world_size, args)
2024-06-03 12:56:17,367 utils.common INFO:  File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 397, in execute
2024-06-03 12:56:17,367 utils.common INFO:    f(args, rank)
2024-06-03 12:56:17,367 utils.common INFO:  File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 362, in convert_and_save_rank
2024-06-03 12:56:17,367 utils.common INFO:    llama = LLaMAForCausalLM.from_hugging_face(
2024-06-03 12:56:17,367 utils.common INFO:  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 244, in from_hugging_face
2024-06-03 12:56:17,367 utils.common INFO:    llama = convert.from_hugging_face(
2024-06-03 12:56:17,367 utils.common INFO:  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1192, in from_hugging_face
2024-06-03 12:56:17,367 utils.common INFO:    weights = load_weights_from_hf(config=config,
2024-06-03 12:56:17,367 utils.common INFO:  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1296, in load_weights_from_hf
2024-06-03 12:56:17,367 utils.common INFO:    weights = convert_hf_llama(
2024-06-03 12:56:17,367 utils.common INFO:  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 964, in convert_hf_llama
2024-06-03 12:56:17,367 utils.common INFO:    convert_layer(l)
2024-06-03 12:56:17,367 utils.common INFO:  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 646, in convert_layer
2024-06-03 12:56:17,367 utils.common INFO:    q_weight = get_weight(model_params, prefix + 'self_attn.q_proj', dtype)
2024-06-03 12:56:17,367 utils.common INFO:  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 399, in get_weight
2024-06-03 12:56:17,367 utils.common INFO:    if config[prefix + '.weight'].dtype != dtype:
2024-06-03 12:56:17,367 utils.common INFO:KeyError: 'model.layers.0.self_attn.q_proj.weight'
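For anyone hitting this: in these versions, convert_checkpoint.py's --use_weight_only flow quantizes an fp16 checkpoint itself, and the GPTQ loader only reads official GPTQ checkpoints; pre-quantized HF AWQ checkpoints do not appear to be an accepted input to this script. The AWQ route the TensorRT-LLM examples document instead starts from the original fp16 model and quantizes it during conversion, roughly along these lines (a hedged sketch; paths, block size, and plugin dtype are illustrative):

python3 examples/quantization/quantize.py --model_dir /path/to/fp16-model --dtype float16 --qformat int4_awq --awq_block_size 128 --output_dir /path/to/trtllm-ckpt
trtllm-build --checkpoint_dir /path/to/trtllm-ckpt --output_dir /engine --gemm_plugin float16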