Open LIUKAI0815 opened 4 months ago
@LIUKAI0815 Thanks for the feedback. Could you tell me which model you are using? This path requires the official GPTQ-quantized checkpoints from HF.
I have the same issue with a quantized Mistral model: TheBloke/Mistral-7B-v0.1-AWQ
@jershi425 I'm using Qwen1.5-14B-Chat.
Has this problem been solved? I get the same error when using a quantized Mixtral model.
Hi @Mary-Sam, could you please share more details and logs for your issue so we can look into it?
Hi @nv-guomingz
I ran the following command to convert the quantized model:
python3 /tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir /model --output_dir /engine --load_model_on_cpu
I am using the latest version, tensorrt_llm==0.9.0.
My model has the following quantization configuration:
{
  "bits": 4,
  "group_size": 128,
  "modules_to_not_convert": [
    "gate"
  ],
  "quant_method": "awq",
  "version": "gemm",
  "zero_point": true
}
And I am getting the following error:
2024-06-03 12:56:17,367 utils.common INFO:[TensorRT-LLM] TensorRT-LLM version: 0.9.0
2024-06-03 12:56:17,367 utils.common INFO:We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.
2024-06-03 12:56:17,367 utils.common INFO:0.9.0
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.42it/s]
2024-06-03 12:56:17,367 utils.common INFO:Traceback (most recent call last):
2024-06-03 12:56:17,367 utils.common INFO: File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 446, in <module>
2024-06-03 12:56:17,367 utils.common INFO: main()
2024-06-03 12:56:17,367 utils.common INFO: File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 438, in main
2024-06-03 12:56:17,367 utils.common INFO: convert_and_save_hf(args)
2024-06-03 12:56:17,367 utils.common INFO: File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 375, in convert_and_save_hf
2024-06-03 12:56:17,367 utils.common INFO: execute(args.workers, [convert_and_save_rank] * world_size, args)
2024-06-03 12:56:17,367 utils.common INFO: File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 397, in execute
2024-06-03 12:56:17,367 utils.common INFO: f(args, rank)
2024-06-03 12:56:17,367 utils.common INFO: File "/tensorrt_llm/examples/llama/convert_checkpoint.py", line 362, in convert_and_save_rank
2024-06-03 12:56:17,367 utils.common INFO: llama = LLaMAForCausalLM.from_hugging_face(
2024-06-03 12:56:17,367 utils.common INFO: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 244, in from_hugging_face
2024-06-03 12:56:17,367 utils.common INFO: llama = convert.from_hugging_face(
2024-06-03 12:56:17,367 utils.common INFO: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1192, in from_hugging_face
2024-06-03 12:56:17,367 utils.common INFO: weights = load_weights_from_hf(config=config,
2024-06-03 12:56:17,367 utils.common INFO: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1296, in load_weights_from_hf
2024-06-03 12:56:17,367 utils.common INFO: weights = convert_hf_llama(
2024-06-03 12:56:17,367 utils.common INFO: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 964, in convert_hf_llama
2024-06-03 12:56:17,367 utils.common INFO: convert_layer(l)
2024-06-03 12:56:17,367 utils.common INFO: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 646, in convert_layer
2024-06-03 12:56:17,367 utils.common INFO: q_weight = get_weight(model_params, prefix + 'self_attn.q_proj', dtype)
2024-06-03 12:56:17,367 utils.common INFO: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 399, in get_weight
2024-06-03 12:56:17,367 utils.common INFO: if config[prefix + '.weight'].dtype != dtype:
2024-06-03 12:56:17,367 utils.common INFO:KeyError: 'model.layers.0.self_attn.q_proj.weight'
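The KeyError above suggests the converter looked up a plain `.weight` tensor, whereas AWQ and GPTQ checkpoints usually store packed `qweight`, `qzeros`, and `scales` tensors instead. Below is a minimal sketch for checking which naming a checkpoint uses. `classify_projection` is a hypothetical helper, the suffix list is an assumption based on the common AutoAWQ/AutoGPTQ layout, and none of it is part of TensorRT-LLM:

```python
# Suffixes assumed from the common AutoAWQ/AutoGPTQ layout (not from TensorRT-LLM).
QUANT_SUFFIXES = ("qweight", "qzeros", "scales")


def classify_projection(tensor_names, prefix):
    """Report how the tensors under `prefix` are stored.

    Returns "quantized" if packed tensors are present, "unquantized" if only
    a plain `.weight` is present, and "missing" otherwise.
    """
    # Collect the part of each tensor name that follows "<prefix>."
    suffixes = {
        name[len(prefix) + 1:]
        for name in tensor_names
        if name.startswith(prefix + ".")
    }
    if any(s in suffixes for s in QUANT_SUFFIXES):
        return "quantized"
    if "weight" in suffixes:
        return "unquantized"
    return "missing"


# Hypothetical tensor names for illustration; in practice they could be read
# from the checkpoint's state dict or safetensors index.
awq_like = [
    "model.layers.0.self_attn.q_proj.qweight",
    "model.layers.0.self_attn.q_proj.qzeros",
    "model.layers.0.self_attn.q_proj.scales",
]
fp16_like = ["model.layers.0.self_attn.q_proj.weight"]

prefix = "model.layers.0.self_attn.q_proj"
print(classify_projection(awq_like, prefix))
print(classify_projection(fp16_like, prefix))
```

If the result is "quantized" while the converter asks for `.weight`, the conversion path was probably chosen for an unquantized checkpoint. The opposite mismatch would produce a KeyError on `.qweight`.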
python3 convert_checkpoint.py --model_dir /workspace/lk/model/Qwen/14B --output_dir ./tllm_checkpoint_1gpu_gptq --dtype float16 --use_weight_only --weight_only_precision int4_gptq --per_group
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024042300
0.10.0.dev2024042300
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.45it/s]
[04/30/2024-10:16:11] Some parameters are on the meta device device because they were offloaded to the cpu.
loading weight in each layer...: 0%| | 0/40 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/workspace/lk/model/tensorRT/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 365, in
main()
File "/workspace/lk/model/tensorRT/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 357, in main
convert_and_save_hf(args)
File "/workspace/lk/model/tensorRT/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 319, in convert_and_save_hf
execute(args.workers, [convert_and_save_rank] * world_size, args)
File "/workspace/lk/model/tensorRT/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 325, in execute
f(args, rank)
File "/workspace/lk/model/tensorRT/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 305, in convert_and_save_rank
qwen = from_hugging_face(
File "/opt/conda/envs/tensorRT/lib/python3.10/site-packages/tensorrt_llm/models/qwen/convert.py", line 1081, in from_hugging_face
weights = load_from_gptq_qwen(
File "/opt/conda/envs/tensorRT/lib/python3.10/site-packages/tensorrt_llm/models/qwen/weight.py", line 158, in load_from_gptq_qwen
comp_part = model_params[prefix + key_list[0] + comp + suf]
KeyError: 'model.layers.0.self_attn.q_proj.qweight'
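Both tracebacks in this thread look like a mismatch between the quantization format the converter expects and the one the checkpoint actually contains. Here is a rough sketch of a pre-flight check that could be run before picking conversion flags. `quant_method` and `check_expected` are hypothetical helpers. The sketch assumes the quantization settings are either the top-level JSON object (like the configuration posted above) or nested under `quantization_config`, as in a Hugging Face `config.json`:

```python
import json


def quant_method(config_text):
    """Return the declared quantization method, or None if none is declared."""
    config = json.loads(config_text)
    # Assumption: settings are top-level or nested under "quantization_config".
    quant = config.get("quantization_config", config)
    return quant.get("quant_method")


def check_expected(config_text, expected):
    """Compare the declared method with what the conversion flags expect."""
    found = quant_method(config_text)
    if found != expected:
        return f"expected {expected!r} checkpoint, found {found!r}"
    return "ok"


# The AWQ configuration quoted earlier in this thread.
awq_config = """
{
  "bits": 4,
  "group_size": 128,
  "modules_to_not_convert": ["gate"],
  "quant_method": "awq",
  "version": "gemm",
  "zero_point": true
}
"""

# Hypothetical config for an unquantized checkpoint (no quantization section).
plain_config = '{"model_type": "qwen2"}'

print(check_expected(awq_config, "gptq"))
print(check_expected(plain_config, "gptq"))
print(check_expected(awq_config, "awq"))
```

A check like this only reads metadata. It does not confirm that the converter supports the declared method.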