jFkd1 opened this issue 1 year ago
Yes. TensorRT-LLM only supports models quantized by AMMO.
@byshiue For GPTQ it seems we are not supposed to use AMMO, though: https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0/examples/llama#gptq
Is this expected to change in the future?
@jFkd1 I can get AutoGPTQ to work for me using world_size=1, but I see the same issue when using world_size=4, tp_size=4. Are you using the same? If so, the fact that the dimension is off by a factor of 4 (the world size) seems suspicious. I am hitting a similar error: Updated: (1280, 1920), original: (5120, 1920), which is also off by a factor of 4.
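For context, a minimal sketch (my own illustration, not TensorRT-LLM code) of why an off-by-world_size dimension points at tensor parallelism: each rank keeps only a 1/tp_size slice along the split axis, so 5120 / 4 = 1280.

```python
# Illustration only (not TensorRT-LLM code): under tensor parallelism each rank
# keeps a 1/tp_size slice of the split dimension, so a shape mismatch by exactly
# tp_size usually means a tensor was split (or not split) for the rank.
import torch

tp_size = 4
full = torch.zeros(5120, 1920)                    # full checkpoint tensor
per_rank = torch.chunk(full, tp_size, dim=0)[0]   # slice a single rank should hold

print(full.shape)      # torch.Size([5120, 1920])
print(per_rank.shape)  # torch.Size([1280, 1920]) -> 5120 / 4 = 1280, as in the error above
```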
@eycheung @jFkd1 Please try adding -1 as the third parameter of this function call: code. If that works, I will push an MR soon.
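In case it helps to picture what the -1 does (a generic illustration, not the actual signature of the linked call): a -1 lets the size of that axis be inferred from the element count instead of being hard-coded.

```python
# Generic illustration of a -1 dimension; not the real call in
# tensorrt_llm/models/llama/weight.py that the link above points to.
import torch

out_features = 1920
num_groups = 15                                  # arbitrary for the example
scales = torch.zeros(num_groups * out_features)  # flat per-group scales

explicit = scales.reshape(num_groups, out_features)  # needs the group count up front
inferred = scales.reshape(-1, out_features)          # group count inferred from numel
assert explicit.shape == inferred.shape == (num_groups, out_features)
```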
The example recommends using gptq-for-llama, but the author of gptq-for-llama recommends using auto_gptq.
Actually quantizing with AMMO would be a huge time sink for existing users of QLoRA (you'd have to quantize to int4 with AMMO and relearn QLoRA, and the examples aren't good enough).
Are there any plans to support auto_gptq?
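In case it's useful to others landing here, quantizing with auto_gptq itself is short; below is a rough sketch following auto_gptq's documented API (the model path, output path, and the single calibration sentence are placeholders, not from this thread).

```python
# Rough sketch of producing a GPTQ checkpoint with auto_gptq; the paths and
# the calibration sample are placeholders, not from this issue.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

pretrained_dir = "meta-llama/Llama-2-13b-chat-hf"   # placeholder model path
quantized_dir = "llama-2-13b-chat-4bit-gs128"       # placeholder output path

tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, use_fast=True)
# Calibration samples: a real run should use a few hundred representative texts.
examples = [tokenizer("TensorRT-LLM builds engines from quantized checkpoints.")]

quantize_config = BaseQuantizeConfig(
    bits=4,          # int4 weights
    group_size=128,  # one scale row per 128 input channels
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_dir, use_safetensors=True)
```

The resulting safetensors file should be the kind of groupwise GPTQ checkpoint the llama example's GPTQ path expects.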
It raised an error like this...
[01/29/2024-05:05:25] [TRT-LLM] [I] Loading weights from groupwise GPTQ LLaMA safetensors...
[01/29/2024-05:05:26] [TRT-LLM] [I] Process weights in layer: 0
Traceback (most recent call last):
  File "/workspace/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py", line 1092, in <module>
    build(0, args)
  File "/workspace/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py", line 1037, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/workspace/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py", line 912, in build_rank_engine
    tensorrt_llm_llama = get_model_object(args,
  File "/workspace/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py", line 786, in get_model_object
    load_from_gptq_llama(tensorrt_llm_llama=tensorrt_llm_llama,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 1305, in load_from_gptq_llama
    process_and_assign_weight(layer.attention.qkv, qkv_weight_list)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 1260, in process_and_assign_weight
    mOp.weights_scaling_factor.value = scales_fp16.cpu().numpy()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 115, in value
    assert v.shape == self._shape, \
AssertionError: The value updated is not the same shape as the original. Updated: (1, 15360), original: (40, 15360)
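For what it's worth, the numbers in that assertion line up with Llama-2-13B dimensions if one assumes hidden size 5120, a fused QKV projection, and GPTQ group size 128; a quick check:

```python
# Quick arithmetic check (assuming Llama-2-13B: hidden_size=5120, no GQA,
# GPTQ group_size=128); not taken from the TensorRT-LLM source.
hidden_size = 5120
group_size = 128

qkv_out = 3 * hidden_size               # 15360 columns for the fused Q/K/V
num_groups = hidden_size // group_size  # 40 rows of per-group scales

print((num_groups, qkv_out))            # (40, 15360) -- the expected ("original") shape
# An "Updated" shape of (1, 15360) would match scales with a single group,
# e.g. a checkpoint quantized with group_size=-1 (no grouping).
```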
Qwen is supported with auto-gptq... how do I do that for LLaMA v2?
Do you still have any further issue or question? If not, we'll close it soon.
I received the following error when trying to compile an engine for an AutoGPTQ-quantized Llama-2-13b-chat.
Does this mean AutoGPTQ-quantized models are not currently supported, and we would have to use AMMO?